TY - GEN
T1 - A data layout optimization framework for NUCA-based multicores
AU - Zhang, Yuanrui
AU - Ding, Wei
AU - Kandemir, Mahmut
AU - Liu, Jun
AU - Jang, Ohyoung
PY - 2011
Y1 - 2011
N2 - Future multicore architectures are likely to include a large number of cores connected using an on-chip network with Non-uniform Cache Access (NUCA). In such architectures, whether a data request is satisfied from a local cache or a remote cache can make an important difference. To exploit this NUCA property, prior research explored both architectural enhancements and compiler-based code optimization strategies. In this work, we take an alternative view and explore data layout optimizations to improve locality of data accesses in a NUCA-based system. Our proposed approach includes three steps: array tiling, computation-to-core mapping, and layout customization. The first of these tries to identify the affinity between data and computation, taking into account parallelization information, with the goal of minimizing remote accesses. The second step maps computations (and their associated data) to cores with the goal of minimizing average distance-to-data, and the last step further customizes the memory layout taking into account the data placement policy adopted by the underlying architecture. We evaluated the success of this three-step approach in enhancing on-chip cache behavior using all application programs from the SPECOMP suite on a full-system simulator. Our results show that the proposed approach improves data access latency and execution time by an average of 24.7% and 18.4%, respectively, in the case of static NUCA, and 18.1% and 12.7%, respectively, in the case of dynamic NUCA.
UR - http://www.scopus.com/inward/record.url?scp=84863354212&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84863354212&partnerID=8YFLogxK
U2 - 10.1145/2155620.2155677
DO - 10.1145/2155620.2155677
M3 - Conference contribution
AN - SCOPUS:84863354212
SN - 9781450310536
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 489
EP - 500
BT - MICRO 44 - Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
T2 - 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 44
Y2 - 4 December 2011 through 7 December 2011
ER -