TY - GEN
T1 - Optimizing off-chip accesses in multicores
AU - Ding, Wei
AU - Tang, Xulong
AU - Kandemir, Mahmut
AU - Zhang, Yuanrui
AU - Kultursay, Emre
PY - 2015/6/3
Y1 - 2015/6/3
N2 - In a network-on-chip (NoC) based manycore architecture, an off-chip data access (main memory access) needs to travel through the on-chip network, spending a considerable amount of time within the chip (in addition to the memory access latency). In addition, it contends with on-chip (cache) accesses, as both use the same NoC resources. In this paper, focusing on data-parallel, multithreaded applications, we propose a compiler-based off-chip data access localization strategy, which places data elements in the memory space such that an off-chip access traverses a minimum number of links (hops) to reach the memory controller that handles this access. This brings three main benefits. First, the network latency of off-chip accesses is reduced; second, the network latency of on-chip accesses is reduced; and finally, the memory latency of off-chip accesses improves, due to reduced queue latencies. We present an experimental evaluation of our optimization strategy using a set of 13 multithreaded application programs under both private and shared last-level caches. The collected results emphasize the importance of optimizing off-chip data accesses.
AB - In a network-on-chip (NoC) based manycore architecture, an off-chip data access (main memory access) needs to travel through the on-chip network, spending a considerable amount of time within the chip (in addition to the memory access latency). In addition, it contends with on-chip (cache) accesses, as both use the same NoC resources. In this paper, focusing on data-parallel, multithreaded applications, we propose a compiler-based off-chip data access localization strategy, which places data elements in the memory space such that an off-chip access traverses a minimum number of links (hops) to reach the memory controller that handles this access. This brings three main benefits. First, the network latency of off-chip accesses is reduced; second, the network latency of on-chip accesses is reduced; and finally, the memory latency of off-chip accesses improves, due to reduced queue latencies. We present an experimental evaluation of our optimization strategy using a set of 13 multithreaded application programs under both private and shared last-level caches. The collected results emphasize the importance of optimizing off-chip data accesses.
UR - http://www.scopus.com/inward/record.url?scp=84951827362&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84951827362&partnerID=8YFLogxK
U2 - 10.1145/2737924.2737989
DO - 10.1145/2737924.2737989
M3 - Conference contribution
AN - SCOPUS:84951827362
T3 - Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)
SP - 131
EP - 142
BT - PLDI 2015 - Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
A2 - Blackburn, Steve
A2 - Grove, David
PB - Association for Computing Machinery
T2 - 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2015
Y2 - 13 June 2015 through 17 June 2015
ER -