TY - GEN
T1 - Enhancing computation-to-core assignment with physical location information
AU - Kislal, Orhan
AU - Kotra, Jagadish
AU - Tang, Xulong
AU - Kandemir, Mahmut Taylan
AU - Jung, Myoungsoo
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/6/11
Y1 - 2018/6/11
N2 - Going beyond a certain number of cores in modern architectures requires an on-chip network more scalable than conventional buses. However, employing an on-chip network in a manycore system (to improve scalability) makes the latencies of the data accesses issued by a core non-uniform. This non-uniformity can play a significant role in shaping overall application performance. This work presents a novel compiler strategy that exposes architecture information to the compiler to enable an optimized computation-to-core mapping. Specifically, we propose a compiler-guided scheme that takes into account the relative positions of (and distances between) cores, last-level caches (LLCs), and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing on-chip network traffic. The experimental data collected using a set of 21 multi-threaded applications reveal that, on average, our approach reduces on-chip network latency in a 6×6 manycore system by 38.4% in the case of private LLCs and by 43.8% in the case of shared LLCs. These improvements translate into execution time improvements of 10.9% and 12.7% for the private LLC and shared LLC based systems, respectively.
AB - Going beyond a certain number of cores in modern architectures requires an on-chip network more scalable than conventional buses. However, employing an on-chip network in a manycore system (to improve scalability) makes the latencies of the data accesses issued by a core non-uniform. This non-uniformity can play a significant role in shaping overall application performance. This work presents a novel compiler strategy that exposes architecture information to the compiler to enable an optimized computation-to-core mapping. Specifically, we propose a compiler-guided scheme that takes into account the relative positions of (and distances between) cores, last-level caches (LLCs), and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing on-chip network traffic. The experimental data collected using a set of 21 multi-threaded applications reveal that, on average, our approach reduces on-chip network latency in a 6×6 manycore system by 38.4% in the case of private LLCs and by 43.8% in the case of shared LLCs. These improvements translate into execution time improvements of 10.9% and 12.7% for the private LLC and shared LLC based systems, respectively.
UR - https://www.scopus.com/pages/publications/85049567574
UR - https://www.scopus.com/inward/citedby.url?scp=85049567574&partnerID=8YFLogxK
U2 - 10.1145/3192366.3192386
DO - 10.1145/3192366.3192386
M3 - Conference contribution
AN - SCOPUS:85049567574
T3 - Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)
SP - 312
EP - 327
BT - PLDI 2018 - Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation
A2 - Foster, Jeffrey S.
A2 - Grossman, Dan
PB - Association for Computing Machinery
T2 - 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018
Y2 - 18 June 2018 through 22 June 2018
ER -