TY - GEN
T1 - Co-optimizing memory-level parallelism and cache-level parallelism
AU - Tang, Xulong
AU - Karakoy, Mustafa
AU - Kandemir, Mahmut Taylan
AU - Arunachalam, Meenakshi
N1 - Funding Information:
The authors thank the PLDI reviewers for their constructive feedback, and Jennifer B. Sartor for shepherding this paper. This research is supported in part by NSF grants #1526750, #1763681, #1439057, #1439021, #1629129, #1409095, #1626251, #1629915, and a grant from Intel.
Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/6/8
Y1 - 2019/6/8
N2 - Minimizing cache misses has been the traditional goal in optimizing cache performance using compiler-based techniques. However, continuously increasing dataset sizes, combined with large numbers of cache banks and memory banks connected using on-chip networks in emerging manycores/accelerators, make cache hit/miss latency optimization as important as cache miss rate minimization. In this paper, we propose compiler support that optimizes both the latencies of last-level cache (LLC) hits and the latencies of LLC misses. Our approach tries to achieve this goal by improving the parallelism exhibited by LLC hits and LLC misses. More specifically, it tries to maximize both cache-level parallelism (CLP) and memory-level parallelism (MLP). This paper presents different incarnations of our approach, and evaluates them using a set of 12 multithreaded applications. Our results indicate that (i) optimizing MLP first and CLP later brings, on average, 11.31% performance improvement over an approach that already minimizes the number of LLC misses, and (ii) optimizing CLP first and MLP later brings 9.43% performance improvement. In comparison, balancing MLP and CLP brings 17.32% performance improvement on average.
AB - Minimizing cache misses has been the traditional goal in optimizing cache performance using compiler-based techniques. However, continuously increasing dataset sizes, combined with large numbers of cache banks and memory banks connected using on-chip networks in emerging manycores/accelerators, make cache hit/miss latency optimization as important as cache miss rate minimization. In this paper, we propose compiler support that optimizes both the latencies of last-level cache (LLC) hits and the latencies of LLC misses. Our approach tries to achieve this goal by improving the parallelism exhibited by LLC hits and LLC misses. More specifically, it tries to maximize both cache-level parallelism (CLP) and memory-level parallelism (MLP). This paper presents different incarnations of our approach, and evaluates them using a set of 12 multithreaded applications. Our results indicate that (i) optimizing MLP first and CLP later brings, on average, 11.31% performance improvement over an approach that already minimizes the number of LLC misses, and (ii) optimizing CLP first and MLP later brings 9.43% performance improvement. In comparison, balancing MLP and CLP brings 17.32% performance improvement on average.
UR - http://www.scopus.com/inward/record.url?scp=85067638402&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85067638402&partnerID=8YFLogxK
U2 - 10.1145/3314221.3314599
DO - 10.1145/3314221.3314599
M3 - Conference contribution
AN - SCOPUS:85067638402
T3 - Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)
SP - 935
EP - 949
BT - PLDI 2019 - Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation
A2 - McKinley, Kathryn S.
A2 - Fisher, Kathleen
PB - Association for Computing Machinery
T2 - 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019
Y2 - 22 June 2019 through 26 June 2019
ER -