TY - GEN
T1 - Compiler Support for Optimizing Memory Bank-Level Parallelism
AU - Ding, Wei
AU - Guttman, Diana
AU - Kandemir, Mahmut
PY - 2015/1/15
Y1 - 2015/1/15
N2 - Many prior compiler-based optimization schemes focused exclusively on cache data locality. However, cache locality is only one part of the overall performance of applications running on emerging multicore or manycore systems. For example, memory stalls can constitute a very large fraction of execution time even in cache-optimized codes, and one of the main reasons for this is a lack of memory-level parallelism. Motivated by this, we propose a compiler-based Bank-Level Parallelism (BLP) optimization scheme that uses loop tile scheduling. More specifically, we first use Cache Miss Equations to predict where last-level cache misses will occur in each tile, and then identify the set of memory banks that will be accessed in each tile. Using this information, we propose two tile scheduling algorithms to maximize BLP, each targeting a different scenario. We further discuss how our compiler-based scheme can be enhanced to consider memory controller-level parallelism and row-buffer locality. Our experimental evaluation using 11 multithreaded applications shows that the proposed BLP optimization improves BLP by 17.1% on average, resulting in a 9.2% reduction in average memory access latency. Furthermore, considering memory controller-level parallelism and row-buffer locality (in addition to BLP) raises our average improvement in memory access latency to 22.2%.
AB - Many prior compiler-based optimization schemes focused exclusively on cache data locality. However, cache locality is only one part of the overall performance of applications running on emerging multicore or manycore systems. For example, memory stalls can constitute a very large fraction of execution time even in cache-optimized codes, and one of the main reasons for this is a lack of memory-level parallelism. Motivated by this, we propose a compiler-based Bank-Level Parallelism (BLP) optimization scheme that uses loop tile scheduling. More specifically, we first use Cache Miss Equations to predict where last-level cache misses will occur in each tile, and then identify the set of memory banks that will be accessed in each tile. Using this information, we propose two tile scheduling algorithms to maximize BLP, each targeting a different scenario. We further discuss how our compiler-based scheme can be enhanced to consider memory controller-level parallelism and row-buffer locality. Our experimental evaluation using 11 multithreaded applications shows that the proposed BLP optimization improves BLP by 17.1% on average, resulting in a 9.2% reduction in average memory access latency. Furthermore, considering memory controller-level parallelism and row-buffer locality (in addition to BLP) raises our average improvement in memory access latency to 22.2%.
UR - http://www.scopus.com/inward/record.url?scp=84937713823&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84937713823&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2014.34
DO - 10.1109/MICRO.2014.34
M3 - Conference contribution
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 571
EP - 582
BT - Proceedings - 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014
PB - IEEE Computer Society
T2 - 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014
Y2 - 13 December 2014 through 17 December 2014
ER -