TY - GEN
T1 - RDIP
T2 - 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2013
AU - Kolli, Aasheesh
AU - Saidi, Ali
AU - Wenisch, Thomas F.
PY - 2013
Y1 - 2013
N2 - L1 instruction fetch misses remain a critical performance bottleneck, causing slowdowns of up to 40% in server applications. Whereas instruction footprints typically fit within last-level caches, they overwhelm L1 caches, whose capacity is limited by latency constraints. Past work has shown that server application instruction miss sequences are highly repetitive. By recording, indexing, and prefetching according to these sequences, nearly all L1 instruction misses can be eliminated. However, existing schemes require impractical storage and considerable complexity to correct for minor control-flow variations that disrupt sequences. In this work, we simplify and reduce the energy requirements of accurate instruction prefetching via two observations: (1) program context as captured in the call stack correlates strongly with L1 instruction misses, and (2) the return address stack (RAS), already present in all high-performance processors, succinctly summarizes program context. We propose RAS-Directed Instruction Prefetching (RDIP), which associates prefetch operations with signatures formed from the contents of the RAS. RDIP achieves 70% of the potential speedup of an ideal L1 cache, outperforms a prefetcherless baseline by 11.5%, and reduces energy and complexity relative to sequence-based prefetching. RDIP's performance is within 2% of the state-of-the-art Proactive Instruction Fetch, with a nearly 3X reduction in storage and a 1.9X reduction in energy overheads.
UR - http://www.scopus.com/inward/record.url?scp=84892524803&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84892524803&partnerID=8YFLogxK
U2 - 10.1145/2540708.2540731
DO - 10.1145/2540708.2540731
M3 - Conference contribution
AN - SCOPUS:84892524803
SN - 9781450326384
T3 - MICRO 2013 - Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
SP - 260
EP - 271
BT - MICRO 2013 - Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Y2 - 7 December 2013 through 11 December 2013
ER -