TY - JOUR
T1 - Optimization of Intercache Traffic Entanglement in Tagless Caches with Tiling Opportunities
AU - Swamy Saranam Chongala, S. R.
AU - George, Sumitha
AU - Govindarajan, Hariram Thirucherai
AU - Kotra, Jagadish
AU - Mutyam, Madhu
AU - Sampson, John
AU - Kandemir, Mahmut T.
AU - Narayanan, Vijaykrishnan
N1 - Funding Information:
He is a Researcher with AMD Research, Austin, TX, USA. His research interests are in the areas of computer architecture, operating systems, and hardware–software co-design. At AMD, he is working on designing and optimizing exascale systems as part of the Pathforward project funded by the Department of Energy.
Funding Information:
Manuscript received April 18, 2020; revised June 12, 2020; accepted July 6, 2020. Date of publication October 2, 2020; date of current version October 27, 2020. This work was supported in part by the Semiconductor Research Corporation JUMP Center for Research in Intelligent Storage and Processing in Memory; and in part by NSF under Grant 1763681, Grant 1629129, Grant 1931531, Grant 1629915, and Grant 1908793. This article was presented in the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems 2020 and appears as part of the ESWEEK-TCAD special issue. (Corresponding author: Sumitha George.) S. R. Swamy Saranam Chongala and Madhu Mutyam are with the Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai 600036, India (e-mail: srswamy@cse.iitm.ac.in; madhu@cse.iitm.ac.in).
Publisher Copyright:
© 1982-2012 IEEE.
PY - 2020/11
Y1 - 2020/11
N2 - So-called 'tagless' caches have become common as a means to deal with the vast L4 last-level caches (LLCs) enabled by increasing device density, emerging memory technologies, and advanced integration capabilities (e.g., 3-D). Tagless schemes often result in intercache entanglement between tagless cache (L4) and the cache (L3) stewarding its metadata. We explore new cache organization policies that mitigate overheads stemming from the intercache-level replacement entanglement. We incorporate support for explicit tiling shapes that can better match software access patterns to improve the spatial and temporal locality of large block allocations in many essential computational kernels. To address entanglement overheads and pathologies, we propose new replacement policies and energy-friendly mechanisms for tagless LLCs, such as restricted block caching (RBC) and victim tag buffer caching (VBC) to incorporate L4 eviction costs into L3 replacement decisions efficiently. We evaluate our schemes on a range of linear algebra kernels that are software tiled. RBC and VBC demonstrate a reduction in memory traffic of 83/4.4/67% and 69/35.5/76% for 8/32/64 MB L4s, respectively. Besides, RBC and VBC provide speedups of 16/0.3/0.6% and 15.7/1.8/0.8%, respectively, for systems with 8/32/64 MB L4, over a tagless cache with an LRU policy in the L3. We also show that matching the shape of the hardware allocation for each tagless region superblocks to the access order of the software tile improves latency by 13.4% over the baseline tagless cache with reductions in memory traffic of 51% over linear superblocks.
AB - So-called 'tagless' caches have become common as a means to deal with the vast L4 last-level caches (LLCs) enabled by increasing device density, emerging memory technologies, and advanced integration capabilities (e.g., 3-D). Tagless schemes often result in intercache entanglement between tagless cache (L4) and the cache (L3) stewarding its metadata. We explore new cache organization policies that mitigate overheads stemming from the intercache-level replacement entanglement. We incorporate support for explicit tiling shapes that can better match software access patterns to improve the spatial and temporal locality of large block allocations in many essential computational kernels. To address entanglement overheads and pathologies, we propose new replacement policies and energy-friendly mechanisms for tagless LLCs, such as restricted block caching (RBC) and victim tag buffer caching (VBC) to incorporate L4 eviction costs into L3 replacement decisions efficiently. We evaluate our schemes on a range of linear algebra kernels that are software tiled. RBC and VBC demonstrate a reduction in memory traffic of 83/4.4/67% and 69/35.5/76% for 8/32/64 MB L4s, respectively. Besides, RBC and VBC provide speedups of 16/0.3/0.6% and 15.7/1.8/0.8%, respectively, for systems with 8/32/64 MB L4, over a tagless cache with an LRU policy in the L3. We also show that matching the shape of the hardware allocation for each tagless region superblocks to the access order of the software tile improves latency by 13.4% over the baseline tagless cache with reductions in memory traffic of 51% over linear superblocks.
UR - http://www.scopus.com/inward/record.url?scp=85096034247&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096034247&partnerID=8YFLogxK
U2 - 10.1109/TCAD.2020.3012789
DO - 10.1109/TCAD.2020.3012789
M3 - Article
AN - SCOPUS:85096034247
SN - 0278-0070
VL - 39
SP - 3881
EP - 3892
JO - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
JF - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
IS - 11
M1 - 9211458
ER -