TY - GEN
T1 - FUSE
T2 - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019
AU - Zhang, Jie
AU - Jung, Myoungsoo
AU - Kandemir, Mahmut
N1 - Funding Information:
This research is mainly supported by NRF 2016R1C1B2015312, DOE DEAC02-05CH11231, IITP-2018-2017-0-01015, NRF 2015M3C4A7065645, Yonsei Future Research Grant (2017-22-0105) and MemRay grant (2015-11-1731). M. Kandemir is supported in part by NSF grants 1822923, 1439021, 1629915, 1626251, 1629129, 1763681, 1526750 and 1439057. Myoungsoo Jung is the corresponding author who has ownership of this work.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/3/26
Y1 - 2019/3/26
N2 - In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE can minimize the number of outgoing memory accesses over the interconnection network of GPU multiprocessors, which in turn can considerably improve the level of massive computing parallelism in GPUs. Specifically, FUSE predicts the read level of GPU memory accesses by extracting GPU runtime information and places write-once-read-multiple (WORM) data blocks into the STT-MRAM, while accommodating write-multiple data blocks in a small portion of SRAM in the L1D cache. To further reduce off-chip memory accesses, FUSE also allows WORM data blocks to be allocated anywhere in the STT-MRAM by approximating full associativity with a limited number of tag comparators and I/O peripherals. Our evaluation results show that, in comparison to a traditional GPU cache, our proposed heterogeneous cache reduces the number of outgoing memory references across the interconnection network by 32%, thereby improving overall performance by 217% and reducing energy cost by 53%.
AB - In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE can minimize the number of outgoing memory accesses over the interconnection network of GPU multiprocessors, which in turn can considerably improve the level of massive computing parallelism in GPUs. Specifically, FUSE predicts the read level of GPU memory accesses by extracting GPU runtime information and places write-once-read-multiple (WORM) data blocks into the STT-MRAM, while accommodating write-multiple data blocks in a small portion of SRAM in the L1D cache. To further reduce off-chip memory accesses, FUSE also allows WORM data blocks to be allocated anywhere in the STT-MRAM by approximating full associativity with a limited number of tag comparators and I/O peripherals. Our evaluation results show that, in comparison to a traditional GPU cache, our proposed heterogeneous cache reduces the number of outgoing memory references across the interconnection network by 32%, thereby improving overall performance by 217% and reducing energy cost by 53%.
UR - http://www.scopus.com/inward/record.url?scp=85064223760&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064223760&partnerID=8YFLogxK
U2 - 10.1109/HPCA.2019.00055
DO - 10.1109/HPCA.2019.00055
M3 - Conference contribution
AN - SCOPUS:85064223760
T3 - Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019
SP - 426
EP - 439
BT - Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 February 2019 through 20 February 2019
ER -