TY - GEN
T1 - GMT: GPU Orchestrated Memory Tiering for the Big Data Era
T2 - 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024
AU - Chang, Chia Hao
AU - Mailthody, Vikram Sharma
AU - Han, Jihoon
AU - Qureshi, Zaid
AU - Sivasubramaniam, Anand
AU - Hwu, Wen Mei
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/4/27
Y1 - 2024/4/27
N2 - As the demand for processing larger datasets increases, GPUs need to reach deeper into their (memory) hierarchy to directly access capacities that only storage systems (SSDs) can hold. However, state-of-the-art mechanisms to reach storage either employ software stacks running on the host CPUs as intermediaries (e.g., Dragon, HMM), which have been noted to perform poorly and to be unable to meet the throughput needs of GPU cores, or directly access SSDs through NVMe queues (BaM), which does not benefit from the lower latencies that may be possible by using host memory as an intermediate tier. This paper presents the design and implementation of GPU Memory Tiering (GMT), a GPU-orchestrated 3-tier hierarchy comprising GPU memory, host memory, and SSDs, in which the GPU orchestrates most of the transfers that are bandwidth/latency sensitive. Additionally, since pages should not be blindly transferred from GPU memory to host memory upon eviction, GMT employs a practical reuse-prediction-based insertion policy to perform discretionary page placement/bypass. An implementation and evaluation on an actual platform demonstrate that GMT performs 50% better than the state-of-the-art 2-tier strategy (BaM) and over 350% better than the state-of-the-art 3-tier strategy orchestrated by host CPUs (HMM), over a number of GPU applications with diverse memory access characteristics.
AB - As the demand for processing larger datasets increases, GPUs need to reach deeper into their (memory) hierarchy to directly access capacities that only storage systems (SSDs) can hold. However, state-of-the-art mechanisms to reach storage either employ software stacks running on the host CPUs as intermediaries (e.g., Dragon, HMM), which have been noted to perform poorly and to be unable to meet the throughput needs of GPU cores, or directly access SSDs through NVMe queues (BaM), which does not benefit from the lower latencies that may be possible by using host memory as an intermediate tier. This paper presents the design and implementation of GPU Memory Tiering (GMT), a GPU-orchestrated 3-tier hierarchy comprising GPU memory, host memory, and SSDs, in which the GPU orchestrates most of the transfers that are bandwidth/latency sensitive. Additionally, since pages should not be blindly transferred from GPU memory to host memory upon eviction, GMT employs a practical reuse-prediction-based insertion policy to perform discretionary page placement/bypass. An implementation and evaluation on an actual platform demonstrate that GMT performs 50% better than the state-of-the-art 2-tier strategy (BaM) and over 350% better than the state-of-the-art 3-tier strategy orchestrated by host CPUs (HMM), over a number of GPU applications with diverse memory access characteristics.
UR - http://www.scopus.com/inward/record.url?scp=85192182190&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85192182190&partnerID=8YFLogxK
U2 - 10.1145/3620666.3651353
DO - 10.1145/3620666.3651353
M3 - Conference contribution
AN - SCOPUS:85192182190
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 464
EP - 478
BT - Fall Cycle
PB - Association for Computing Machinery
Y2 - 27 April 2024 through 1 May 2024
ER -