TY - GEN
T1 - Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
AU - Jain, Rishabh
AU - Bhasi, Vivek M.
AU - Jog, Adwait
AU - Sivasubramaniam, Anand
AU - Kandemir, Mahmut
AU - Das, Chitaranjan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are being increasingly preferred for executing DLRM inference. However, serving newer DLRMs while meeting acceptable latencies remains challenging, making traditional deployments increasingly GPU-hungry and resulting in higher inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading to up to a 3.2x embedding-only performance slowdown. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the presence of low occupancy in the standard embedding kernels. By leveraging direct compiler optimizations, we achieve optimal occupancy, pushing the performance by up to 53%. Yet, long memory latency stalls continue to exist. To tackle this challenge, we propose specialized plug-and-play software prefetching and L2 pinning techniques, which help in hiding and decreasing the latencies. Further, we propose combining them, as they complement each other. Experimental evaluations using A100 GPUs with large models and datasets show that our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
AB - Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are being increasingly preferred for executing DLRM inference. However, serving newer DLRMs while meeting acceptable latencies remains challenging, making traditional deployments increasingly GPU-hungry and resulting in higher inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading to up to a 3.2x embedding-only performance slowdown. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the presence of low occupancy in the standard embedding kernels. By leveraging direct compiler optimizations, we achieve optimal occupancy, pushing the performance by up to 53%. Yet, long memory latency stalls continue to exist. To tackle this challenge, we propose specialized plug-and-play software prefetching and L2 pinning techniques, which help in hiding and decreasing the latencies. Further, we propose combining them, as they complement each other. Experimental evaluations using A100 GPUs with large models and datasets show that our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
UR - http://www.scopus.com/inward/record.url?scp=85213329816&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85213329816&partnerID=8YFLogxK
U2 - 10.1109/MICRO61859.2024.00091
DO - 10.1109/MICRO61859.2024.00091
M3 - Conference contribution
AN - SCOPUS:85213329816
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 1217
EP - 1232
BT - Proceedings - 2024 57th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2024
PB - IEEE Computer Society
T2 - 57th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2024
Y2 - 2 November 2024 through 6 November 2024
ER -