TY - JOUR
T1 - Parallelization Strategies for DLRM Embedding Bag Operator on AMD CPUs
AU - Nair, Krishnakumar
AU - Pandey, Avinash Chandra
AU - Karabannavar, Siddappa
AU - Arunachalam, Meena
AU - Kalamatianos, John
AU - Agrawal, Varun
AU - Gupta, Saurabh
AU - Sirasao, Ashish
AU - Delaye, Elliott
AU - Reinhardt, Steve
AU - Vivekanandham, Rajesh
AU - Wittig, Ralph
AU - Kathail, Vinod
AU - Gopalakrishnan, Padmini
AU - Pareek, Satyaprakash
AU - Jain, Rishabh
AU - Kandemir, Mahmut Taylan
AU - Lin, Jun Liang
AU - Akbulut, Gulsum Gudukbay
AU - Das, Chita R.
N1 - Publisher Copyright:
© 1981-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Deep learning recommendation models (DLRMs) are deployed extensively to support personalized recommendations and consume a large fraction of artificial intelligence (AI) cycles in modern datacenters, with the embedding stage being a critical component. Modern CPUs execute a large share of DLRM cycles because they are cost-effective compared to GPUs and other accelerators. Our paper addresses key bottlenecks in accelerating the embedding stage on CPUs. Specifically, this work 1) explores novel threading schemes that parallelize the embedding bag operator, 2) pushes the envelope on realized bandwidth by improving data reuse in caches, and 3) studies the impact of parallelization on load imbalance. The new embedding bag kernels have been prototyped in the ZenDNN software stack. Put together, our work on fourth-generation EPYC processors achieves up to a 9.9x improvement in embedding bag performance over state-of-the-art implementations and achieves realized bandwidth of up to 5.7x the DDR bandwidth.
AB - Deep learning recommendation models (DLRMs) are deployed extensively to support personalized recommendations and consume a large fraction of artificial intelligence (AI) cycles in modern datacenters, with the embedding stage being a critical component. Modern CPUs execute a large share of DLRM cycles because they are cost-effective compared to GPUs and other accelerators. Our paper addresses key bottlenecks in accelerating the embedding stage on CPUs. Specifically, this work 1) explores novel threading schemes that parallelize the embedding bag operator, 2) pushes the envelope on realized bandwidth by improving data reuse in caches, and 3) studies the impact of parallelization on load imbalance. The new embedding bag kernels have been prototyped in the ZenDNN software stack. Put together, our work on fourth-generation EPYC processors achieves up to a 9.9x improvement in embedding bag performance over state-of-the-art implementations and achieves realized bandwidth of up to 5.7x the DDR bandwidth.
UR - http://www.scopus.com/inward/record.url?scp=85200810872&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85200810872&partnerID=8YFLogxK
U2 - 10.1109/MM.2024.3423785
DO - 10.1109/MM.2024.3423785
M3 - Article
AN - SCOPUS:85200810872
SN - 0272-1732
VL - 44
SP - 44
EP - 51
JO - IEEE Micro
JF - IEEE Micro
IS - 6
ER -