Parallelization Strategies for DLRM Embedding Bag Operator on AMD CPUs

Krishnakumar Nair, Avinash Chandra Pandey, Siddappa Karabannavar, Meena Arunachalam, John Kalamatianos, Varun Agrawal, Saurabh Gupta, Ashish Sirasao, Elliott Delaye, Steve Reinhardt, Rajesh Vivekanandham, Ralph Wittig, Vinod Kathail, Padmini Gopalakrishnan, Satyaprakash Pareek, Rishabh Jain, Mahmut Taylan Kandemir, Jun Liang Lin, Gulsum Gudukbay Akbulut, Chita R. Das

Research output: Contribution to journal › Article › peer-review

Abstract

Deep learning recommendation models (DLRMs) are deployed extensively to support personalized recommendations and consume a large fraction of artificial intelligence (AI) cycles in modern datacenters, with the embedding stage being a critical component. Modern CPUs execute many DLRM cycles because they are cost effective compared with GPUs and other accelerators. Our paper addresses key bottlenecks in accelerating the embedding stage on CPUs. Specifically, this work 1) explores novel threading schemes that parallelize the embedding bag operator, 2) pushes the envelope on realized bandwidth by improving data reuse in caches, and 3) studies the impact of parallelization on load imbalance. The new embedding bag kernels have been prototyped in the ZenDNN software stack. Put together, our work on fourth-generation EPYC processors achieves up to 9.9x improvement in embedding bag performance over state-of-the-art implementations and improves realized bandwidth by up to 5.7x over DDR bandwidth.
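For readers unfamiliar with the operator the abstract centers on: an embedding bag gathers rows of an embedding table selected by an index list and reduces (here, sums) each "bag" of rows into one output vector. A minimal sketch in Python, with a naive per-bag threading scheme of the kind the abstract alludes to (all names are illustrative; this is not the ZenDNN kernel):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def embedding_bag(table, indices, offsets):
    """Sum-reduce rows of `table` for each bag delimited by `offsets`."""
    ends = list(offsets[1:]) + [len(indices)]
    out = np.empty((len(offsets), table.shape[1]), dtype=table.dtype)
    for b, (start, end) in enumerate(zip(offsets, ends)):
        out[b] = table[indices[start:end]].sum(axis=0)
    return out

def embedding_bag_parallel(table, indices, offsets, n_threads=4):
    """Parallelize across bags: each thread reduces its own chunk of bags,
    so writes to `out` never overlap."""
    ends = list(offsets[1:]) + [len(indices)]
    out = np.empty((len(offsets), table.shape[1]), dtype=table.dtype)

    def worker(bag_ids):
        for b in bag_ids:
            out[b] = table[indices[offsets[b]:ends[b]]].sum(axis=0)

    chunks = np.array_split(range(len(offsets)), n_threads)
    with ThreadPoolExecutor(n_threads) as pool:
        list(pool.map(worker, chunks))
    return out

# Toy example: a 10-row table of dimension 4, three bags.
table = np.arange(40, dtype=np.float32).reshape(10, 4)
indices = np.array([0, 2, 5, 1, 3, 7, 9])
offsets = np.array([0, 3, 5])  # bag boundaries into `indices`
assert np.allclose(embedding_bag(table, indices, offsets),
                   embedding_bag_parallel(table, indices, offsets))
```

Splitting work by whole bags keeps each output row owned by one thread, but, as the abstract notes, skewed bag lengths then cause load imbalance, which is one of the effects the paper studies.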

Original language: English (US)
Pages (from-to): 44-51
Number of pages: 8
Journal: IEEE Micro
Volume: 44
Issue number: 6
DOIs
State: Published - 2024

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Electrical and Electronic Engineering
