TY - GEN
T1 - Load and MLP-Aware Thread Orchestration for Recommendation Systems Inference on CPUs
AU - Jain, Rishabh
AU - Chou, Teyuh
AU - Kayiran, Onur
AU - Kalamatianos, John
AU - Loh, Gabriel H.
AU - Kandemir, Mahmut T.
AU - Das, Chita R.
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/3/30
Y1 - 2025/3/30
N2 - Recommendation models can enhance consumer experiences and are one of the most frequently used machine learning models in data centers. The deep learning recommendation model (DLRM) is one such key workload. While DLRMs are often trained using GPUs, CPUs can be a cost-effective solution for inference. Therefore, optimizing DLRM inference for CPUs is an important research problem with significant business value. In this work, we identify several shortcomings of existing DLRM parallelization techniques, including load imbalance across CPU chiplets, suboptimal core allocation for embedding tables, and inefficient utilization of memory-level parallelism (MLP) resources. We propose a novel thread scheduler, called "Balance," that addresses those shortcomings by (1) minimizing core allocation per embedding table to maximize core utilization, (2) using MLP-aware task scheduling based on the characteristics of the embedding tables to better utilize memory bandwidth, and (3) combining work stealing and table reordering mechanisms to reduce load imbalance across CPU chiplets. We evaluate Balance on real hardware with production DLRM traces and demonstrate up to a 1.67× speedup over prior state-of-the-art DLRM parallelization techniques with 96 cores. Further, Balance consistently achieves 1.22× higher performance across a range of batch sizes.
AB - Recommendation models can enhance consumer experiences and are one of the most frequently used machine learning models in data centers. The deep learning recommendation model (DLRM) is one such key workload. While DLRMs are often trained using GPUs, CPUs can be a cost-effective solution for inference. Therefore, optimizing DLRM inference for CPUs is an important research problem with significant business value. In this work, we identify several shortcomings of existing DLRM parallelization techniques, including load imbalance across CPU chiplets, suboptimal core allocation for embedding tables, and inefficient utilization of memory-level parallelism (MLP) resources. We propose a novel thread scheduler, called "Balance," that addresses those shortcomings by (1) minimizing core allocation per embedding table to maximize core utilization, (2) using MLP-aware task scheduling based on the characteristics of the embedding tables to better utilize memory bandwidth, and (3) combining work stealing and table reordering mechanisms to reduce load imbalance across CPU chiplets. We evaluate Balance on real hardware with production DLRM traces and demonstrate up to a 1.67× speedup over prior state-of-the-art DLRM parallelization techniques with 96 cores. Further, Balance consistently achieves 1.22× higher performance across a range of batch sizes.
UR - https://www.scopus.com/pages/publications/105002573741
UR - https://www.scopus.com/pages/publications/105002573741#tab=citedBy
U2 - 10.1145/3676641.3716003
DO - 10.1145/3676641.3716003
M3 - Conference contribution
AN - SCOPUS:105002573741
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 589
EP - 603
BT - ASPLOS 2025 - Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
PB - Association for Computing Machinery
T2 - 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025
Y2 - 30 March 2025 through 3 April 2025
ER -