Load and MLP-Aware Thread Orchestration for Recommendation Systems Inference on CPUs

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Recommendation models can enhance consumer experiences and are one of the most frequently used machine learning models in data centers. The deep learning recommendation model (DLRM) is one such key workload. While DLRMs are often trained using GPUs, CPUs can be a cost-effective solution for inference. Therefore, optimizing DLRM inference for CPUs is an important research problem with significant business value. In this work, we identify several shortcomings of existing DLRM parallelization techniques, which can include load imbalance across CPU chiplets, suboptimal core allocation for embedding tables, and inefficient utilization of memory- level parallelism (MLP) resources. We propose a novel thread scheduler, called ''Balance,'' that addresses those shortcomings by (1) minimizing core allocation per embedding table to maximize core utilization, (2) using MLP-aware task scheduling based on the characteristics of the embedding tables to better utilize memory bandwidth, and (3) combining work stealing and table reordering mechanisms to reduce load imbalance across CPU chiplets. We evaluate Balance on real hardware with production DLRM traces and demonstrate up to a 1.67× higher speedup over prior state-of-the-art DLRM parallelization techniques with 96 cores. Further, Balance consistently achieves 1.22× higher performance over a range of batch sizes.

Original languageEnglish (US)
Title of host publicationASPLOS 2025 - Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
PublisherAssociation for Computing Machinery
Pages589-603
Number of pages15
ISBN (Electronic)9798400710797
DOIs
StatePublished - Mar 30 2025
Event30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025 - Rotterdam, Netherlands
Duration: Mar 30 2025Apr 3 2025

Publication series

NameInternational Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
Volume2

Conference

Conference30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025
Country/TerritoryNetherlands
CityRotterdam
Period3/30/254/3/25

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Hardware and Architecture

Fingerprint

Dive into the research topics of 'Load and MLP-Aware Thread Orchestration for Recommendation Systems Inference on CPUs'. Together they form a unique fingerprint.

Cite this