TY - GEN
T1 - Optimizing CPU Performance for Recommendation Systems At-Scale
AU - Jain, Rishabh
AU - Sanghavi, Vrushabh
AU - Maeng, Kiwan
AU - Cheng, Scott
AU - Kaul, Samvit
AU - Jog, Adwait
AU - Kalagi, Vishwas
AU - Arunachalam, Meena
AU - Sivasubramaniam, Anand
AU - Kandemir, Mahmut T.
AU - Das, Chita R.
N1 - Publisher Copyright:
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2023/6/17
AB - Deep Learning Recommendation Models (DLRMs) are very popular in personalized recommendation systems and are a major contributor to data-center AI cycles. Due to the high computational and memory bandwidth needs of DLRMs, specifically in the embedding stage of DLRM inference, both CPUs and GPUs are used to host such workloads. This is primarily because the heavy irregular memory accesses in the embedding stage lead to significant stalls in the CPU pipeline. As model and parameter sizes keep increasing with newer recommendation models, the computational dominance of the embedding stage also grows, thereby bringing into question the suitability of CPUs for inference. In this paper, we first quantify the cause of irregular accesses and their impact on caches, and observe that off-chip memory access is the main contributor to high latency. Therefore, we exploit two well-known techniques: (1) software prefetching, to hide the memory access latency suffered by demand loads, and (2) overlapping computation and memory accesses via hyperthreading, to reduce CPU stalls and minimize overall execution time. We evaluate our work on single-core and 24-core configurations with the latest recommendation models and recently released production traces. Our integrated techniques speed up inference by up to 1.59x, and by 1.4x on average.
UR - http://www.scopus.com/inward/record.url?scp=85168865613&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85168865613&partnerID=8YFLogxK
DO - 10.1145/3579371.3589112
M3 - Conference contribution
AN - SCOPUS:85168865613
T3 - Proceedings - International Symposium on Computer Architecture
SP - 1078
EP - 1092
BT - ISCA 2023 - Proceedings of the 50th Annual International Symposium on Computer Architecture
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 50th Annual International Symposium on Computer Architecture, ISCA 2023
Y2 - 17 June 2023 through 21 June 2023
ER -