TY - GEN
T1 - PIFS-Rec
T2 - 57th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2024
AU - Huo, Pingyi
AU - Devulapally, Anusha
AU - Maruf, Hasan Al
AU - Park, Minseo
AU - Nair, Krishnakumar
AU - Arunachalam, Meena
AU - Akbulut, Gulsum Gudukbay
AU - Kandemir, Mahmut Taylan
AU - Narayanan, Vijaykrishnan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Deep Learning Recommendation Models (DLRMs) have become increasingly popular and prevalent in today's datacenters, consuming most of the AI inference cycles. The performance of DLRMs is heavily influenced by available bandwidth due to their large vector sizes in embedding tables and concurrent accesses. To achieve substantial improvements over existing solutions, novel approaches towards DLRM optimization are needed, especially in the context of emerging interconnect technologies like CXL. This study explores CXL-enabled systems, implementing a process-in-fabric-switch (PIFS) solution to accelerate DLRMs while optimizing their memory and bandwidth scalability. We present an in-depth characterization of industry-scale DLRM workloads running on CXL-ready systems, identifying the predominant bottlenecks in existing CXL systems. We, therefore, propose PIFS-Rec, a PIFS-based scheme that implements near-data processing through the downstream ports of the fabric switch. PIFS-Rec achieves a latency that is 3.89x lower than Pond, an industry-standard CXL-based system, and also outperforms BEACON, a state-of-the-art scheme, by 2.03x.
AB - Deep Learning Recommendation Models (DLRMs) have become increasingly popular and prevalent in today's datacenters, consuming most of the AI inference cycles. The performance of DLRMs is heavily influenced by available bandwidth due to their large vector sizes in embedding tables and concurrent accesses. To achieve substantial improvements over existing solutions, novel approaches towards DLRM optimization are needed, especially in the context of emerging interconnect technologies like CXL. This study explores CXL-enabled systems, implementing a process-in-fabric-switch (PIFS) solution to accelerate DLRMs while optimizing their memory and bandwidth scalability. We present an in-depth characterization of industry-scale DLRM workloads running on CXL-ready systems, identifying the predominant bottlenecks in existing CXL systems. We, therefore, propose PIFS-Rec, a PIFS-based scheme that implements near-data processing through the downstream ports of the fabric switch. PIFS-Rec achieves a latency that is 3.89x lower than Pond, an industry-standard CXL-based system, and also outperforms BEACON, a state-of-the-art scheme, by 2.03x.
UR - https://www.scopus.com/pages/publications/85213324460
UR - https://www.scopus.com/pages/publications/85213324460#tab=citedBy
U2 - 10.1109/MICRO61859.2024.00052
DO - 10.1109/MICRO61859.2024.00052
M3 - Conference contribution
AN - SCOPUS:85213324460
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 612
EP - 626
BT - Proceedings - 2024 57th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2024
PB - IEEE Computer Society
Y2 - 2 November 2024 through 6 November 2024
ER -