TY - GEN
T1 - GPU-based Private Information Retrieval for On-Device Machine Learning Inference
AU - Lam, Maximilian
AU - Johnson, Jeff
AU - Xiong, Wenjie
AU - Maeng, Kiwan
AU - Gupta, Udit
AU - Li, Yang
AU - Lai, Liangzhen
AU - Leontiadis, Ilias
AU - Rhu, Minsoo
AU - Lee, Hsien-Hsin S.
AU - Reddi, Vijay Janapa
AU - Wei, Gu-Yeon
AU - Brooks, David
AU - Suh, G. Edward
N1 - Publisher Copyright:
© 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/4/27
Y1 - 2024/4/27
N2 - On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing them to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables, each on the order of 1-10 GB of data, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our PIR-ML co-design provides an over 5× additional throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second, a >100× throughput improvement over a CPU-based baseline, while maintaining model accuracy.
AB - On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing them to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables, each on the order of 1-10 GB of data, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our PIR-ML co-design provides an over 5× additional throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second, a >100× throughput improvement over a CPU-based baseline, while maintaining model accuracy.
UR - http://www.scopus.com/inward/record.url?scp=85191416892&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85191416892&partnerID=8YFLogxK
U2 - 10.1145/3617232.3624855
DO - 10.1145/3617232.3624855
M3 - Conference contribution
AN - SCOPUS:85191416892
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 197
EP - 214
BT - Spring Cycle
PB - Association for Computing Machinery
T2 - 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024
Y2 - 27 April 2024 through 1 May 2024
ER -