GPU-based Private Information Retrieval for On-Device Machine Learning Inference

Maximilian Lam, Jeff Johnson, Wenjie Xiong, Kiwan Maeng, Udit Gupta, Yang Li, Liangzhen Lai, Ilias Leontiadis, Minsoo Rhu, Hsien-Hsin S. Lee, Vijay Janapa Reddi, Gu-Yeon Wei, David Brooks, Edward Suh

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing it to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to store on-device. In particular, recommendation models typically use multiple embedding tables, each on the order of 1-10 GB, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. Because off-the-shelf PIR algorithms are usually too computationally intensive to use directly for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our PIR-ML co-design provides an additional 5× throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second, a more than 100× throughput improvement over a CPU-based baseline, while maintaining model accuracy.
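The abstract describes the protocol only at a high level. For intuition, the snippet below is a minimal sketch of classic two-server PIR via XOR secret sharing, one textbook construction this line of work builds on; it is an illustration, not the authors' actual scheme, and the table sizes and function names are hypothetical. It omits the compressed query encodings and GPU kernels a real deployment would need.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding table: n rows of d-dimensional integer embeddings.
n, d = 1024, 64
table = rng.integers(0, 2**16, size=(n, d), dtype=np.uint64)

def client_query(index, n):
    """Split a one-hot selector for `index` into two XOR shares.
    Each share alone is a uniformly random bit vector, so neither
    server learns which embedding row is being fetched."""
    share_a = rng.integers(0, 2, size=n, dtype=np.uint8)
    share_b = share_a.copy()
    share_b[index] ^= 1  # shares XOR to the one-hot vector e_index
    return share_a, share_b

def server_answer(table, share):
    """XOR together the table rows selected by the share bits."""
    selected = table[share.astype(bool)]
    if len(selected) == 0:
        return np.zeros(table.shape[1], dtype=table.dtype)
    return np.bitwise_xor.reduce(selected, axis=0)

# Client reconstructs: the two answers XOR to the requested row.
idx = 42
qa, qb = client_query(idx, n)
row = server_answer(table, qa) ^ server_answer(table, qb)
assert np.array_equal(row, table[idx])

Each server's answer is a large masked reduction over the whole table, a batched, memory-bound workload of the kind the paper maps onto GPUs to reach its reported throughput.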

Original language: English (US)
Title of host publication: Spring Cycle
Publisher: Association for Computing Machinery
Pages: 197-214
Number of pages: 18
ISBN (Electronic): 9798400703720
DOIs
State: Published - Apr 27, 2024
Event: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024 - San Diego, United States
Duration: Apr 27, 2024 - May 1, 2024

Publication series

Name: International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
Volume: 1

Conference

Conference: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024
Country/Territory: United States
City: San Diego
Period: 4/27/24 - 5/1/24

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Hardware and Architecture
