Overflowing emerging neural network inference tasks from the GPU to the CPU on heterogeneous servers

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

While current deep learning (DL) inference runtime systems sequentially offload a model's tasks onto an available GPU/accelerator based on its capability, we make a case for selectively redirecting some of these tasks to the CPU and running them concurrently while the GPU does other work. This opportunity arises specifically for emerging DL models whose data flow graphs (DFGs) have much wider fan-outs than traditional ones, which are invariably linear chains of tasks. By opportunistically moving some of these tasks to the CPU, we can (i) shave service time off the critical path of the DFG, (ii) devote the GPU to more deserving tasks, and (iii) improve overall utilization of the provisioned hardware in the server. However, several factors determine the what/when/how of this redirection: a task's criticality in the DFG, its slowdown when moved to a different hardware engine, and the overhead of transferring input/output data across these engines. While solving this optimally is computationally demanding and slow, we derive, through a series of rationales, a fast technique for task overflow from the GPU to the CPU. We implement this technique in a nimble heterogeneous concurrent runtime engine built on top of the state-of-the-art ONNXRuntime engine and demonstrate >10% reduction in latency, >19% gain in throughput, and >9.8% savings in GPU memory usage for emerging neural network models.
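To make the overflow decision concrete, below is a minimal sketch in Python of the kind of cost comparison the abstract describes: overflow a task off the critical path only when its CPU service time plus data-transfer overhead beats waiting for a busy GPU. All names here (the Task fields, the cost model, and the overflow_to_cpu rule) are illustrative assumptions, not the paper's actual technique, which is derived through the authors' own rationales and implemented on top of ONNXRuntime.

```python
# Hypothetical sketch of a GPU-to-CPU overflow heuristic; the fields and
# decision rule are assumptions for illustration, not the paper's algorithm.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    gpu_time_ms: float      # estimated service time on the GPU
    cpu_slowdown: float     # factor by which the task slows down on the CPU
    transfer_ms: float      # cost of moving input/output data across engines
    on_critical_path: bool  # whether the task lies on the DFG's critical path

def overflow_to_cpu(task: Task, gpu_queue_delay_ms: float) -> bool:
    """Decide whether to run `task` on the CPU instead of the GPU.

    Only tasks off the critical path are candidates, and only when CPU
    service time plus transfer overhead beats queuing for the GPU.
    """
    if task.on_critical_path:
        return False
    cpu_cost = task.gpu_time_ms * task.cpu_slowdown + task.transfer_ms
    gpu_cost = task.gpu_time_ms + gpu_queue_delay_ms
    return cpu_cost < gpu_cost

# Example: a branch in a wide fan-out stage while the GPU is backed up.
branch = Task("fanout_branch_3", gpu_time_ms=2.0,
              cpu_slowdown=4.0, transfer_ms=0.5, on_critical_path=False)
print(overflow_to_cpu(branch, gpu_queue_delay_ms=12.0))  # True: CPU wins here
```

The point of the comparison is that a CPU slowdown factor alone does not disqualify a task; what matters is the end-to-end cost including transfer overhead versus the GPU's current queueing delay.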

Original language: English (US)
Title of host publication: SYSTOR 2022 - Proceedings of the 15th ACM International Conference on Systems and Storage Conference
Publisher: Association for Computing Machinery, Inc
Pages: 26-39
Number of pages: 14
ISBN (Electronic): 9781450393805
DOIs
State: Published - Jun 6 2022
Event: 15th ACM International Systems and Storage Conference, SYSTOR 2022 - Virtual, Online, Israel
Duration: Jun 13 2022 – Jun 15 2022

Publication series

Name: SYSTOR 2022 - Proceedings of the 15th ACM International Conference on Systems and Storage Conference

Conference

Conference: 15th ACM International Systems and Storage Conference, SYSTOR 2022
Country/Territory: Israel
City: Virtual, Online
Period: 6/13/22 – 6/15/22

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Hardware and Architecture
  • Software
  • Electrical and Electronic Engineering
