Litz: Elastic framework for high-performance distributed machine learning

Aurick Qiao, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson, Eric P. Xing

Research output: Chapter in Book/Report/Conference proceedingConference contribution

28 Scopus citations

Abstract

Machine Learning (ML) is an increasingly popular application in the cloud and data-center, inspiring new algorithmic and systems techniques that leverage unique properties of ML applications to improve their distributed performance by orders of magnitude. However, applications built using these techniques tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of multi-tenant environments. Existing distributed frameworks are either inelastic, or offer programming models which are incompatible with the techniques employed by high-performance ML applications. Motivated by these trends, we present Litz, an elastic framework supporting distributed ML applications. We categorize the wide variety of techniques employed by these applications into three general themes - stateful workers, model scheduling, and relaxed consistency - which are collectively supported by Litz's programming model. Our implementation of Litz's execution system transparently enables elasticity and low-overhead execution. We implement several popular ML applications using Litz, and show that they can scale in and out quickly to adapt to changing resource availability, as well as how a scheduler can leverage elasticity for faster job completion and more efficient resource allocation. Lastly, we show that Litz enables elasticity without compromising performance, achieving competitive performance with state-of-the-art non-elastic ML frameworks.

Original languageEnglish (US)
Title of host publicationProceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018
PublisherUSENIX Association
Pages631-643
Number of pages13
ISBN (Electronic)9781939133021
StatePublished - 2020
Event2018 USENIX Annual Technical Conference, USENIX ATC 2018 - Boston, United States
Duration: Jul 11 2018Jul 13 2018

Publication series

NameProceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018

Conference

Conference2018 USENIX Annual Technical Conference, USENIX ATC 2018
Country/TerritoryUnited States
CityBoston
Period7/11/187/13/18

All Science Journal Classification (ASJC) codes

  • General Computer Science

Fingerprint

Dive into the research topics of 'Litz: Elastic framework for high-performance distributed machine learning'. Together they form a unique fingerprint.

Cite this