TY - CONF
T1 - Litz: Elastic framework for high-performance distributed machine learning
T2 - 2018 USENIX Annual Technical Conference, USENIX ATC 2018
AU - Qiao, Aurick
AU - Aghayev, Abutalib
AU - Yu, Weiren
AU - Chen, Haoyang
AU - Ho, Qirong
AU - Gibson, Garth A.
AU - Xing, Eric P.
N1 - Funding Information:
We thank the anonymous reviewers for their valuable feedback. We thank the members and companies of the PDL Consortium: Alibaba Group, Broadcom, Dell EMC, Facebook, Google, HP Enterprise, Hitachi, IBM Research, Intel, Micron, Microsoft Research, MongoDB, NetApp, Oracle, Salesforce, Samsung, Seagate Technology, Two Sigma, Toshiba, Veritas and Western Digital for their interest, insights, feedback, and support. Our work was supported by the U.S. National Science Foundation awards IIS1447676 and CCF1629559, the Natural Sciences and Engineering Research Council of Canada award PGSD-471301-2015, as well as the Beijing Advanced Innovation Center for Big Data and Brain Computing at Beihang University.
Publisher Copyright:
© Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018. All rights reserved.
PY - 2018
Y1 - 2018
N2 - Machine Learning (ML) is an increasingly popular application in the cloud and data center, inspiring new algorithmic and systems techniques that leverage the unique properties of ML applications to improve their distributed performance by orders of magnitude. However, applications built using these techniques tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of multi-tenant environments. Existing distributed frameworks are either inelastic or offer programming models that are incompatible with the techniques employed by high-performance ML applications. Motivated by these trends, we present Litz, an elastic framework supporting distributed ML applications. We categorize the wide variety of techniques employed by these applications into three general themes (stateful workers, model scheduling, and relaxed consistency) which are collectively supported by Litz's programming model. Our implementation of Litz's execution system transparently enables elasticity and low-overhead execution. We implement several popular ML applications using Litz, and show that they can scale in and out quickly to adapt to changing resource availability, as well as how a scheduler can leverage elasticity for faster job completion and more efficient resource allocation. Lastly, we show that Litz enables elasticity without compromising performance, achieving competitive performance with state-of-the-art non-elastic ML frameworks.
UR - http://www.scopus.com/inward/record.url?scp=85074529407&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074529407&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85074529407
T3 - Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018
SP - 631
EP - 643
BT - Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018
PB - USENIX Association
Y2 - 11 July 2018 through 13 July 2018
ER -