TY - GEN
T1 - Cocktail
T2 - 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
AU - Gunasekaran, Jashwant Raj
AU - Mishra, Cyan Subhra
AU - Thinakaran, Prashanth
AU - Sharma, Bikash
AU - Kandemir, Mahmut Taylan
AU - Das, Chita R.
N1 - Publisher Copyright: © 2022 by The USENIX Association. All Rights Reserved.
PY - 2022
Y1 - 2022
N2 - With a growing demand for adopting ML models for a variety of application services, it is vital that the frameworks serving these models are capable of delivering highly accurate predictions with minimal latency along with reduced deployment costs in a public cloud environment. Despite high latency, prior works in this domain are crucially limited by the accuracy offered by individual models. Intuitively, model ensembling can address the accuracy gap by intelligently combining different models in parallel. However, selecting the appropriate models dynamically at runtime to meet the desired accuracy with low latency at minimal deployment cost is a nontrivial problem. Towards this, we propose Cocktail, a cost-effective ensembling-based model serving framework. Cocktail comprises two key components: (i) a dynamic model selection framework, which reduces the number of models in the ensemble while satisfying the accuracy and latency requirements; and (ii) an adaptive resource management (RM) framework that employs a distributed proactive autoscaling policy to efficiently allocate resources for the models. The RM framework leverages transient virtual machine (VM) instances to reduce the deployment cost in a public cloud. A prototype implementation of Cocktail on the AWS EC2 platform and exhaustive evaluations using a variety of workloads demonstrate that Cocktail can reduce deployment cost by 1.45×, while providing a 2× reduction in latency and satisfying the target accuracy for up to 96% of the requests, when compared to state-of-the-art model-serving frameworks.
UR - http://www.scopus.com/inward/record.url?scp=85139349183&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85139349183&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85139349183
T3 - Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
SP - 1041
EP - 1057
BT - Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022
PB - USENIX Association
Y2 - 4 April 2022 through 6 April 2022
ER -