TY - GEN
T1 - Joint Coreset Construction and Quantization for Distributed Machine Learning
AU - Lu, Hanlin
AU - Liu, Changchang
AU - Wang, Shiqiang
AU - He, Ting
AU - Narayanan, Vijaykrishnan
AU - Chan, Kevin S.
AU - Pasteris, Stephen
N1 - Funding Information:
This research was partly sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. Narayanan was partly supported by NSF 1317560. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
PY - 2020/6
Y1 - 2020/6
AB - Coresets are small, weighted summaries of larger datasets that aim to provide provable error bounds for machine learning (ML) tasks while significantly reducing communication and computation costs. To achieve a better trade-off between ML error bounds and costs, we propose the first framework to incorporate quantization techniques into the process of coreset construction. Specifically, we theoretically analyze the ML error bounds resulting from a combination of coreset construction and quantization. Based on this analysis, we formulate an optimization problem to minimize the ML error under a fixed communication budget. To improve scalability for large datasets, we identify two proxies of the original objective function and develop efficient algorithms for each. For data distributed across multiple nodes, we further design a novel algorithm that allocates the communication budget to the nodes while minimizing the overall ML error. Through extensive experiments on multiple real-world datasets, we demonstrate the effectiveness and efficiency of our proposed algorithms for a variety of ML tasks. In particular, our algorithms achieve more than 90% data reduction with less than 10% degradation in ML performance in most cases.
UR - http://www.scopus.com/inward/record.url?scp=85090039037&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090039037&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85090039037
T3 - IFIP Networking 2020 Conference and Workshops, Networking 2020
SP - 172
EP - 180
BT - IFIP Networking 2020 Conference and Workshops, Networking 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IFIP Networking Conference and Workshops, Networking 2020
Y2 - 22 June 2020 through 25 June 2020
ER -