TY - JOUR
T1 - On the Feasibility of Distributed Kernel Regression for Big Data
AU - Xu, Chen
AU - Zhang, Yongquan
AU - Li, Runze
AU - Wu, Xindong
N1 - Funding Information:
This work is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) grant RGPIN-2016-05024, National Institute on Drug Abuse (NIDA) grants P50 DA10075 and P50 DA036107, US National Science Foundation (NSF) grant DMS 1512422, and Natural Science Foundation of China (NSFC) grants 11301494 and 61573326.
Publisher Copyright:
© 2016 IEEE.
PY - 2016/11/1
Y1 - 2016/11/1
N2 - In Big Data applications, massive datasets with huge numbers of observations are frequently encountered. To deal with such datasets, a divide-and-conquer scheme (e.g., MapReduce) is often used: a large dataset (e.g., a centralized real database or a virtual database with distributed data sources) is first divided into smaller, manageable segments, and the final output is then aggregated from the individual outputs of the segments. Despite its popularity in practice, it remains largely unknown whether such a distributed strategy provides valid theoretical inference for the original data. In this paper, we address this fundamental issue for the distributed kernel regression (DKR) problem, where algorithmic feasibility is measured by the generalization performance of the resulting estimator. To justify DKR, a uniform convergence rate is needed to bound the generalization error over the individual outputs, which raises new and challenging issues in the Big Data setup. Using a sample-dependent kernel dictionary, we show that, with proper data segmentation, DKR leads to an estimator that is generalization consistent with the unknown regression function. This result theoretically justifies DKR and sheds light on more advanced distributed algorithms for processing Big Data. The promising performance of the method is supported by both simulation and real-data examples.
AB - In Big Data applications, massive datasets with huge numbers of observations are frequently encountered. To deal with such datasets, a divide-and-conquer scheme (e.g., MapReduce) is often used: a large dataset (e.g., a centralized real database or a virtual database with distributed data sources) is first divided into smaller, manageable segments, and the final output is then aggregated from the individual outputs of the segments. Despite its popularity in practice, it remains largely unknown whether such a distributed strategy provides valid theoretical inference for the original data. In this paper, we address this fundamental issue for the distributed kernel regression (DKR) problem, where algorithmic feasibility is measured by the generalization performance of the resulting estimator. To justify DKR, a uniform convergence rate is needed to bound the generalization error over the individual outputs, which raises new and challenging issues in the Big Data setup. Using a sample-dependent kernel dictionary, we show that, with proper data segmentation, DKR leads to an estimator that is generalization consistent with the unknown regression function. This result theoretically justifies DKR and sheds light on more advanced distributed algorithms for processing Big Data. The promising performance of the method is supported by both simulation and real-data examples.
UR - http://www.scopus.com/inward/record.url?scp=84994442841&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84994442841&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2016.2594060
DO - 10.1109/TKDE.2016.2594060
M3 - Article
AN - SCOPUS:84994442841
SN - 1041-4347
VL - 28
SP - 3041
EP - 3052
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 11
M1 - 7520638
ER -