On the Feasibility of Distributed Kernel Regression for Big Data

Chen Xu, Yongquan Zhang, Runze Li, Xindong Wu

Research output: Contribution to journal › Article › peer-review

23 Scopus citations

Abstract

In Big Data applications, massive datasets with huge numbers of observations are frequently encountered. To deal with such massive datasets, a divide-and-conquer scheme (e.g., MapReduce) is often used for the analysis of Big Data. With such a strategy, a large dataset (e.g., a centralized real database or a virtual database with distributed data sources) is first divided into smaller manageable segments; the final output is then aggregated from the individual outputs of the segments. Despite its popularity in practice, it remains largely unknown whether such a distributive strategy provides valid theoretical inferences to the original data. In this paper, we address this fundamental issue for the distributed kernel regression (DKR) problem, where the algorithmic feasibility is measured by the generalization performance of the resulting estimator. To justify DKR, a uniform convergence rate is needed for bounding the generalization error over the individual outputs, which brings new and challenging issues in the Big Data setup. Using a sample dependent kernel dictionary, we show that, with proper data segmentation, DKR leads to an estimator that is generalization consistent to the unknown regression function. This result theoretically justifies DKR and sheds light on more advanced distributive algorithms for processing Big Data. The promising performance of the method is supported by both simulation and real data examples.
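
The divide-and-conquer scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: in place of the sample-dependent kernel dictionary mentioned above, it assumes plain kernel ridge regression with a Gaussian kernel on each segment and aggregates by simple averaging. The function names, the bandwidth, the ridge parameter, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=0.5):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def fit_segment(x, y, ridge=1e-2, bandwidth=0.5):
    """Kernel ridge fit on one segment (an assumed local estimator)."""
    k = gaussian_kernel(x, x, bandwidth)
    # Solve (K + ridge * n * I) alpha = y for the local coefficients.
    alpha = np.linalg.solve(k + ridge * len(x) * np.eye(len(x)), y)
    return lambda x_new: gaussian_kernel(x_new, x, bandwidth) @ alpha

def distributed_kernel_regression(x, y, n_segments, rng):
    """Split the sample at random, fit each segment, average the predictors."""
    order = rng.permutation(len(x))
    predictors = [
        fit_segment(x[idx], y[idx]) for idx in np.array_split(order, n_segments)
    ]
    return lambda x_new: np.mean([p(x_new) for p in predictors], axis=0)

# Synthetic example (assumed data): noisy samples of sin(pi * x) on [-1, 1].
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=(600, 1))
y_train = np.sin(np.pi * x_train[:, 0]) + 0.1 * rng.standard_normal(600)
x_test = np.linspace(-0.9, 0.9, 50).reshape(-1, 1)

f_hat = distributed_kernel_regression(x_train, y_train, n_segments=6, rng=rng)
mse = np.mean((f_hat(x_test) - np.sin(np.pi * x_test[:, 0])) ** 2)
print(f"test MSE against the true function: {mse:.4f}")
```

Each segment requires solving a linear system of size n/m rather than n, which is the computational appeal of the scheme; whether the averaged estimator remains consistent is the question the paper addresses.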

Original language: English (US)
Article number: 7520638
Pages (from-to): 3041-3052
Number of pages: 12
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 28
Issue number: 11
State: Published - Nov 1 2016

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics
