TY - JOUR
T1 - Regression analysis for massive datasets
AU - Fan, Tsai Hung
AU - Lin, Dennis K.J.
AU - Cheng, Kuang Fu
N1 - Funding Information: The authors would like to thank the referees for their constructive comments. This work was supported by the MOE program for promoting academic excellence of universities under grant number 91-H-FA07-1-4.
PY - 2007/6
Y1 - 2007/6
AB - Recent decades have witnessed a revolution in information technology, and routine collection of systematically generated data is now commonplace. Databases with hundreds of fields (variables) and billions of records (observations) are not unusual. This poses difficulties for classical data analysis methods, mainly because of limits on computer memory and computational cost (in time, for example). In this paper, we propose an intelligent regression analysis methodology suitable for modeling massive datasets. The basic idea is to split the entire dataset into several blocks, apply classical regression techniques to the data in each block, and finally combine the resulting regression estimates via weighted averages. Theoretical justification of the proposed method is given, and its empirical performance in an extensive simulation study is discussed.
UR - http://www.scopus.com/inward/record.url?scp=34147190333&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34147190333&partnerID=8YFLogxK
U2 - 10.1016/j.datak.2006.06.017
DO - 10.1016/j.datak.2006.06.017
M3 - Article
AN - SCOPUS:34147190333
SN - 0169-023X
VL - 61
SP - 554
EP - 562
JO - Data and Knowledge Engineering
JF - Data and Knowledge Engineering
IS - 3
ER -
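
The split, fit, and combine idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state which weights the authors use, so the weighting below is an assumption: each block's ordinary least squares estimate is weighted by that block's Gram matrix X_k'X_k, a choice under which the combined estimate coincides with the full-data OLS fit. The function name `blockwise_ols`, the simulated data, and the block count are all hypothetical and are not taken from the paper.

```python
import numpy as np


def blockwise_ols(X, y, n_blocks):
    """Fit OLS on each block of rows and combine the block estimates.

    Assumption (not from the paper): block k is weighted by its Gram
    matrix X_k' X_k, so the result is a matrix-weighted average of the
    per-block coefficient vectors.
    """
    p = X.shape[1]
    weight_sum = np.zeros((p, p))      # running sum of X_k' X_k
    weighted_beta_sum = np.zeros(p)    # running sum of (X_k' X_k) beta_k

    blocks = zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks))
    for X_k, y_k in blocks:
        # Classical regression on one block; only this block is needed here.
        beta_k, *_ = np.linalg.lstsq(X_k, y_k, rcond=None)
        W_k = X_k.T @ X_k
        weight_sum += W_k
        weighted_beta_sum += W_k @ beta_k

    # Weighted average: (sum_k W_k)^{-1} sum_k W_k beta_k
    return np.linalg.solve(weight_sum, weighted_beta_sum)


# Simulated data for illustration only.
rng = np.random.default_rng(0)
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_beta = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

beta_blocks = blockwise_ols(X, y, n_blocks=20)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_blocks, beta_full))
```

In a genuinely memory-bound setting the blocks would be read from disk one at a time rather than sliced from an in-memory array; only the p-by-p weight sum and the length-p weighted sum need to be kept between blocks.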