TY - JOUR
T1 - REMIAN
T2 - Real-Time and Error-Tolerant Missing Value Imputation
AU - Ma, Qian
AU - Gu, Yu
AU - Lee, Wang Chien
AU - Yu, Ge
AU - Liu, Hongbo
AU - Wu, Xindong
N1 - Funding Information:
This work is partly supported by the China Postdoctoral Science Foundation (Grant No. 2019M661077), the National Natural Science Foundation of China (Grant Nos. 61772102, 61751205, and 61872070), the National Science Foundation (Grant No. IIS-1717084), the Liaoning Collaborative Fund (Grant No. 2020-HYLH-17), and the Liaoning Revitalization Talents Program (Grant No. XLYC1807158). Authors’ addresses: Q. Ma, Dalian Maritime University, Dalian, China, 116026; Y. Gu (corresponding author), Northeastern University, Shenyang, China, 110819; W.-C. Lee, The Pennsylvania State University, State College; G. Yu, Northeastern University, Shenyang, China, 110819; H. Liu, Dalian Maritime University, Dalian, China, 116026; X. Wu, University of Louisiana at Lafayette, Lafayette. © 2020 Association for Computing Machinery. 1556-4681/2020/09-ART77 https://doi.org/10.1145/3412364
Publisher Copyright:
© 2020 ACM.
PY - 2020/10
Y1 - 2020/10
N2 - Missing value (MV) imputation is a critical preprocessing step for data mining. Nevertheless, existing MV imputation methods are mostly designed for batch processing and thus are not applicable to streaming data, especially poor-quality streams. In this article, we propose a framework, called Real-time and Error-tolerant Missing vAlue ImputatioN (REMAIN), to impute MVs in poor-quality streaming data. Instead of imputing MVs based on all the observed data, REMAIN first initializes the MV imputation model based on a-RANSAC, which is capable of detecting and rejecting anomalies efficiently, and then incrementally updates the model parameters upon the arrival of new data to support real-time MV imputation. As the correlations among attributes of the data may change over time in unforeseeable ways, we devise a deterioration detection mechanism to capture the deterioration of the imputation model and thereby further improve the imputation accuracy. Finally, we conduct an extensive evaluation of the proposed algorithms using real-world and synthetic datasets. Experimental results demonstrate that REMAIN achieves significantly higher imputation accuracy than existing solutions. Meanwhile, REMAIN reduces time cost by up to one order of magnitude compared with existing approaches.
AB - Missing value (MV) imputation is a critical preprocessing step for data mining. Nevertheless, existing MV imputation methods are mostly designed for batch processing and thus are not applicable to streaming data, especially poor-quality streams. In this article, we propose a framework, called Real-time and Error-tolerant Missing vAlue ImputatioN (REMAIN), to impute MVs in poor-quality streaming data. Instead of imputing MVs based on all the observed data, REMAIN first initializes the MV imputation model based on a-RANSAC, which is capable of detecting and rejecting anomalies efficiently, and then incrementally updates the model parameters upon the arrival of new data to support real-time MV imputation. As the correlations among attributes of the data may change over time in unforeseeable ways, we devise a deterioration detection mechanism to capture the deterioration of the imputation model and thereby further improve the imputation accuracy. Finally, we conduct an extensive evaluation of the proposed algorithms using real-world and synthetic datasets. Experimental results demonstrate that REMAIN achieves significantly higher imputation accuracy than existing solutions. Meanwhile, REMAIN reduces time cost by up to one order of magnitude compared with existing approaches.
UR - http://www.scopus.com/inward/record.url?scp=85092695392&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85092695392&partnerID=8YFLogxK
U2 - 10.1145/3412364
DO - 10.1145/3412364
M3 - Article
AN - SCOPUS:85092695392
SN - 1556-4681
VL - 14
JO - ACM Transactions on Knowledge Discovery from Data
JF - ACM Transactions on Knowledge Discovery from Data
IS - 6
M1 - 3412364
ER -