TY - JOUR
T1 - MIDIA: exploring denoising autoencoders for missing data imputation
AU - Ma, Qian
AU - Lee, Wang Chien
AU - Fu, Tao Yang
AU - Gu, Yu
AU - Yu, Ge
N1 - Funding Information:
This work is supported by the China Postdoctoral Science Foundation (2019M661077), the National Science Foundation (Grant No. IIS-1717084), the National Natural Science Foundation of China (Grant Nos. 61772102, 61751205), and the Liaoning Revitalization Talents Program (XLYC1807158).
Publisher Copyright:
© 2020, The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature.
PY - 2020/11/1
Y1 - 2020/11/1
AB - Due to the ubiquitous presence of missing values (MVs) in real-world datasets, MV imputation, which aims to recover MVs, is an important and fundamental data preprocessing step that various data analytics and mining tasks rely on to achieve good performance. To impute MVs, a typical idea is to exploit the correlations among the attributes of the data. However, those correlations are usually complex and thus difficult to identify. Accordingly, we develop a new deep learning model called MIssing Data Imputation denoising Autoencoder (MIDIA) that effectively imputes the MVs in a given dataset by exploring non-linear correlations between missing values and non-missing values. Additionally, by considering various data missing patterns, we propose two effective MV imputation approaches based on the proposed MIDIA model, namely MIDIA-Sequential and MIDIA-Batch. MIDIA-Sequential imputes the MVs attribute by attribute, training an independent MIDIA model for each incomplete attribute. By contrast, MIDIA-Batch imputes the MVs in one batch by training a uniform MIDIA model. Finally, we evaluate the proposed approaches experimentally against existing MV imputation algorithms. The experimental results demonstrate that both MIDIA-Sequential and MIDIA-Batch achieve significantly higher imputation accuracy than existing solutions, and that the proposed approaches can handle various data missing patterns and data types. Specifically, MIDIA-Sequential performs better than MIDIA-Batch for data with a monotone missing pattern, while MIDIA-Batch performs better than MIDIA-Sequential for data with a general missing pattern.
UR - http://www.scopus.com/inward/record.url?scp=85088600390&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85088600390&partnerID=8YFLogxK
U2 - 10.1007/s10618-020-00706-8
DO - 10.1007/s10618-020-00706-8
M3 - Article
AN - SCOPUS:85088600390
SN - 1384-5810
VL - 34
SP - 1859
EP - 1897
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
IS - 6
ER -