TY - GEN
T1 - DeepClean
T2 - 5th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2018
AU - Zhang, Xinyang
AU - Ji, Yujie
AU - Nguyen, Chanh
AU - Wang, Ting
N1 - Funding Information:
We would like to thank the anonymous reviewers for their valuable suggestions for improving this paper. This material is based upon work supported by the National Science Foundation under Grant Nos. 1566526 and 1718787. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© 2018 IEEE.
PY - 2019/1/31
Y1 - 2019/1/31
AB - As a critical task in the data analysis pipeline, data cleaning is notoriously labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the inherent limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not altogether impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction - it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation - it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair - by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.
UR - http://www.scopus.com/inward/record.url?scp=85062867879&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062867879&partnerID=8YFLogxK
U2 - 10.1109/DSAA.2018.00039
DO - 10.1109/DSAA.2018.00039
M3 - Conference contribution
AN - SCOPUS:85062867879
T3 - Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018
SP - 283
EP - 292
BT - Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018
A2 - Eliassi-Rad, Tina
A2 - Wang, Wei
A2 - Cattuto, Ciro
A2 - Provost, Foster
A2 - Ghani, Rayid
A2 - Bonchi, Francesco
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 1 October 2018 through 4 October 2018
ER -