DeepClean: Data cleaning via question asking

Xinyang Zhang, Yujie Ji, Chanh Nguyen, Ting Wang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Scopus citation

Abstract

As one critical task in the data analysis pipeline, data cleaning is notoriously human labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not altogether impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction - it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation - it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair - by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains, and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.
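
To make the three stages concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes stage (i) has already produced a single pattern (the `capital` attribute depends on the `country` attribute), and it replaces the free-text QA interface with a hypothetical dictionary lookup. All names here (`toy_qa`, `generate_question`, `check_row`) are invented for illustration.

```python
from typing import Callable, Optional

# Hypothetical stand-in for a QA interface over a free-text corpus.
# A real system would query a QA model; this toy version is a lookup table.
TOY_ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is the capital of Italy?": "Rome",
}


def toy_qa(question: str) -> Optional[str]:
    """Return an answer string, or None when the source has no answer."""
    return TOY_ANSWERS.get(question)


def generate_question(row: dict, key_attr: str, target_attr: str) -> str:
    """Stage (ii), simplified: turn one tuple into one validation question."""
    return f"What is the {target_attr} of {row[key_attr]}?"


def check_row(
    row: dict,
    key_attr: str,
    target_attr: str,
    ask: Callable[[str], Optional[str]],
) -> dict:
    """Stage (iii), simplified: compare the answer with the stored value."""
    question = generate_question(row, key_attr, target_attr)
    answer = ask(question)
    value = row.get(target_attr)

    if answer is None:
        status = "unknown"      # the source cannot confirm or refute the value
    elif value is None or value == "":
        status = "completed"    # missing value filled from the answer
    elif value.strip().lower() == answer.strip().lower():
        status = "ok"           # stored value agrees with the answer
    else:
        status = "repair"       # mismatch: suggest the answer as a fix

    suggested = answer if status in ("completed", "repair") else value
    return {"question": question, "status": status, "suggested": suggested}


rows = [
    {"country": "France", "capital": "Paris"},
    {"country": "Italy", "capital": "Milan"},
    {"country": "Italy", "capital": None},
    {"country": "Spain", "capital": "Madrid"},
]

for row in rows:
    print(row, "->", check_row(row, "country", "capital", toy_qa))
```

In this sketch a mismatch is always resolved in favor of the knowledge source, and an unanswered question leaves the value untouched. Both are simplifying assumptions; the abstract describes iterative question asking, which this single-question check does not model.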

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018
Editors: Tina Eliassi-Rad, Wei Wang, Ciro Cattuto, Foster Provost, Rayid Ghani, Francesco Bonchi
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 283-292
Number of pages: 10
ISBN (Electronic): 9781538650905
State: Published - Jan 31 2019
Event: 5th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2018 - Turin, Italy
Duration: Oct 1 2018 to Oct 4 2018

Publication series

Name: Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018

Conference

Conference: 5th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2018
Country/Territory: Italy
City: Turin
Period: 10/1/18 to 10/4/18

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Information Systems and Management
  • Statistics, Probability and Uncertainty
  • Computer Networks and Communications
