A supervised learning approach to entity matching between scholarly big datasets

Jian Wu, Athar Sefid, Allen C. Ge, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Scopus citations

Abstract

Bibliographic metadata in scientific documents is essential for indexing and retrieving scholarly big data, both in production search engines and in bibliometric research. Crawl-based digital library search engines can harvest millions of documents efficiently, but the metadata produced by automatic extractors is often noisy, incomplete, or marred by parsing errors. Such metadata can be cleaned given a reference database. In this work, we develop a supervised machine learning approach that matches entities in a target database to entities in a reference database; the matches can then be used to clean metadata in the target database. The approach leverages a number of features extracted from the headers available in automatic extraction results. After tuning combinations of hyper-parameters and sampling strategies, the best Support Vector Machine, Logistic Regression, Random Forest, and Naïve Bayes models achieve comparable results, with an F1-measure of about 90% under cross-validation, outperforming an information-retrieval-only method by about 14%.
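
As a rough illustration of the idea in the abstract, the Python sketch below turns one (target, reference) record pair into similarity features, fits a small classifier on labeled pairs, and scores predictions with the F1-measure. The record fields, the three features, the toy training data, and all function names are assumptions made for this example; the abstract does not list the actual features or implementation, and the hand-written logistic regression here merely stands in for the models it names.

```python
import math
import re

def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pair_features(target, reference):
    """Hypothetical header features for one candidate pair (assumed, not from the paper)."""
    title_sim = jaccard(tokens(target["title"]), tokens(reference["title"]))
    author_sim = jaccard(tokens(" ".join(target["authors"])),
                         tokens(" ".join(reference["authors"])))
    same_year = 1.0 if target["year"] == reference["year"] else 0.0
    return [title_sim, author_sim, same_year]

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            grad = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b

def predict(model, x):
    """Return 1 (match) if the linear score is positive, else 0 (non-match)."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

def f1_score(y_true, y_pred):
    """F1-measure: harmonic mean of precision and recall for the match class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A noisy extracted header (target) and a clean reference record; both are illustrative.
target = {"title": "A supervised learning approach to entity matching",
          "authors": ["J Wu", "A Sefid"], "year": 2017}
reference = {"title": "A Supervised Learning Approach to Entity Matching "
                      "between Scholarly Big Datasets",
             "authors": ["Jian Wu", "Athar Sefid"], "year": 2017}
feats = pair_features(target, reference)

# Toy labeled feature vectors (1 = same paper, 0 = different papers); made up for the sketch.
X_train = [[1.0, 1.0, 1.0], [0.8, 1.0, 1.0], [0.1, 0.0, 0.0], [0.2, 0.0, 1.0]]
y_train = [1, 1, 0, 0]
model = train_logistic(X_train, y_train)
label = predict(model, feats)
```

In practice the labeled pairs would come from a manually verified sample, and the F1-measure would be averaged over cross-validation folds rather than computed on the training pairs.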

Original language: English (US)
Title of host publication: Proceedings of the Knowledge Capture Conference, K-CAP 2017
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450355537
State: Published - Dec 4 2017
Event: 9th International Conference on Knowledge Capture, K-CAP 2017 - Austin, United States
Duration: Dec 4 2017 to Dec 6 2017

Publication series

Name: Proceedings of the Knowledge Capture Conference, K-CAP 2017

Other

Other: 9th International Conference on Knowledge Capture, K-CAP 2017
Country/Territory: United States
City: Austin
Period: 12/4/17 to 12/6/17

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Software
  • Computer Science Applications
  • Information Systems
