A supervised learning approach to entity matching between scholarly big datasets

Jian Wu, Athar Sefid, Allen C. Ge, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

8 Scopus citations

Abstract

Bibliographic metadata in scientific documents are essential to the indexing and retrieval of scholarly big data for production search engines and bibliometrics research. Crawl-based digital library search engines can harvest millions of documents efficiently, but the metadata produced by automatic extractors are often noisy, incomplete, or marred by parsing errors. These metadata can be cleaned given a reference database. In this work, we develop a supervised machine learning approach to match entities in a target database to a reference database; the matches can then be used to clean metadata in the target database. The approach leverages a number of features extracted from document headers, which are available from automatic extraction results. After tuning combinations of hyper-parameters and sampling strategies, the best Support Vector Machine, Logistic Regression, Random Forest, and Naïve Bayes models perform comparably, with an F1-measure of about 90%, outperforming an information-retrieval-only method by about 14%, as evaluated with cross-validation.
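As an illustration of the kind of pairwise setup the abstract describes, the sketch below turns one (target, reference) candidate pair into a small feature vector and computes an F1-measure over binary match predictions. The abstract does not list the actual header features, so the three used here (title token overlap, author token overlap, year agreement), the record layout, and the function names are all assumptions made for the example, not the authors' implementation.

```python
import re

def tokens(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pair_features(target, reference):
    """Feature vector for one candidate pair (hypothetical feature set)."""
    title_sim = jaccard(tokens(target["title"]), tokens(reference["title"]))
    author_sim = jaccard(
        tokens(" ".join(target["authors"])),
        tokens(" ".join(reference["authors"])),
    )
    # Year agreement counts only when the target year was actually extracted.
    year = target.get("year")
    same_year = 1.0 if year is not None and year == reference.get("year") else 0.0
    return [title_sim, author_sim, same_year]

def f1_measure(y_true, y_pred):
    """F1 of the positive (match) class for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example records (made up for illustration only).
target = {"title": "A Supervised Learning Approach to Entity Matching",
          "authors": ["Jian Wu"], "year": 2017}
reference = {"title": "a supervised learning approach to entity matching",
             "authors": ["Jian Wu", "C. Lee Giles"], "year": 2017}
features = pair_features(target, reference)
```

Vectors like `features`, labeled as match or non-match, could then be passed to any of the classifier families named in the abstract and scored with `f1_measure` under cross-validation.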

Original language: English (US)
Title of host publication: Proceedings of the Knowledge Capture Conference, K-CAP 2017
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450355537
State: Published - Dec 4 2017
Event: 9th International Conference on Knowledge Capture, K-CAP 2017 - Austin, United States
Duration: Dec 4 2017 to Dec 6 2017

Publication series

Name: Proceedings of the Knowledge Capture Conference, K-CAP 2017

Other

Other: 9th International Conference on Knowledge Capture, K-CAP 2017
Country/Territory: United States
City: Austin
Period: 12/4/17 to 12/6/17

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Software
  • Computer Science Applications
  • Information Systems
