TY - GEN
T1 - A supervised learning approach to entity matching between scholarly big datasets
AU - Wu, Jian
AU - Sefid, Athar
AU - Ge, Allen C.
AU - Giles, C. Lee
PY - 2017/12/4
Y1 - 2017/12/4
AB - Bibliographic metadata in scientific documents are essential for indexing and retrieving scholarly big data in production search engines and bibliometrics research. Crawl-based digital library search engines can harvest millions of documents efficiently, but the metadata produced by automatic extractors are often noisy, incomplete, or affected by parsing errors. These metadata can be cleaned given a reference database. In this work, we develop a supervised machine learning approach to match entities in a target database to a reference database, which can then be used to clean metadata in the target database. The approach leverages a number of features extracted from the headers available in automatic extraction results. After tuning combinations of hyper-parameters and sampling strategies, the best Support Vector Machine, Logistic Regression, Random Forest, and Naïve Bayes models give comparable results, with an F1-measure of about 90%, outperforming an information-retrieval-only method by about 14%, as evaluated with cross-validation.
UR - http://www.scopus.com/inward/record.url?scp=85040623747&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85040623747&partnerID=8YFLogxK
U2 - 10.1145/3148011.3154470
DO - 10.1145/3148011.3154470
M3 - Conference contribution
AN - SCOPUS:85040623747
T3 - Proceedings of the Knowledge Capture Conference, K-CAP 2017
BT - Proceedings of the Knowledge Capture Conference, K-CAP 2017
PB - Association for Computing Machinery, Inc
T2 - 9th International Conference on Knowledge Capture, K-CAP 2017
Y2 - 4 December 2017 through 6 December 2017
ER -