TY - GEN
T1 - Efficient name disambiguation for large-scale databases
AU - Huang, Jian
AU - Ertekin, Seyda
AU - Giles, C. Lee
N1 - Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2006
Y1 - 2006
N2 - Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors and faster prediction time than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers yielding 490 authors and achieved 90.6% pairwise-F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
UR - http://www.scopus.com/inward/record.url?scp=33750287715&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33750287715&partnerID=8YFLogxK
U2 - 10.1007/11871637_53
DO - 10.1007/11871637_53
M3 - Conference contribution
AN - SCOPUS:33750287715
SN - 3540453741
SN - 9783540453741
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 536
EP - 544
BT - Knowledge Discovery in Databases
PB - Springer Verlag
T2 - 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2006
Y2 - 18 September 2006 through 22 September 2006
ER -