TY - GEN
T1 - Efficient topic-based unsupervised name disambiguation
AU - Song, Yang
AU - Huang, Jian
AU - Councill, Isaac G.
AU - Li, Jia
AU - Giles, C. Lee
PY - 2007
Y1 - 2007
AB - Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach to disambiguate names. In the first stage, two novel topic-based models are proposed by extending two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. After learning an initial model, the topic distributions are treated as feature sets and names are disambiguated by leveraging a hierarchical agglomerative clustering method. Experiments on web data and scientific documents from CiteSeer indicate that our approach consistently outperforms other unsupervised learning methods such as spectral clustering and DBSCAN clustering and could be extended to other research fields. We empirically addressed the issue of scalability by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.
UR - http://www.scopus.com/inward/record.url?scp=36348962507&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=36348962507&partnerID=8YFLogxK
U2 - 10.1145/1255175.1255243
DO - 10.1145/1255175.1255243
M3 - Conference contribution
AN - SCOPUS:36348962507
SN - 1595936440
SN - 9781595936448
T3 - Proceedings of the ACM International Conference on Digital Libraries
SP - 342
EP - 351
BT - Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007
T2 - 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment
Y2 - 18 June 2007 through 23 June 2007
ER -