TY - GEN
T1 - Online Person Name Disambiguation with Constraints
AU - Khabsa, Madian
AU - Treeratpituk, Pucktada
AU - Giles, C. Lee
N1 - Publisher Copyright: © 2015 ACM.
PY - 2015/6/21
Y1 - 2015/6/21
AB - While many clustering techniques have been successfully applied to the person name disambiguation problem, most do not address two main practical issues: allowing constraints to be added to the clustering process, and allowing data to be added incrementally without reclustering the entire database. Constraints are particularly useful in a system such as a digital library, where users are allowed to make corrections to the disambiguated result. For example, a user correction specifying that a record does not belong to an author could be kept as a cannot-link constraint to be used in any future disambiguation (such as when new documents are added). Besides such user corrections, constraints also allow background heuristics to be encoded into the disambiguation process. We propose a constraint-based clustering algorithm for person name disambiguation, based on the density-based clustering algorithm DBSCAN combined with a pairwise distance based on random forests. We further propose an extension to DBSCAN to handle online clustering so that the disambiguation process can be done iteratively as new data points are added. Our algorithm utilizes similarity features based on both metadata information and citation similarity. We implement two types of clustering constraints to demonstrate the concept. Experiments on the CiteSeer data show that our model can achieve 0.95 pairwise F1 and 0.79 cluster F1. The presence of constraints also consistently improves the disambiguation result across different combinations of features.
UR - http://www.scopus.com/inward/record.url?scp=84952018463&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84952018463&partnerID=8YFLogxK
U2 - 10.1145/2756406.2756915
DO - 10.1145/2756406.2756915
M3 - Conference contribution
AN - SCOPUS:84952018463
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 37
EP - 46
BT - JCDL 2015 - Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 15th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2015
Y2 - 21 June 2015 through 25 June 2015
ER -