Online Person Name Disambiguation with Constraints

Madian Khabsa, Pucktada Treeratpituk, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

27 Scopus citations

Abstract

While many clustering techniques have been successfully applied to the person name disambiguation problem, most do not address two main practical issues: allowing constraints to be added to the clustering process, and allowing data to be added incrementally without clustering the entire database. Constraints can be particularly useful in a system such as a digital library, where users are allowed to make corrections to the disambiguated result. For example, a user correction specifying that a record does not belong to an author could be kept as a cannot-link constraint to be used in any future disambiguation (such as when new documents are added). Besides such user corrections, constraints also allow background heuristics to be encoded into the disambiguation process. We propose a constraint-based clustering algorithm for person name disambiguation, based on the density-based clustering algorithm DBSCAN combined with a pairwise distance based on random forests. We further propose an extension to DBSCAN to handle online clustering so that the disambiguation process can be done iteratively as new data points are added. Our algorithm utilizes similarity features based on both metadata information and citation similarity. We implement two types of clustering constraints to demonstrate the concept. Experiments on the CiteSeer data show that our model can achieve 0.95 pairwise F1 and 0.79 cluster F1. The presence of constraints also consistently improves the disambiguation result across different combinations of features.
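
The idea of adding one record at a time while honoring cannot-link constraints can be illustrated with a minimal sketch. This is not the authors' algorithm: it keeps only the eps-neighborhood test from DBSCAN, omits the minimum-points density check, and replaces the random-forest distance with a toy distance. The names add_record, toy_dist, and positions are hypothetical and chosen for illustration only.

```python
from typing import Callable, FrozenSet, List, Set


def add_record(
    clusters: List[Set[str]],
    new: str,
    dist: Callable[[str, str], float],
    eps: float,
    cannot_link: Set[FrozenSet[str]],
) -> List[Set[str]]:
    """Insert one record into existing clusters without reclustering everything.

    Simplified sketch: a cluster is absorbed when at least one of its records
    lies within eps of the new record and no cannot-link pair would end up in
    the same cluster. The input clusters are not modified.
    """
    merged = {new}          # cluster that will contain the new record
    untouched = []          # clusters left as they were
    for cluster in clusters:
        near = any(dist(new, r) <= eps for r in cluster)
        # A single cannot-link pair between the two groups blocks the merge.
        blocked = any(
            frozenset((a, b)) in cannot_link for a in merged for b in cluster
        )
        if near and not blocked:
            merged |= cluster
        else:
            untouched.append(set(cluster))
    return untouched + [merged]


# Toy stand-in for a learned pairwise distance (assumption, not from the paper).
positions = {"a": 0.0, "b": 0.1, "c": 0.5, "d": 0.3}


def toy_dist(x: str, y: str) -> float:
    return abs(positions[x] - positions[y])


existing = [{"a", "b"}, {"c"}]

# Without constraints, record "d" bridges both clusters and they merge.
unconstrained = add_record(existing, "d", toy_dist, 0.25, set())

# A user correction says "d" is not the same person as "c".
constrained = add_record(existing, "d", toy_dist, 0.25, {frozenset(("c", "d"))})
```

In this toy run, the unconstrained call yields a single cluster of all four records, while the cannot-link pair keeps "c" separate and places "d" with "a" and "b". The result depends on the order in which clusters are visited, which a fuller treatment would need to address.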

Original language: English (US)
Title of host publication: JCDL 2015 - Proceedings of the 15th ACM/IEEE-CE Joint Conference on Digital Libraries
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 37-46
Number of pages: 10
ISBN (Electronic): 9781450335942
State: Published - Jun 21 2015
Event: 15th ACM/IEEE-CE Joint Conference on Digital Libraries, JCDL 2015 - Knoxville, United States
Duration: Jun 21 2015 to Jun 25 2015

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
Volume: 2015-June
ISSN (Print): 1552-5996

Other

Other: 15th ACM/IEEE-CE Joint Conference on Digital Libraries, JCDL 2015
Country/Territory: United States
City: Knoxville
Period: 6/21/15 to 6/25/15

All Science Journal Classification (ASJC) codes

  • General Engineering
