Enhancing cross document coreference of web documents with context similarity and very large scale text categorization

Jian Huang, Pucktada Treeratpituk, Sarah M. Taylor, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

Cross Document Coreference (CDC) is the task of constructing the coreference chain for mentions of a person across a set of documents. This work offers a holistic view of using document-level categories, sub-document-level context, and extracted entities and relations for the CDC task. We train a categorization component with an efficient flat algorithm using thousands of ODP categories and over a million web documents. We propose to use ranked categories as coreference information, which is particularly suitable for web documents that differ widely in style and content. An ensemble composite coreference function, amenable to inactive features, combines these three levels of evidence for disambiguation. A thorough feature importance study is conducted to analyze how these three components contribute to the coreference results. The overall solution is evaluated using the WePS benchmark data and demonstrates superior performance.

Original language: English (US)
Title of host publication: Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference
Pages: 483-491
Number of pages: 9
Volume: 2
State: Published - 2010
Event: 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China
Duration: Aug 23 2010 to Aug 27 2010

Other

Other: 23rd International Conference on Computational Linguistics, Coling 2010
Country/Territory: China
City: Beijing
Period: 8/23/10 to 8/27/10

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Linguistics and Language
