Abstract
Cross-Document Coreference (CDC) is the task of constructing the coreference chain for mentions of a person across a set of documents. This work offers a holistic view of using document-level categories, sub-document-level context, and extracted entities and relations for the CDC task. We train a categorization component with an efficient flat algorithm using thousands of ODP categories and over a million web documents. We propose to use ranked categories as coreference information, which is particularly suitable for web documents that differ widely in style and content. An ensemble composite coreference function, amenable to inactive features, combines these three levels of evidence for disambiguation. A thorough feature importance study is conducted to analyze how these three components contribute to the coreference results. The overall solution is evaluated on the WePS benchmark data and demonstrates superior performance.
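As a rough illustration of how a composite function can remain "amenable to inactive features", the sketch below averages per-feature similarity scores over only the features that are available for a given document pair. It is not the paper's actual function: the feature names (`category`, `context`, `entities`), the weights, and the renormalization scheme are all assumptions made for this example.

```python
from typing import Dict, Optional


def composite_similarity(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Optional[float]:
    """Weighted average of similarity scores over active features only.

    A score of None marks a feature as inactive for this document pair
    (for example, no entities were extracted). Inactive features are
    skipped and the remaining weights are renormalized.
    NOTE: hypothetical scheme for illustration, not the published method.
    """
    active = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(weights[name] for name in active)
    if not active or total_weight == 0:
        return None  # no usable evidence for this pair
    weighted_sum = sum(weights[name] * s for name, s in active.items())
    return weighted_sum / total_weight


# Hypothetical weights for the three levels of evidence.
WEIGHTS = {"category": 0.5, "context": 0.3, "entities": 0.2}

# The sub-document context feature is inactive for this pair.
pair_scores = {"category": 0.8, "context": None, "entities": 0.4}
similarity = composite_similarity(pair_scores, WEIGHTS)
```

Renormalizing over the active weights keeps the combined score on the same scale whether or not a feature fires, so a single decision threshold could be applied to every document pair.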
Original language | English (US) |
---|---|
Title of host publication | Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference |
Pages | 483-491 |
Number of pages | 9 |
Volume | 2 |
State | Published - 2010 |
Event | 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China; Duration: Aug 23 2010 → Aug 27 2010 |
Other
Other | 23rd International Conference on Computational Linguistics, Coling 2010 |
---|---|
Country/Territory | China |
City | Beijing |
Period | 8/23/10 → 8/27/10 |
All Science Journal Classification (ASJC) codes
- Language and Linguistics
- Computational Theory and Mathematics
- Linguistics and Language