Rule-based word clustering for document metadata extraction

Hui Han, Kostas Tsioutsiouliklis, Eren Manavoglu, C. Lee Giles, Hongyuan Zha, Xiangmin Zhang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

22 Scopus citations


Text classification remains an important problem for unlabeled text; CiteSeer, a computer science document search engine, uses automatic text classification methods for document indexing. Text classification typically uses the words of a document's original text as the primary feature representation. However, such a representation usually suffers from high dimensionality and feature sparseness. Word clustering is an effective way to reduce feature dimensionality and sparseness and to improve text classification performance. This paper introduces a domain rule-based word clustering method for cluster feature representation. The clusters are formed from various domain databases and from the orthographic properties of words. Besides significant dimensionality reduction, such cluster feature representations yield a 6.6% absolute improvement on average in the classification performance of document header lines and an 8.4% absolute improvement in the overall accuracy of bibliographic field extraction, compared with a feature representation based only on the original text words. Our word clustering even outperforms distributional word clustering in the context of document metadata extraction.
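As a rough illustration of the idea in the abstract, the sketch below maps each raw token to a cluster label using small lookup sets (standing in for domain databases) and a few orthographic checks. The word lists, label names such as `:name:` and `:cap_word:`, the rule order, and the helper names `cluster_of` and `cluster_features` are all assumptions made for this example; they are not the rules or databases used in the paper.

```python
import re

# Hypothetical, tiny stand-ins for domain databases (assumptions, not the paper's data).
FIRST_NAMES = {"hui", "kostas", "eren"}
MONTHS = {"jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"}
AFFILIATION_WORDS = {"university", "department", "institute", "laboratory"}

YEAR_PATTERN = re.compile(r"(19|20)\d{2}")


def cluster_of(token: str) -> str:
    """Map one raw token to a cluster label (illustrative rules only)."""
    stripped = token.strip(".,;:()")
    word = stripped.lower()
    if not word:
        return ":punct:"
    # Pattern-based rules.
    if "@" in token:
        return ":email:"
    if YEAR_PATTERN.fullmatch(word):
        return ":year:"
    if word.isdigit():
        return ":digit:"
    # Database lookups.
    if word in FIRST_NAMES:
        return ":name:"
    if word in MONTHS:
        return ":month:"
    if word in AFFILIATION_WORDS:
        return ":affiliation:"
    # Orthographic fallback: capitalization shape of the token.
    if stripped[0].isupper():
        return ":single_cap:" if len(stripped) == 1 else ":cap_word:"
    # No rule fired: keep the lowercased word itself as the feature.
    return word


def cluster_features(line: str) -> list[str]:
    """Replace every whitespace-separated token in a line by its cluster label."""
    return [cluster_of(tok) for tok in line.split()]


header = "Hui Han, Pennsylvania State University, hhan@cse.psu.edu, Mar 2005"
print(cluster_features(header))
```

In this toy setup, many distinct surface words collapse into a handful of labels, which is the dimensionality reduction the abstract refers to; a classifier would then be trained on the labels rather than on the raw words.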

Original language: English (US)
Title of host publication: Applied Computing 2005 - Proceedings of the 20th Annual ACM Symposium on Applied Computing
Number of pages: 5
State: Published - 2005
Event: 20th Annual ACM Symposium on Applied Computing - Santa Fe, NM, United States
Duration: Mar 13 2005 to Mar 17 2005


Other: 20th Annual ACM Symposium on Applied Computing
Country/Territory: United States
City: Santa Fe, NM

All Science Journal Classification (ASJC) codes

  • Software

