Boosting the feature space: Text classification for unstructured data on the web

Song Yang, Zhou Ding, Huang Jian, Isaac G. Councill, Zha Hongyuan, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

10 Scopus citations

Abstract

The issue of seeking efficient and effective methods for classifying unstructured text in large document corpora has received much attention in recent years. Traditional document representations such as bag-of-words encode documents as feature vectors, which usually leads to sparse, high-dimensional feature spaces and makes high classification accuracy hard to achieve. This paper addresses the problem of classifying unstructured documents on the Web. A classification approach is proposed that combines traditional feature reduction techniques with a collaborative filtering method for augmenting document feature spaces. The method produces feature spaces with an order of magnitude fewer features than a baseline bag-of-words feature selection method. Experiments on both real-world data and a benchmark corpus indicate that our approach improves classification accuracy over traditional methods for both Support Vector Machine and AdaBoost classifiers.

Original language: English (US)
Title of host publication: Proceedings - Sixth International Conference on Data Mining, ICDM 2006
Pages: 1064-1069
Number of pages: 6
State: Published - 2006
Event: 6th International Conference on Data Mining, ICDM 2006 - Hong Kong, China
Duration: Dec 18 2006 to Dec 22 2006

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print): 1550-4786

Other

Other: 6th International Conference on Data Mining, ICDM 2006
Country/Territory: China
City: Hong Kong
Period: 12/18/06 to 12/22/06

All Science Journal Classification (ASJC) codes

  • Engineering(all)
