Cluster vector space model: A dimensionality reduction method for text classifications based on the vector quantization

Juxihong Julaiti, Soundar Kumara

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The Word Vector Space Model (WVSM), widely used in text analytics, provides an elegant way to enable computers to understand natural languages by converting words into numbers. With impressive results on syntactic and semantic tasks in natural language processing (NLP), WVSM is widely used in search engines, information retrieval systems, and text classification. However, because the basic elements of the feature space are words, the model has high dimensionality and is at risk of overfitting. A prediction system that combines multiple models can therefore take a long time to train under WVSM. In this paper, a Cluster Vector Space Model (CVSM) based on vector quantization is used for dimensionality reduction. This method maps a given word vector space into a much smaller cluster vector space. The results indicate that the CVSM, with less than 1% of the original feature size, works at least as well as the WVSM in binary classification problems; in multi-class classification problems, also with less than 1% of the original feature size, the CVSM improves the performance of a decision tree model.
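
The mapping from a word vector space to a cluster vector space can be illustrated with a minimal sketch. The abstract does not say which quantizer the authors use, so plain k-means is assumed here, and the two-dimensional word vectors, the function names (`build_codebook`, `cluster_features`), and the choice of two clusters are all hypothetical. Each word is assigned to its nearest centroid, and a document is then described by how many of its words fall into each cluster, so the feature count drops from the vocabulary size to the number of clusters.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vec, centroids[i]))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means (an assumed quantizer, not taken from the paper)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for i, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


def build_codebook(word_vectors, k, seed=0):
    """Map every word to the id of its nearest cluster centroid."""
    words = sorted(word_vectors)
    centroids = kmeans([word_vectors[w] for w in words], k, seed=seed)
    return {w: nearest(word_vectors[w], centroids) for w in words}


def cluster_features(tokens, codebook, k):
    """Represent a document as counts over k clusters instead of words."""
    counts = Counter(codebook[t] for t in tokens if t in codebook)
    return [counts.get(i, 0) for i in range(k)]


# Hypothetical toy word vectors: six words, two obvious groups.
WORD_VECTORS = {
    "good": [1.0, 0.9], "great": [0.9, 1.0], "fine": [1.1, 1.0],
    "bad": [-1.0, -0.9], "awful": [-0.9, -1.0], "poor": [-1.1, -1.0],
}
K = 2
CODEBOOK = build_codebook(WORD_VECTORS, K)

# Six word features shrink to two cluster features; unknown words are skipped.
features = cluster_features("good great bad unknown".split(), CODEBOOK, K)
print(CODEBOOK)
print(features)
```

In this sketch the resulting `features` list could be passed to any classifier in place of a word-count vector. The ratio of clusters to vocabulary size is the knob that controls how aggressive the reduction is.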

Original language: English (US)
Title of host publication: 67th Annual Conference and Expo of the Institute of Industrial Engineers 2017
Editors: Harriet B. Nembhard, Katie Coperich, Elizabeth Cudney
Publisher: Institute of Industrial Engineers
Pages: 428-433
Number of pages: 6
ISBN (Electronic): 9780983762461
State: Published - Jan 1 2017
Event: 67th Annual Conference and Expo of the Institute of Industrial Engineers 2017 - Pittsburgh, United States
Duration: May 20 2017 to May 23 2017

Publication series

Name: 67th Annual Conference and Expo of the Institute of Industrial Engineers 2017

Other

Other: 67th Annual Conference and Expo of the Institute of Industrial Engineers 2017
Country/Territory: United States
City: Pittsburgh
Period: 5/20/17 to 5/23/17

All Science Journal Classification (ASJC) codes

  • Industrial and Manufacturing Engineering
