A lexicon-corpus-based unsupervised Chinese word segmentation approach

Lu Pengyu, Pu Jingchuan, Du Mingming, Lou Xiaojuan, Jin Lijun

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

This paper presents a Lexicon-Corpus-based Unsupervised (LCU) approach to Chinese word segmentation. It combines the advantages of the lexicon-based and corpus-based approaches to identify out-of-vocabulary (OOV) words while keeping the segmentation of the words that actually occur in the texts consistent. A Forward Maximum Fixed-count Segmentation (FMFS) algorithm is developed to identify phrases in the texts as a first step. The detailed rules of LCU and the experimental results are also presented. Compared with a purely lexicon-based or purely corpus-based approach, LCU markedly improves Chinese word segmentation, especially in identifying n-char words. Two evaluation indexes are also proposed to describe the effectiveness of phrase extraction: the segmentation rate (S) and the segmentation consistency degree (D).
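
For orientation, the lexicon-based side of the comparison can be sketched with classic forward maximum matching. This is not the paper's FMFS algorithm or the LCU rules, which the abstract does not specify; it is only the well-known greedy baseline, and the function name, the `max_len` parameter, and the toy lexicon are assumptions made for illustration.

```python
def forward_maximum_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation with a lexicon.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, chosen only to show a typical failure case.
lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_maximum_match("研究生命起源", lexicon))
# → ['研究生', '命', '起源']
```

The greedy match picks "研究生" and leaves "命" stranded, where "研究 / 生命 / 起源" would be the intended reading. Errors of this kind, together with words missing from the lexicon altogether, are the sort of weakness that motivates combining a lexicon with corpus evidence.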

Original language: English (US)
Pages (from-to): 263-282
Number of pages: 20
Journal: International Journal on Smart Sensing and Intelligent Systems
Volume: 7
Issue number: 1
State: Published - 2014

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Electrical and Electronic Engineering
