Abstract
This paper presents a Lexicon-Corpus-based Unsupervised (LCU) Chinese word segmentation approach to improve the Chinese word segmentation result. Specifically, it combines advantages of lexicon-based approach and Corpus-based approach to identify out-of-vocabulary (OOV) words and guarantee segmentation consistency of the actual words in texts as well. In addition, a Forward Maximum Fixed-count Segmentation (FMFS) algorithm is developed to identify phrases in texts at first. Detailed rules and experiment results of LCU are presented, too. Compared with lexicon-based approach or corpus-based approach, LCU approach makes a great improvement in Chinese word segmentation, especially for identifying n-char words. And also, two evaluation indexes are proposed to describe the effectiveness in extracting phrases, one is segmentation rate (S), and the other is segmentation consistency degree (D).
Original language | English (US) |
---|---|
Pages (from-to) | 263-282 |
Number of pages | 20 |
Journal | International Journal on Smart Sensing and Intelligent Systems |
Volume | 7 |
Issue number | 1 |
DOIs | |
State | Published - 2014 |
All Science Journal Classification (ASJC) codes
- Control and Systems Engineering
- Electrical and Electronic Engineering