Abstract
This paper presents a Lexicon-Corpus-based Unsupervised (LCU) approach to Chinese word segmentation. It combines the advantages of the lexicon-based and corpus-based approaches to identify out-of-vocabulary (OOV) words while keeping the segmentation of actual words in texts consistent. A Forward Maximum Fixed-count Segmentation (FMFS) algorithm is developed to first identify phrases in texts. Detailed rules and experimental results for LCU are also presented. Compared with either the lexicon-based or the corpus-based approach alone, LCU substantially improves Chinese word segmentation, especially in identifying n-char words. Two evaluation indexes are also proposed to describe the effectiveness of phrase extraction: segmentation rate (S) and segmentation consistency degree (D).
| Original language | English (US) |
|---|---|
| Pages (from-to) | 263-282 |
| Number of pages | 20 |
| Journal | International Journal on Smart Sensing and Intelligent Systems |
| Volume | 7 |
| Issue number | 1 |
| State | Published - 2014 |
All Science Journal Classification (ASJC) codes
- Control and Systems Engineering
- Electrical and Electronic Engineering