Abstract
The emergence of XML has led to the development of applications centered on XML documents. These documents often contain tagged paragraphs of natural-language text. Extracting relevant data from such paragraphs is complicated by the irregular structure hidden in the text and requires powerful extraction patterns. Although a wide range of wrappers has been designed, mainly to process HTML pages, these wrappers cannot deal with semi-structured data and still do not take natural language processing into account. In this paper, we present a specification language that lets casual users write expressive yet simple extraction patterns in a regular-expression style. Moreover, we introduce the Xtractor, which relies on linguistic parsing of paragraphs and applies technical and natural-language dictionaries.
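To make the idea of regular-expression-style extraction from tagged paragraphs concrete, here is a minimal sketch in Python. It is not the specification language or the Xtractor described in the abstract: the sample document, the `paragraph` tag, the `UNITS` tuple standing in for a technical dictionary, and the `extract_measurements` function are all assumptions made for illustration, and no linguistic parsing is performed.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag names and text are invented for illustration.
SAMPLE_XML = """<report>
  <paragraph>The pump operates at 50 Hz and weighs 12.5 kg.</paragraph>
  <paragraph>No measurements are given here.</paragraph>
</report>"""

# A tiny stand-in for a technical dictionary (assumption, not from the paper).
UNITS = ("Hz", "kg")

# Plain regular expression playing the role of an extraction pattern:
# a number, optional whitespace, then one of the known units.
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(" + "|".join(UNITS) + r")\b")


def extract_measurements(xml_text, tag="paragraph"):
    """Return (value, unit) pairs found in the text of every <tag> element."""
    root = ET.fromstring(xml_text)
    results = []
    for element in root.iter(tag):
        text = element.text or ""
        for value, unit in PATTERN.findall(text):
            results.append((float(value), unit))
    return results


print(extract_measurements(SAMPLE_XML))
```

A plain regular expression like this one only matches surface strings. The abstract's point is that irregular natural-language paragraphs call for more than that, which is why the Xtractor is described as combining patterns with linguistic parsing and dictionaries.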
Original language | English (US) |
---|---|
Pages | 150-155 |
Number of pages | 6 |
State | Published - Dec 1 2005 |
Event | 1st IEEE International Conference on Signal-Image Technology and Internet-Based Systems, SITIS 2005 - Yaounde, Cameroon; Duration: Nov 27 2005 → Dec 1 2005 |
Conference
Conference | 1st IEEE International Conference on Signal-Image Technology and Internet-Based Systems, SITIS 2005 |
---|---|
Country/Territory | Cameroon |
City | Yaounde |
Period | 11/27/05 → 12/1/05 |
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Computer Vision and Pattern Recognition