Xtractor: A light wrapper for XML paragraph-centric documents

Research output: Contribution to conference › Paper › peer-review

1 Scopus citation

Abstract

The emergence of XML has led to the development of applications built around XML documents. These documents often contain tagged paragraphs of natural language text. Extracting relevant data from such paragraphs is difficult because their structure is irregular and hidden in the text, so powerful extraction patterns are required. Although a wide range of wrappers has been designed, mainly to process HTML pages, these wrappers cannot deal with semi-structured data and still do not take natural language processing into account. In this paper, we present a specification language that lets casual users write expressive yet simple extraction patterns in a regular-expression style. We also introduce Xtractor, which relies on linguistic parsing of paragraphs and applies technical and natural language dictionaries.
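
The abstract does not show the specification language itself, so the following is only a minimal sketch of the general idea: applying a regular-expression-style pattern, driven by a small technical dictionary, to the text of tagged paragraphs. It uses Python's standard `re` and `xml.etree.ElementTree` modules as stand-ins. The sample document, the `para` element name, the `UNIT_DICTIONARY` contents, and the `build_pattern` and `extract` helpers are all assumptions made for illustration, not Xtractor's actual syntax or API, and the sketch performs no linguistic parsing.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are assumptions.
SAMPLE = """<doc>
  <para id="p1">The pump operates at 40 bar and 1500 rpm.</para>
  <para id="p2">No measurements are reported in this paragraph.</para>
  <para id="p3">Maximum pressure is 55 bar.</para>
</doc>"""

# Hypothetical technical dictionary: unit term -> quantity it measures.
UNIT_DICTIONARY = {"bar": "pressure", "rpm": "rotation_speed"}


def build_pattern(units):
    """Compile a pattern matching a number followed by a dictionary unit."""
    alternatives = "|".join(re.escape(u) for u in sorted(units))
    return re.compile(r"(\d+(?:\.\d+)?)\s*(" + alternatives + r")\b")


def extract(xml_text, tag="para"):
    """Return one record per (value, unit) match found in each paragraph."""
    root = ET.fromstring(xml_text)
    pattern = build_pattern(UNIT_DICTIONARY)
    records = []
    for para in root.iter(tag):
        # Join all text nodes so inline markup does not split the sentence.
        text = "".join(para.itertext())
        for value, unit in pattern.findall(text):
            records.append({
                "paragraph": para.get("id"),
                "quantity": UNIT_DICTIONARY[unit],
                "value": float(value),
                "unit": unit,
            })
    return records


if __name__ == "__main__":
    for record in extract(SAMPLE):
        print(record)
```

A plain regular expression like this only captures surface patterns. The approach described in the abstract additionally relies on linguistic parsing of the paragraph, which this sketch leaves out.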

Original language: English (US)
Pages: 150-155
Number of pages: 6
State: Published - Dec 1 2005
Event: 1st IEEE International Conference on Signal-Image Technology and Internet-Based Systems, SITIS 2005 - Yaounde, Cameroon
Duration: Nov 27 2005 to Dec 1 2005

Conference

Conference: 1st IEEE International Conference on Signal-Image Technology and Internet-Based Systems, SITIS 2005
Country/Territory: Cameroon
City: Yaounde
Period: 11/27/05 to 12/1/05

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
