Intelligent parsing of scanned volumes for web based archives

Xiaonan Lu, James Z. Wang, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The proliferation of digital libraries and the large number of existing documents raise important issues for the efficient handling of documents. Printed text in documents needs to be converted into digital format, and semantic information needs to be parsed and managed for effective retrieval. In this work, we attempt to solve the problems faced by current web-based archives, where large-scale repositories of electronic resources have been built from scanned volumes. Specifically, we focus on the scientific domain and target scanned volumes of scientific publications. Our goal is to automate the semantic processing of scanned volumes, an important and challenging step towards efficient retrieval of content within scanned volumes. We tackle the problem by designing a machine learning-based method to extract multi-level metadata about the content of scanned volumes. We combine image and text information within scanned volumes for intelligent parsing. We developed a system and tested it with real-world data from the Internet Archive, and the experimental evaluation demonstrated good results.

Original language: English (US)
Title of host publication: ICSC 2007 International Conference on Semantic Computing
Pages: 559-566
Number of pages: 8
State: Published - 2007
Event: ICSC 2007 International Conference on Semantic Computing - Irvine, CA, United States
Duration: Sep 17 2007 to Sep 19 2007

Publication series

Name: ICSC 2007 International Conference on Semantic Computing

Other

Other: ICSC 2007 International Conference on Semantic Computing
Country/Territory: United States
City: Irvine, CA
Period: 9/17/07 to 9/19/07

All Science Journal Classification (ASJC) codes

  • General Computer Science
  • Computer Science Applications
