TY - GEN
T1 - Intelligent parsing of scanned volumes for web based archives
AU - Lu, Xiaonan
AU - Wang, James Z.
AU - Giles, C. Lee
N1 - Copyright 2008 Elsevier B.V., All rights reserved.
PY - 2007
Y1 - 2007
AB - The proliferation of digital libraries and the large number of existing documents raise important issues in the efficient handling of documents. Printed texts in documents need to be converted into digital format, and semantic information needs to be parsed and managed for effective retrieval. In this work, we attempt to solve the problems faced by current web-based archives, where large-scale repositories of electronic resources have been built from scanned volumes. Specifically, we focus on the scientific domain and target scanned volumes of scientific publications. Our goal is to automate the semantic processing of scanned volumes, an important and challenging step towards efficient retrieval of content within scanned volumes. We tackle the problem by designing a machine learning-based method to extract multi-level metadata about the content of scanned volumes. We combine image and text information within scanned volumes for intelligent parsing. We developed a system and tested it with real-world data from the Internet Archive, and the experimental evaluation has demonstrated good results.
UR - http://www.scopus.com/inward/record.url?scp=47749146503&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=47749146503&partnerID=8YFLogxK
U2 - 10.1109/ICSC.2007.47
DO - 10.1109/ICSC.2007.47
M3 - Conference contribution
AN - SCOPUS:47749146503
SN - 0769529976
SN - 9780769529974
T3 - ICSC 2007 International Conference on Semantic Computing
SP - 559
EP - 566
BT - ICSC 2007 International Conference on Semantic Computing
T2 - ICSC 2007 International Conference on Semantic Computing
Y2 - 17 September 2007 through 19 September 2007
ER -