TY - JOUR
T1 - Automated analysis of images in documents for intelligent document search
AU - Lu, Xiaonan
AU - Kataria, Saurabh
AU - Brouwer, William J.
AU - Wang, James Z.
AU - Mitra, Prasenjit
AU - Giles, C. Lee
N1 - Copyright 2009 Elsevier B.V., All rights reserved.
PY - 2009
Y1 - 2009
AB - Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure's legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.
UR - http://www.scopus.com/inward/record.url?scp=67650417928&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=67650417928&partnerID=8YFLogxK
U2 - 10.1007/s10032-009-0081-0
DO - 10.1007/s10032-009-0081-0
M3 - Article
AN - SCOPUS:67650417928
SN - 1433-2833
VL - 12
SP - 65
EP - 81
JO - International Journal on Document Analysis and Recognition
JF - International Journal on Document Analysis and Recognition
IS - 2
ER -