TY - GEN
T1 - Identifying table boundaries in digital documents via sparse line detection
AU - Liu, Ying
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2008
Y1 - 2008
N2 - Most prior work on information extraction has focused on extracting information from the text of digital documents. Often, however, the most important information reported in an article is presented in tabular form. If the data reported in tables can be extracted and stored in a database, it can be queried and joined with other data using database management systems. To prepare the data source for table search, accurate table boundary detection is crucial for the later table structure decomposition. Table boundary detection and content extraction are challenging problems because tabular formats are not standardized across documents. In this paper, we propose a simple but effective preprocessing method that improves table boundary detection performance by considering the sparse-line property of table rows. Our method reduces the table boundary detection problem to a sparse line analysis problem with much less noise. We design eight line label types and apply two machine learning techniques, Conditional Random Fields (CRF) and Support Vector Machines (SVM), to table boundary detection. The experimental results not only compare the performance of the machine learning methods with that of the heuristic-based method, but also demonstrate the effectiveness of sparse line analysis in table boundary detection.
UR - http://www.scopus.com/inward/record.url?scp=70349260831&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70349260831&partnerID=8YFLogxK
U2 - 10.1145/1458082.1458255
DO - 10.1145/1458082.1458255
M3 - Conference contribution
AN - SCOPUS:70349260831
SN - 9781595939913
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1311
EP - 1320
BT - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
T2 - 17th ACM Conference on Information and Knowledge Management, CIKM'08
Y2 - 26 October 2008 through 30 October 2008
ER -