TY - GEN
T1 - Table header detection and classification
AU - Fang, Jing
AU - Mitra, Prasenjit
AU - Tang, Zhi
AU - Giles, C. Lee
PY - 2012
Y1 - 2012
N2 - In digital libraries, a table, as a specific document component as well as a condensed way to present structured and relational data, contains rich information and is often the only source of that information. In order to explore, retrieve, and reuse that data, tables should be identified and the data extracted. Table recognition is an old field of research. However, due to the diversity of table styles, the results are still far from satisfactory, and no single algorithm performs well on all types of tables. In this paper, we randomly take samples from CiteSeer to investigate diverse table styles for automatic table extraction. We find that table headers are one of the main characteristics of complex table styles. We identify a set of features that can be used to separate headers from tabular data and build a classifier to detect table headers. Our empirical evaluation on PDF documents shows that a Random Forest classifier achieves an accuracy of 92%.
AB - In digital libraries, a table, as a specific document component as well as a condensed way to present structured and relational data, contains rich information and is often the only source of that information. In order to explore, retrieve, and reuse that data, tables should be identified and the data extracted. Table recognition is an old field of research. However, due to the diversity of table styles, the results are still far from satisfactory, and no single algorithm performs well on all types of tables. In this paper, we randomly take samples from CiteSeer to investigate diverse table styles for automatic table extraction. We find that table headers are one of the main characteristics of complex table styles. We identify a set of features that can be used to separate headers from tabular data and build a classifier to detect table headers. Our empirical evaluation on PDF documents shows that a Random Forest classifier achieves an accuracy of 92%.
UR - http://www.scopus.com/inward/record.url?scp=84868278666&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84868278666&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84868278666
SN - 9781577355687
T3 - Proceedings of the National Conference on Artificial Intelligence
SP - 599
EP - 605
BT - AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference
T2 - 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference, AAAI-12 / IAAI-12
Y2 - 22 July 2012 through 26 July 2012
ER -