TY - GEN
T1 - Automatic extraction of figures from scholarly documents
AU - Choudhury, Sagnik Ray
AU - Mitra, Prasenjit
AU - Giles, Clyde Lee
N1 - Publisher Copyright: © 2015 ACM.
PY - 2015/9/8
Y1 - 2015/9/8
AB - Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple "figures" such as plots, flow charts and other images, which are generated manually to symbolically represent and visually illustrate important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large-scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of building a heuristic-independent, trainable model for such an extraction task and of extracting figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.
UR - http://www.scopus.com/inward/record.url?scp=84959235832&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959235832&partnerID=8YFLogxK
U2 - 10.1145/2682571.2797085
DO - 10.1145/2682571.2797085
M3 - Conference contribution
AN - SCOPUS:84959235832
T3 - DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering
SP - 47
EP - 50
BT - DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering
PB - Association for Computing Machinery, Inc
T2 - ACM Symposium on Document Engineering, DocEng 2015
Y2 - 8 September 2015 through 11 September 2015
ER -