TY - JOUR
T1 - ACL-Fig
T2 - 2023 Workshop on Scientific Document Understanding, SDU 2023
AU - Karishma, Zeba
AU - Rohatgi, Shaurya
AU - Puranik, Kavya Shrinivas
AU - Wu, Jian
AU - Giles, C. Lee
N1 - Publisher Copyright: © 2022 Copyright for this paper by its authors.
PY - 2023
Y1 - 2023
N2 - Most existing large-scale academic search engines are built to retrieve text-based information. However, there are no large-scale retrieval services for scientific figures and tables. One challenge for such services is understanding scientific figures’ semantics, such as their types and purposes. A key obstacle is the need for datasets containing annotated scientific figures and tables, which can then be used for classification, question-answering, and auto-captioning. Here, we develop a pipeline that extracts figures and tables from the scientific literature and a deep-learning-based framework that classifies scientific figures using visual features. Using this pipeline, we built the first large-scale automatically annotated corpus, ACL-FIG, consisting of 112,052 scientific figures extracted from ≈ 56K research papers in the ACL Anthology. The ACL-FIG-PILOT dataset contains 1,671 manually labeled scientific figures belonging to 19 categories. The dataset is accessible at https://huggingface.co/datasets/citeseerx/ACL-fig under a CC BY-NC license.
AB - Most existing large-scale academic search engines are built to retrieve text-based information. However, there are no large-scale retrieval services for scientific figures and tables. One challenge for such services is understanding scientific figures’ semantics, such as their types and purposes. A key obstacle is the need for datasets containing annotated scientific figures and tables, which can then be used for classification, question-answering, and auto-captioning. Here, we develop a pipeline that extracts figures and tables from the scientific literature and a deep-learning-based framework that classifies scientific figures using visual features. Using this pipeline, we built the first large-scale automatically annotated corpus, ACL-FIG, consisting of 112,052 scientific figures extracted from ≈ 56K research papers in the ACL Anthology. The ACL-FIG-PILOT dataset contains 1,671 manually labeled scientific figures belonging to 19 categories. The dataset is accessible at https://huggingface.co/datasets/citeseerx/ACL-fig under a CC BY-NC license.
UR - http://www.scopus.com/inward/record.url?scp=85189167163&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85189167163&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85189167163
SN - 1613-0073
VL - 3656
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
Y2 - 14 February 2023
ER -