TY - GEN
T1 - GPT-4 as an Effective Zero-Shot Evaluator for Scientific Figure Captions
AU - Hsu, Ting Yao
AU - Huang, Chieh Yang
AU - Rossi, Ryan
AU - Kim, Sungchul
AU - Giles, Clyde Lee
AU - Huang, Ting Hao Kenneth
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - There is growing interest in systems that generate captions for scientific figures. However, assessing these systems' output poses a significant challenge. Human evaluation requires academic expertise and is costly, while automatic evaluation depends on often low-quality author-written captions. This paper investigates using large language models (LLMs) as a cost-effective, reference-free method for evaluating figure captions. We first constructed SCICAP-EVAL, a human evaluation dataset that contains human judgments for 3,600 scientific figure captions, both original and machine-made, for 600 arXiv figures. We then prompted LLMs like GPT-4 and GPT-3 to score (1-6) each caption based on its potential to aid reader understanding, given relevant context such as figure-mentioning paragraphs. Results show that GPT-4, used as a zero-shot evaluator, outperformed all other models and even surpassed assessments made by Computer Science and Informatics undergraduates, achieving a Kendall correlation score of 0.401 with Ph.D. students' rankings.
UR - http://www.scopus.com/inward/record.url?scp=85183289166&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85183289166&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85183289166
T3 - Findings of the Association for Computational Linguistics: EMNLP 2023
SP - 5464
EP - 5474
BT - Findings of the Association for Computational Linguistics
PB - Association for Computational Linguistics (ACL)
T2 - 2023 Findings of the Association for Computational Linguistics: EMNLP 2023
Y2 - 6 December 2023 through 10 December 2023
ER -