TY - GEN
T1 - Multi-LLM Collaborative Caption Generation in Scientific Documents
AU - Kim, Jaeyoung
AU - Lee, Jongho
AU - Choi, Hong Jun
AU - Hsu, Ting Yao
AU - Huang, Chieh Yang
AU - Kim, Sungchul
AU - Rossi, Ryan
AU - Yu, Tong
AU - Giles, Clyde Lee
AU - Huang, Ting Hao ‘Kenneth’
AU - Choi, Sungchul
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
AB - Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at https://github.com/teamreboott/MLBCAP
UR - https://www.scopus.com/pages/publications/105010816330
U2 - 10.1007/978-981-96-8912-5_6
DO - 10.1007/978-981-96-8912-5_6
M3 - Conference contribution
AN - SCOPUS:105010816330
SN - 9789819689118
T3 - Communications in Computer and Information Science
SP - 142
EP - 160
BT - AI for Research and Scalable, Efficient Systems - Second International Workshop, AI4Research 2025, and First International Workshop, SEAS 2025, Held in Conjunction with AAAI 2025, Proceedings
A2 - Wang, Qingyun
A2 - Yin, Wenpeng
A2 - Aich, Abhishek
A2 - Suh, Yumin
A2 - Peng, Kuan-Chuan
PB - Springer Science and Business Media Deutschland GmbH
T2 - 2nd AI4Research Workshop: Towards a Knowledge-Grounded Scientific Research Lifecycle, AI4Research 2025 and 1st Workshop on Scalable and Efficient Artificial Intelligence Systems, SEAS 2025, held in conjunction with the 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Y2 - 25 February 2025 through 4 March 2025
ER -