TY - CONF
T1 - FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability
T2 - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
AU - Xia, Congying
AU - Xing, Chen
AU - Du, Jiangshu
AU - Yang, Xinyi
AU - Feng, Yihao
AU - Xu, Ran
AU - Yin, Wenpeng
AU - Xiong, Caiming
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - This paper presents FOFO, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats, a crucial yet underexamined capability for their application as AI agents. Despite LLMs' advancements, existing benchmarks fail to assess their format-following proficiency adequately. FOFO fills this gap with a diverse range of real-world formats and instructions, developed through an AI-Human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs' format-following performance is independent of their content generation quality; and LLMs' format proficiency varies across different domains. These insights suggest the need for specialized tuning for format-following skills and highlight FOFO's role in guiding the selection of domain-specific AI agents. FOFO is released here.
UR - http://www.scopus.com/inward/record.url?scp=85204472806&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85204472806&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85204472806
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 680
EP - 699
BT - Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A2 - Ku, Lun-Wei
A2 - Martins, Andre F. T.
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
Y2 - 11 August 2024 through 16 August 2024
ER -