DOCMATH-EVAL: Evaluating Math Reasoning Capabilities of LLMs in Understanding Financial Documents

Yilun Zhao, Yitao Long, Hongjun Liu, Ryo Kamoi, Linyong Nan, Lyuhao Chen, Yixin Liu, Xiangru Tang, Rui Zhang, Arman Cohan

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DOCMATH-EVAL, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing financial documents containing both text and tables. We evaluate a wide spectrum of 27 LLMs, including those specialized in math, coding and finance, with Chain-of-Thought and Program-of-Thought prompting methods. We found that even the current best-performing system (i.e., GPT-4) still significantly lags behind human experts in solving complex numerical reasoning problems grounded in long contexts. We believe DOCMATH-EVAL can be used as a valuable benchmark to evaluate LLMs' capabilities to solve challenging numerical reasoning problems in expert domains.

Original languageEnglish (US)
Title of host publicationLong Papers
EditorsLun-Wei Ku, Andre F. T. Martins, Vivek Srikumar
PublisherAssociation for Computational Linguistics (ACL)
Pages16103-16120
Number of pages18
ISBN (Electronic)9798891760943
StatePublished - 2024
Event62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Bangkok, Thailand
Duration: Aug 11 2024Aug 16 2024

Publication series

NameProceedings of the Annual Meeting of the Association for Computational Linguistics
Volume1
ISSN (Print)0736-587X

Conference

Conference62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Country/TerritoryThailand
CityBangkok
Period8/11/248/16/24

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics

Fingerprint

Dive into the research topics of 'DOCMATH-EVAL: Evaluating Math Reasoning Capabilities of LLMs in Understanding Financial Documents'. Together they form a unique fingerprint.

Cite this