TY - JOUR
T1 - Automated Scoring of Creative Problem Solving With Large Language Models
T2 - A Comparison of Originality and Quality Ratings
AU - Luchini, Simone A.
AU - Maliakkal, Nadine T.
AU - DiStefano, Paul V.
AU - Laverghetta, Antonio
AU - Patterson, John D.
AU - Beaty, Roger E.
AU - Reiter-Palmon, Roni
N1 - Publisher Copyright:
© 2025 American Psychological Association. All rights reserved.
PY - 2025
Y1 - 2025
AB - Creative problem solving is a naturalistic form of creative thinking involving the generation of solutions that are not only original but also of high quality (i.e., plausible and effective). Past work has shown that large language models (LLMs) can predict human originality ratings of responses to creativity tests. We extend this work to creative problem solving, examining whether both originality and quality can be automatically scored for a naturalistic creativity task. We gathered data from 10 studies, amounting to 3,243 participants who completed different items of the creative problem-solving task (CPST). We then fine-tuned two open-source LLMs, RoBERTa and GPT-2, and few-shot prompted two separate LLMs, Claude and Llama, to predict human ratings of originality and quality on the CPST. We compared LLM performance to two other scoring methods: elaboration and semantic distance. We found that RoBERTa and GPT-2 models predict human ratings of solution quality (RoBERTa, r = .83; GPT-2, r = .83) better than solution originality (RoBERTa, r = .79; GPT-2, r = .80). Moreover, we found that both models outperformed elaboration and semantic distance and generalized to new CPST items not in their training set, with stronger predictions for quality than originality on the holdout-prompt set. Few-shot prompting was less effective than fine-tuning at predicting both originality (r = .66–.11) and quality (r = .62–.26). We show for the first time that naturalistic creativity tasks can be automatically scored for both originality and quality. Open access is provided to the models and training data.
UR - https://www.scopus.com/pages/publications/105001636126
DO - 10.1037/aca0000736
M3 - Article
AN - SCOPUS:105001636126
SN - 1931-3896
JO - Psychology of Aesthetics, Creativity, and the Arts
JF - Psychology of Aesthetics, Creativity, and the Arts
ER -