
Evaluating the consistency between human raters and three AI systems on the scoring of argumentative essays

Research output: Contribution to journal › Article › peer-review

Abstract

The rapid advancement of AI-powered Automated Essay Scoring (AES) systems has led to growing interest in their potential to supplement or even replace human scoring in educational contexts. However, questions remain regarding the extent to which these systems align with human judgment. In this study, we investigated the scoring consistency between human raters and three generative AI systems (ChatGPT, GPT-3.5; Copilot, GPT-4o; and Gemini, Gemini 2.5 Pro) when they used the 6 + 1 Trait Writing Model to assess argumentative essays written by Korean 9th-grade students. Ninety-two student essays were scored independently by trained human raters and by the AI systems, the latter using multi-shot prompting techniques. Results revealed that the AI systems consistently awarded higher scores with lower variability. Statistical analyses (MANOVA, intraclass correlation coefficients, and weighted kappa) showed strong internal consistency within both the human raters and the AI systems, but only moderate agreement between the two groups. No single AI system consistently aligned most closely with human scores across all traits, indicating variability in evaluation logic among AI systems. These findings highlight both the promise and limitations of AI-based AES, suggesting the need for hybrid human-AI scoring models to leverage the strengths of each.
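
As an illustration of the weighted kappa statistic named in the abstract, the following is a minimal Python sketch of Cohen's weighted kappa for two raters scoring the same essays on an ordinal scale. It is not the authors' analysis code. The function name `weighted_kappa`, the 1 to 5 score range, the choice of quadratic weights, and the sample scores are all assumptions made for illustration; the abstract does not state the scale or weighting scheme used, and the scores below are not data from the study.

```python
import math


def weighted_kappa(rater_a, rater_b, min_score, max_score, weights="quadratic"):
    """Cohen's weighted kappa for two raters on an ordinal integer scale.

    Hypothetical helper for illustration only. `weights` may be
    "quadratic" or "linear".
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("rater_a and rater_b must be non-empty and equal in length")
    if weights not in ("quadratic", "linear"):
        raise ValueError("weights must be 'quadratic' or 'linear'")
    k = max_score - min_score + 1  # number of score categories
    if k < 2:
        raise ValueError("the scale needs at least two score categories")

    n = len(rater_a)
    # Observed contingency table: rows are rater A, columns are rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal score counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            distance = abs(i - j) / (k - 1)  # normalized disagreement, 0 to 1
            w = distance ** 2 if weights == "quadratic" else distance
            expected = hist_a[i] * hist_b[j] / n  # count expected by chance
            weighted_observed += w * observed[i][j]
            weighted_expected += w * expected

    if weighted_expected == 0:
        # Both raters gave the same single score to every essay: undefined.
        return math.nan
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on an assumed 1 to 5 scale (not data from the study).
human_scores = [3, 4, 2, 5, 3, 4, 1, 3]
ai_scores = [4, 4, 3, 5, 4, 4, 2, 4]
kappa = weighted_kappa(human_scores, ai_scores, min_score=1, max_score=5)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Quadratic weights penalize large score gaps more heavily than adjacent-score disagreements, which is one common choice for ordinal rubric scores.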

Original language: English (US)
Article number: 101031
Journal: Assessing Writing
Volume: 68
State: Published - Apr 2026

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Education
  • Linguistics and Language
