Abstract
The rapid advancement of AI-powered Automated Essay Scoring (AES) systems has led to growing interest in their potential to supplement or even replace human scoring in educational contexts. However, questions remain regarding the extent to which these systems align with human judgment. In this study, we investigated the scoring consistency between human raters and three generative AI systems (ChatGPT, GPT-3.5; Copilot, GPT-4o; and Gemini, Gemini 2.5 Pro) when they used the 6 + 1 Trait Writing Model to assess argumentative essays written by Korean 9th-grade students. Ninety-two student essays were independently scored by trained human raters and AI systems using multi-shot prompting techniques. Results revealed that the AI systems consistently awarded higher scores, with lower variability, than the human raters. Statistical analyses (MANOVA, intraclass correlation coefficients, and weighted kappa) showed strong internal consistency within both the human raters and the AI systems, but only moderate agreement between them. No single AI system consistently aligned most closely with human scores across all traits, indicating variability in evaluation logic among AI systems. These findings highlight both the promise and limitations of AI-based AES, suggesting the need for hybrid human-AI scoring models to leverage the strengths of each.
| Original language | English (US) |
|---|---|
| Article number | 101031 |
| Journal | Assessing Writing |
| Volume | 68 |
| State | Published - Apr 2026 |
All Science Journal Classification (ASJC) codes
- Language and Linguistics
- Education
- Linguistics and Language
Article title: Evaluating the consistency between human raters and three AI systems on the scoring of argumentative essays