TY - JOUR
T1 - Aligning linguistic complexity with the difficulty of English texts for L2 learners based on CEFR levels
AU - Zhang, Xiaopeng
AU - Lu, Xiaofei
N1 - Publisher Copyright: © The Author(s), 2025. Published by Cambridge University Press.
PY - 2025
Y1 - 2025
AB - Selecting appropriate texts for second language (L2) learners is essential for effective education. However, current text difficulty models often inadequately classify materials for L2 learners by proficiency levels. This study addresses this deficiency by employing the Common European Framework of Reference for Languages (CEFR) as its foundational framework. A cohort of expert English-L2 educators classified 1,181 texts from the CommonLit Ease of Readability corpus into CEFR levels. A random forest model was then trained using 24 linguistic complexity features to predict the CEFR levels of English texts for L2 learners. The model achieved 62.6% exact-level accuracy across the six granular CEFR levels and 82.6% across the three overarching levels, outperforming a baseline model based on three existing readability formulas. Additionally, it identified shared and unique linguistic features across different CEFR levels, highlighting the necessity to adjust text classification models to accommodate the distinct linguistic profiles of low- and high-proficiency readers.
UR - https://www.scopus.com/pages/publications/105014938528
U2 - 10.1017/S0272263125101125
DO - 10.1017/S0272263125101125
M3 - Article
AN - SCOPUS:105014938528
SN - 0272-2631
JO - Studies in Second Language Acquisition
JF - Studies in Second Language Acquisition
ER -