LLMs Do Not Grade Essays Like Humans

📰 arXiv cs.AI

LLMs do not grade essays like humans: agreement between LLM-generated scores and human grades is weak

Level: Intermediate · Published 26 Mar 2026
Action Steps
  1. Evaluate LLM performance on automated essay scoring (AES) using datasets of human-graded essays
  2. Compare LLM-generated scores against human grades to measure agreement (a minimal sketch follows this list)
  3. Analyze the grading behavior of different LLMs, such as GPT and Llama, in an out-of-the-box setting (no fine-tuning or task-specific examples)
  4. Investigate the reasons for the weak LLM-human agreement, such as differing grading criteria or bias in the models
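To make steps 1-3 concrete, here is a minimal sketch, not the paper's code: a hypothetical out-of-the-box grading prompt plus the quadratic weighted kappa (QWK) check commonly used to measure rater agreement in AES benchmarks such as ASAP. The prompt wording, the 1-6 scale, and the toy scores are illustrative assumptions, not values from the paper.

```python
from sklearn.metrics import cohen_kappa_score

def build_grading_prompt(essay: str) -> str:
    # Out-of-the-box setting: rubric lives in the prompt; no fine-tuning
    # or task-specific examples. Send this to your model of choice
    # (GPT, Llama, ...) and parse an integer score from the reply.
    return (
        "You are an essay grader. Score the essay below from 1 to 6 "
        "according to the prompt's rubric. Reply with the integer only.\n\n"
        f"Essay:\n{essay}"
    )

# Toy stand-in scores on a 1-6 scale (NOT the paper's data), chosen to
# mimic a common failure mode: LLM scores compressed toward the middle.
human_scores = [4, 5, 3, 6, 2, 4, 5, 3]
llm_scores   = [5, 5, 4, 4, 4, 5, 4, 4]

# QWK weights disagreements by squared distance, so being off by two
# points costs more than being off by one. Values near 1 mean strong
# agreement; values near 0 reflect the weak agreement the paper reports.
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.3f}")
```

Running this on a real human-graded dataset (rather than the toy lists) is what step 2 amounts to in practice; comparing QWK across models is step 3.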
Who Needs to Know This

AI engineers and educators benefit from understanding the limitations of LLMs in automated essay scoring: knowing where machine and human grades diverge informs the development of more accurate grading tools

Key Insight

💡 LLMs have limited ability to mimic human grading behavior, highlighting the need for further research and development in automated essay scoring

Share This
📝 LLMs struggle to grade essays like humans, with weak agreement between machine and human scores
Read full paper →