LLMs Do Not Grade Essays Like Humans

📰 arXiv cs.AI

LLMs do not grade essays like humans: agreement between LLM-generated scores and human grades is weak

Level: Intermediate · Published 26 Mar 2026
Action Steps
  1. Evaluate LLM performance on automated essay scoring (AES) using datasets of human-graded essays
  2. Compare LLM-generated scores against human grades to measure agreement (a minimal sketch follows this list)
  3. Analyze the grading behavior of different LLMs, such as GPT and Llama, in an out-of-the-box setting (no fine-tuning or task-specific examples)
  4. Investigate the reasons for the weak LLM-human agreement, such as differing grading criteria or bias in the models
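To make steps 1-3 concrete, here is a minimal sketch, not the paper's code: a hypothetical out-of-the-box grading prompt plus the quadratic weighted kappa (QWK) check commonly used to measure rater agreement in AES benchmarks such as ASAP. The prompt wording, the 1-6 scale, and the toy scores are illustrative assumptions, not values from the paper.

```python
from sklearn.metrics import cohen_kappa_score

def build_grading_prompt(essay: str) -> str:
    # Out-of-the-box setting: rubric lives in the prompt; no fine-tuning
    # or task-specific examples. Send this to your model of choice
    # (GPT, Llama, ...) and parse an integer score from the reply.
    return (
        "You are an essay grader. Score the essay below from 1 to 6 "
        "according to the prompt's rubric. Reply with the integer only.\n\n"
        f"Essay:\n{essay}"
    )

# Toy stand-in scores on a 1-6 scale (NOT the paper's data), chosen to
# mimic a common failure mode: LLM scores compressed toward the middle.
human_scores = [4, 5, 3, 6, 2, 4, 5, 3]
llm_scores   = [5, 5, 4, 4, 4, 5, 4, 4]

# QWK weights disagreements by squared distance, so being off by two
# points costs more than being off by one. Values near 1 mean strong
# agreement; values near 0 reflect the weak agreement the paper reports.
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.3f}")
```

Running this on a real human-graded dataset (rather than the toy lists) is what step 2 amounts to in practice; comparing QWK across models is step 3.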
Who Needs to Know This

AI engineers and educators benefit from understanding the limitations of LLMs in automated essay scoring: knowing where machine and human grades diverge informs the development of more accurate grading tools

Key Insight

💡 LLMs have limited ability to mimic human grading behavior, highlighting the need for further research and development in automated essay scoring

Share This
📝 LLMs struggle to grade essays like humans, with weak agreement between machine and human scores
Read full paper →