Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

📰 ArXiv cs.AI

arXiv:2605.13801v1 (cross-listed)

Abstract: As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their ann

Published 14 May 2026