An Interpretable and Scalable Framework for Evaluating Large Language Models
📰 ArXiv cs.AI
arXiv:2605.07046v1 (announce type: cross)
Abstract: Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item Response Theory (IRT) offers a principled framework for modeling latent model abilities and item characteristics, but conventional methods are computationally expensive and numerically unstable, limiting large-scale
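
To illustrate the contrast the abstract draws between average accuracy and IRT, here is a minimal sketch using the standard two-parameter logistic (2PL) IRT model. The abstract does not say which IRT variant the paper uses, so the 2PL form, the function names, and the item parameters below are all assumptions chosen for illustration, not the paper's method.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT (assumed form): probability of a correct response.

    theta: latent ability of the model being evaluated
    a:     item discrimination
    b:     item difficulty
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_accuracy(theta, items):
    """Average probability of a correct response over a set of (a, b) items."""
    return sum(p_correct(theta, a, b) for a, b in items) / len(items)

# Hypothetical item sets: same discrimination, different difficulty.
easy_items = [(1.0, -2.0), (1.0, -1.0)]
hard_items = [(1.0, 1.0), (1.0, 2.0)]

theta = 0.0  # one fixed ability
print(round(expected_accuracy(theta, easy_items), 3))
print(round(expected_accuracy(theta, hard_items), 3))
```

Under these assumed parameters, the same ability yields a much higher average accuracy on the easy set than on the hard set, which is the kind of item heterogeneity a single accuracy figure hides.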