MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models

📰 ArXiv cs.AI

arXiv:2604.19809v1

Abstract: We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self-knowledge to make better decisions. We evaluate 16 models from 8 labs on approximately 250,000 evaluation instances using five independent behavioral measurement channels. Core experiments are run across the full model roster; experiments with specialized infrastructure requirements report expl…

Published 23 Apr 2026