Interactive Benchmarks

📰 ArXiv cs.AI

arXiv:2603.04737v2

Abstract: Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning abil

Published 12 May 2026
