Interactive Benchmarks
📰 ArXiv cs.AI
arXiv:2603.04737v2 (Announce Type: replace)
Abstract: Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning abil