Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces
📰 ArXiv cs.AI
arXiv:2604.11996v1 Announce Type: cross Abstract: Should we trust Large Language Models (LLMs) that report high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy.
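The excerpt names the metric but not its formulation. As a minimal sketch of what a "filtered reasoning score" pipeline might look like, the code below keeps only the model's most-confident traces and averages a reasoning-quality judgment over them. The `Trace` structure, the confidence proxy (e.g. mean token log-probability), the `judge_reasoning` callable, and the `keep_fraction` cutoff are all assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch: score reasoning quality only on the traces the
# model is most confident about, rather than on raw answer correctness.
# All names and the confidence proxy below are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    steps: List[str]       # chain-of-thought steps produced by the model
    answer_correct: bool   # outcome-based signal (what benchmarks usually report)
    confidence: float      # e.g. mean token log-probability; higher = more confident


def filtered_reasoning_score(
    traces: List[Trace],
    judge_reasoning: Callable[[List[str]], float],  # returns quality in [0, 1]
    keep_fraction: float = 0.2,
) -> float:
    """Average reasoning quality over the top-confidence fraction of traces."""
    ranked = sorted(traces, key=lambda t: t.confidence, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    return sum(judge_reasoning(t.steps) for t in kept) / len(kept)
```

Decoupling the quality judgment from `answer_correct` is the point of such a metric: two models with identical benchmark accuracy can receive very different scores if one reaches correct answers through flawed reasoning steps.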