Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
📰 ArXiv cs.AI
This research asks whether apparent evaluation awareness in large language models is merely format sensitivity, finding that probes track benchmark-style formatting rather than evaluation context
Action Steps
- Design a controlled 2x2 dataset that crosses prompt format with evaluation context, so probe sensitivity to each factor can be measured independently
- Use diagnostic rewrites to isolate the effect of prompt format on probe-based signals
- Analyze the results to determine whether probes track context or surface structure
- Consider the implications of the findings for how large language models are evaluated
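The 2x2 probe comparison above can be sketched with synthetic data. Everything below — the label names, the 32-dimensional "activations", and the planted format direction — is an illustrative assumption, not the paper's actual setup; it shows only how a linear probe can score high on format while staying at chance on context:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2x2 design: prompt format x stated context (assumed labels).
formats = ["benchmark", "chat"]
contexts = ["evaluation", "deployment"]
prompts = {
    (f, c): f"[{c} framing] " + (
        "Q: ... A) ... B) ... Answer:" if f == "benchmark" else "User: ... Assistant:"
    )
    for f in formats for c in contexts
}

# Synthetic "activations": a format direction is strongly encoded, while
# context is not encoded at all -- fabricated data mimicking the reported
# pattern, not real model internals.
d = 32
fmt_dir = rng.normal(size=d)
n = 200
X, y_fmt, y_ctx = [], [], []
for i in range(n):
    f = i % 2          # format label alternates each example
    c = (i // 2) % 2   # context label varies independently of format
    x = rng.normal(size=d) + (1.5 if f else -1.5) * fmt_dir
    X.append(x); y_fmt.append(f); y_ctx.append(c)
X, y_fmt, y_ctx = np.array(X), np.array(y_fmt), np.array(y_ctx)

def probe_accuracy(X, y):
    """Closed-form least-squares linear probe, scored on a held-out half."""
    half = len(X) // 2
    w, *_ = np.linalg.lstsq(X[:half], 2 * y[:half] - 1, rcond=None)
    pred = (X[half:] @ w > 0).astype(int)
    return (pred == y[half:]).mean()

acc_fmt = probe_accuracy(X, y_fmt)  # high: the format direction is linearly decodable
acc_ctx = probe_accuracy(X, y_ctx)  # near chance: context was never encoded
```

Because format and context labels vary independently across the four cells, a gap between the two accuracies cleanly attributes the probe's signal to surface structure rather than evaluation context.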
Who Needs to Know This
ML researchers and AI engineers benefit from understanding the limitations of probe-based evidence when evaluating large language models, as these limitations inform the design of more robust evaluation methods
Key Insight
💡 Probe-based evidence for evaluation awareness in large language models may be limited by format sensitivity
Share This
🤖 Probes in large language models may just be tracking format, not context 📊
DeepCamp AI