Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

📰 ArXiv cs.AI

arXiv:2604.11061v1

Abstract: Mechanistic interpretability is often motivated by alignment auditing, a setting where a model's verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control for whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder. We introduce Pando, a model-organism benchmark that breaks this confounder.
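The control the abstract describes amounts to a differenced evaluation: first score how much of the hidden behavior black-box prompting alone recovers, then credit white-box tools only for recovery beyond that baseline. A minimal sketch of that scoring logic in Python; all names here (`recovery_rate`, `whitebox_gain`, the toy cases) are hypothetical illustrations, not the paper's actual benchmark interface:

```python
from collections.abc import Callable

def recovery_rate(predict: Callable[[str], str],
                  cases: list[tuple[str, str]]) -> float:
    """Fraction of hidden target behaviors a method recovers."""
    hits = sum(1 for prompt, target in cases if predict(prompt) == target)
    return hits / len(cases)

def whitebox_gain(blackbox: Callable[[str], str],
                  whitebox: Callable[[str], str],
                  cases: list[tuple[str, str]]) -> float:
    """Credit white-box tools only for recovery beyond the prompting baseline.

    If prompting alone already recovers a behavior, an apparent white-box
    'win' on it may reflect elicitation rather than internal signal.
    """
    return recovery_rate(whitebox, cases) - recovery_rate(blackbox, cases)

if __name__ == "__main__":
    # Toy data (hypothetical): prompting elicits behavior A but not B;
    # a white-box probe recovers both.
    cases = [("prompt-1", "behavior-A"), ("prompt-2", "behavior-B")]
    blackbox = lambda p: "behavior-A"
    whitebox = {"prompt-1": "behavior-A", "prompt-2": "behavior-B"}.get
    print(f"black-box baseline:       {recovery_rate(blackbox, cases):.2f}")  # 0.50
    print(f"white-box gain over base: {whitebox_gain(blackbox, whitebox, cases):.2f}")  # 0.50
```

Reporting the difference rather than the raw white-box score is what removes the confound: a method that merely re-elicits promptable behavior shows zero gain.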

Published 14 Apr 2026
Read full paper →