I Trained Probes to Catch AI Models Sandbagging

📰 Dev.to · Subhadip Mitra

TL;DR: I extracted "sandbagging directions" from three open-weight models and trained linear probes...

Published 28 Dec 2025
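
The TL;DR above is truncated, so the article's actual extraction method, models, and layers are not visible here. As a rough illustration of what "extracting a direction and training a linear probe" can look like, here is a minimal sketch under stated assumptions: the activations are synthetic Gaussian vectors rather than real hidden states, the direction is a simple difference of class means, and the probe is plain logistic regression trained by gradient descent. Every function name and parameter below is hypothetical and chosen for illustration only.

```python
import numpy as np


def make_synthetic_activations(n_per_class=200, dim=32, shift=4.0, seed=0):
    """Hypothetical stand-in for hidden states (assumption: real ones would
    be read from a model layer). Class 1 is class 0 shifted along one axis."""
    rng = np.random.default_rng(seed)
    true_dir = rng.normal(size=dim)
    true_dir /= np.linalg.norm(true_dir)
    normal_acts = rng.normal(size=(n_per_class, dim))
    sandbag_acts = rng.normal(size=(n_per_class, dim)) + shift * true_dir
    X = np.vstack([normal_acts, sandbag_acts])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y


def split_train_test(X, y, test_frac=0.2, seed=0):
    """Shuffle the rows and hold out a fraction for evaluation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]


def difference_of_means_direction(X, y):
    """Unit vector pointing from the class-0 mean to the class-1 mean."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return d / np.linalg.norm(d)


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic regression fitted with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(class 1)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of rows where the sign of the logit matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


if __name__ == "__main__":
    X, y = make_synthetic_activations()
    X_tr, y_tr, X_te, y_te = split_train_test(X, y)
    direction = difference_of_means_direction(X_tr, y_tr)
    w, b = train_linear_probe(X_tr, y_tr)
    cosine = float(direction @ w / np.linalg.norm(w))
    print("held-out probe accuracy:", probe_accuracy(X_te, y_te, w, b))
    print("cosine(direction, probe weights):", cosine)
```

On real models the rows of `X` would be activations collected from prompts labeled by condition, and results would depend heavily on the layer and the labeling scheme, neither of which this sketch tries to reproduce.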