I Trained Probes to Catch AI Models Sandbagging
📰 Dev.to · Subhadip Mitra
TL;DR: I extracted "sandbagging directions" from three open-weight models and trained linear probes...