Ekdeep Singh Lubana - From Probes to Rewards: Using Interpretability to Shape Training

Cohere · Advanced · 🛡️ AI Safety & Ethics · 2w ago
Ekdeep Singh Lubana — Guest Speaker @ Cohere Labs AI Safety & Alignment Reading Group

Ekdeep is a Member of Technical Staff at Goodfire and was previously a research fellow at Harvard's Center for Brain Science. His recent work addresses core issues with how we extract and use interpretability signals: showing that SAEs carry temporal assumptions mismatched to LM representations (Priors in Time), that SAE training is unstable across runs (Archetypal SAE), and that probes on model internals can serve as cheap RL reward signals, cutting hallucinations by 58% while remaining useful as monitors after training (Features as Rewards). He also gave a guest lecture at Stanford on what counts as an explanation in interpretability, which is good background viewing.

This session is brought to you by the Cohere Labs Open Science Community, a space where ML researchers, engineers, linguists, social scientists, and lifelong learners connect and collaborate with each other. We'd like to extend a special thank you to Alif Munim and Abrar Frahman, Leads of our AI Safety and Alignment group, for their dedication in organizing this event.

If you're interested in sharing your work, we welcome you to join us! Simply fill out the form at https://forms.gle/ALND9i6KouEEpCnz6 to express your interest in becoming a speaker. Join the Cohere Labs Open Science Community to see a full list of upcoming events (https://tinyurl.com/CohereLabsCommunityApp).
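For readers curious what "probes as rewards" might look like in practice, here is a minimal, illustrative sketch (not the paper's implementation): a linear probe over an intermediate layer's hidden states scores a sampled response, and that score is used as the reward in a simple REINFORCE-style update. The model name, probe layer, reward definition, and training loop below are all assumptions chosen to make the example runnable.

```python
# Illustrative sketch of a probe-as-reward RL step (assumptions throughout).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; any causal LM would work
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical probe: maps a hidden state to a score in [0, 1] (e.g. "grounded claim").
# In a real setup its weights would come from supervised probe training on labeled activations.
probe = torch.nn.Linear(lm.config.hidden_size, 1)

def probe_reward(hidden_states: torch.Tensor) -> torch.Tensor:
    """Mean probe score over the generated tokens, squashed to [0, 1]."""
    return torch.sigmoid(probe(hidden_states)).mean()

optimizer = torch.optim.Adam(lm.parameters(), lr=1e-6)
prompt = "The capital of Australia is"
inputs = tok(prompt, return_tensors="pt")
prompt_len = inputs.input_ids.shape[1]

# 1) Sample a continuation from the current policy.
with torch.no_grad():
    gen = lm.generate(**inputs, do_sample=True, max_new_tokens=16,
                      pad_token_id=tok.eos_token_id)

# 2) Re-run the model to get hidden states and log-probs for the sampled sequence.
out = lm(gen, output_hidden_states=True)
layer = len(out.hidden_states) // 2          # which layer to probe is an assumption
gen_hidden = out.hidden_states[layer][:, prompt_len:, :]
reward = probe_reward(gen_hidden)            # scalar reward from the probe

logits = out.logits[:, :-1, :]
targets = gen[:, 1:]
logp = torch.gather(F.log_softmax(logits, dim=-1), 2,
                    targets.unsqueeze(-1)).squeeze(-1)
gen_logp = logp[:, prompt_len - 1:].sum()    # log-prob of the generated span only

# 3) REINFORCE-style update: reinforce samples the probe scores highly.
optimizer.zero_grad()
loss = -(reward.detach() * gen_logp)
loss.backward()
optimizer.step()
```

A real setup would use a probe trained to detect the property of interest (e.g. hallucinated claims) and a standard RL fine-tuning algorithm such as PPO or GRPO over batches of samples, rather than this single-sample REINFORCE step.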
Watch on YouTube ↗

Related AI Lessons

Behind the Scenes Hardening Firefox with Claude Mythos Preview
Learn how Mozilla used Claude Mythos to identify and fix hundreds of vulnerabilities in Firefox, improving browser security
Simon Willison's Blog
AI Alignment Might Be Optimizing the Wrong Objective
AI alignment might be optimizing the wrong objective, highlighting the need to redefine what alignment means and how it's achieved
Medium · AI
Cognitive Surrender: how much thinking should leaders outsource to AI?
Learn how leaders can effectively balance AI-driven insights with human judgment to avoid cognitive surrender
Medium · Data Science