Ekdeep Singh Lubana - From Probes to Rewards: Using Interpretability to Shape Training

Cohere · Advanced · 🛡️ AI Safety & Ethics · 2w ago
Ekdeep Singh Lubana — Guest Speaker @ Cohere Labs AI Safety & Alignment Reading Group

Ekdeep is a Member of Technical Staff at Goodfire and was previously a research fellow at Harvard's Center for Brain Science. His recent work addresses core issues with how we extract and use interpretability signals: showing that SAEs carry temporal assumptions mismatched to LM representations (Priors in Time), that SAE training is unstable across runs (Archetypal SAE), and that probes on model internals can serve as cheap RL reward signals, cutting hallucinations by 58% while remaining useful as monitors after training (Features as Rewards). He also gave a guest lecture at Stanford on what counts as an explanation in interpretability, which is good background viewing.

This session is brought to you by the Cohere Labs Open Science Community, a space where ML researchers, engineers, linguists, social scientists, and lifelong learners connect and collaborate with each other. We'd like to extend a special thank you to Alif Munim and Abrar Frahman, Leads of our AI Safety and Alignment group, for their dedication in organizing this event.

If you're interested in sharing your work, we welcome you to join us! Simply fill out the form at https://forms.gle/ALND9i6KouEEpCnz6 to express your interest in becoming a speaker. Join the Cohere Labs Open Science Community to see a full list of upcoming events (https://tinyurl.com/CohereLabsCommunityApp).
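For readers curious what "probes as rewards" might look like in practice, here is a minimal, illustrative sketch (not the paper's implementation): a linear probe over an intermediate layer's hidden states scores a sampled response, and that score is used as the reward in a simple REINFORCE-style update. The model name, probe layer, reward definition, and training loop below are all assumptions chosen to make the example runnable.

```python
# Illustrative sketch of a probe-as-reward RL step (assumptions throughout).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; any causal LM would work
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical probe: maps a hidden state to a score in [0, 1] (e.g. "grounded claim").
# In a real setup its weights would come from supervised probe training on labeled activations.
probe = torch.nn.Linear(lm.config.hidden_size, 1)

def probe_reward(hidden_states: torch.Tensor) -> torch.Tensor:
    """Mean probe score over the generated tokens, squashed to [0, 1]."""
    return torch.sigmoid(probe(hidden_states)).mean()

optimizer = torch.optim.Adam(lm.parameters(), lr=1e-6)
prompt = "The capital of Australia is"
inputs = tok(prompt, return_tensors="pt")
prompt_len = inputs.input_ids.shape[1]

# 1) Sample a continuation from the current policy.
with torch.no_grad():
    gen = lm.generate(**inputs, do_sample=True, max_new_tokens=16,
                      pad_token_id=tok.eos_token_id)

# 2) Re-run the model to get hidden states and log-probs for the sampled sequence.
out = lm(gen, output_hidden_states=True)
layer = len(out.hidden_states) // 2          # which layer to probe is an assumption
gen_hidden = out.hidden_states[layer][:, prompt_len:, :]
reward = probe_reward(gen_hidden)            # scalar reward from the probe

logits = out.logits[:, :-1, :]
targets = gen[:, 1:]
logp = torch.gather(F.log_softmax(logits, dim=-1), 2,
                    targets.unsqueeze(-1)).squeeze(-1)
gen_logp = logp[:, prompt_len - 1:].sum()    # log-prob of the generated span only

# 3) REINFORCE-style update: reinforce samples the probe scores highly.
optimizer.zero_grad()
loss = -(reward.detach() * gen_logp)
loss.backward()
optimizer.step()
```

A real setup would use a probe trained to detect the property of interest (e.g. hallucinated claims) and a standard RL fine-tuning algorithm such as PPO or GRPO over batches of samples, rather than this single-sample REINFORCE step.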
Watch on YouTube ↗

Related AI Lessons

Behind the Scenes Hardening Firefox with Claude Mythos Preview
Learn how Mozilla used Claude Mythos to identify and fix hundreds of vulnerabilities in Firefox, improving browser security
Simon Willison's Blog
AI Alignment Might Be Optimizing the Wrong Objective
AI alignment might be optimizing the wrong objective, highlighting the need to redefine what alignment means and how it's achieved
Medium · AI
Cognitive Surrender: how much thinking should leaders outsource to AI?
Learn how leaders can effectively balance AI-driven insights with human judgment to avoid cognitive surrender
Medium · Data Science