Can We Locate and Prevent Stereotypes in LLMs?
📰 ArXiv cs.AI
arXiv:2604.19764v1 Announce Type: cross Abstract: Stereotypes in large language models (LLMs) can perpetuate harmful societal biases. Despite the widespread use of these models, little is known about where such biases reside inside the neural network. This study investigates the internal mechanisms of GPT-2 Small and Llama 3.2 to locate stereotype-related activations. We explore two approaches: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads…
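The first approach, finding contrastive neuron activations, can be illustrated with a minimal sketch. The code below uses synthetic activations as stand-ins for real hidden states (e.g., MLP outputs of one GPT-2 Small layer); the neuron counts, injected signal, and ranking criterion are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

# Hedged sketch: rank "contrastive" neurons by the mean activation gap
# between stereotyped and counter-stereotyped prompt pairs.
# All activations here are synthetic; only the ranking logic is the point.

rng = np.random.default_rng(0)
n_prompts, n_neurons = 32, 768  # hypothetical: hidden width of GPT-2 Small

# Synthetic activations for paired prompts (same template, contrasted group term).
acts_stereo = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))
acts_counter = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))

# Inject a known signal into a few neurons so the method has something to find.
signal_neurons = [10, 200, 500]
acts_stereo[:, signal_neurons] += 3.0

def contrastive_neurons(a, b, top_k=5):
    """Indices of neurons with the largest absolute mean activation gap."""
    gap = np.abs(a.mean(axis=0) - b.mean(axis=0))
    return np.argsort(gap)[::-1][:top_k]

top = contrastive_neurons(acts_stereo, acts_counter)
print(sorted(top[:3].tolist()))  # the injected neurons should rank highest
```

In a real setting, the two activation matrices would come from running the model on minimal-pair prompts and caching a layer's hidden states; the top-ranked neurons are then candidates for ablation or further analysis.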