Can We Locate and Prevent Stereotypes in LLMs?
📰 ArXiv cs.AI
arXiv:2604.19764v1 Announce Type: cross Abstract: Stereotypes in large language models (LLMs) can perpetuate harmful societal biases. Despite the widespread use of these models, little is known about where such biases reside inside the neural network. This study investigates the internal mechanisms of GPT-2 Small and Llama 3.2 to locate stereotype-related activations. We explore two approaches: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads…
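The first approach, finding contrastive neuron activations, can be illustrated with a minimal sketch. The code below uses synthetic activations as stand-ins for real hidden states (e.g., MLP outputs of one GPT-2 Small layer); the neuron counts, injected signal, and ranking criterion are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

# Hedged sketch: rank "contrastive" neurons by the mean activation gap
# between stereotyped and counter-stereotyped prompt pairs.
# All activations here are synthetic; only the ranking logic is the point.

rng = np.random.default_rng(0)
n_prompts, n_neurons = 32, 768  # hypothetical: hidden width of GPT-2 Small

# Synthetic activations for paired prompts (same template, contrasted group term).
acts_stereo = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))
acts_counter = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))

# Inject a known signal into a few neurons so the method has something to find.
signal_neurons = [10, 200, 500]
acts_stereo[:, signal_neurons] += 3.0

def contrastive_neurons(a, b, top_k=5):
    """Indices of neurons with the largest absolute mean activation gap."""
    gap = np.abs(a.mean(axis=0) - b.mean(axis=0))
    return np.argsort(gap)[::-1][:top_k]

top = contrastive_neurons(acts_stereo, acts_counter)
print(sorted(top[:3].tolist()))  # the injected neurons should rank highest
```

In a real setting, the two activation matrices would come from running the model on minimal-pair prompts and caching a layer's hidden states; the top-ranked neurons are then candidates for ablation or further analysis.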