HatePrototypes: Interpretable and Transferable Representations for Implicit and Explicit Hate Speech Detection

📰 ArXiv cs.AI

HatePrototypes detects implicit and explicit hate speech using interpretable and transferable representations

advanced Published 7 Apr 2026

Action Steps

Identify existing hate speech benchmarks and their limitations
Develop new representations that capture implicit and indirect hate
Fine-tune models using these representations to improve detection accuracy
Evaluate and refine the models using transfer learning and interpretability metrics
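
The excerpt on this page does not describe how the paper constructs its representations, so the following is only an illustrative sketch of one common prototype-style approach, not the authors' method. It assumes each class prototype is the mean of that class's embeddings and that a new message is assigned to the prototype with the highest cosine similarity. The function names and the toy 2-D vectors are hypothetical; in practice the embeddings would come from a language model.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_prototypes(embeddings, labels):
    """Assumption: one prototype per class, the mean of that class's embeddings."""
    by_label = {}
    for vector, label in zip(embeddings, labels):
        by_label.setdefault(label, []).append(vector)
    return {label: mean_vector(vectors) for label, vectors in by_label.items()}


def classify(embedding, prototypes):
    """Return the label whose prototype is most similar to the embedding."""
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))


# Toy stand-ins for model embeddings (hypothetical data, for illustration only).
train_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["hateful", "hateful", "benign", "benign"]

prototypes = build_prototypes(train_embeddings, train_labels)
prediction = classify([0.7, 0.3], prototypes)
```

Because each prediction reduces to a similarity score against a small set of named class vectors, a sketch like this is easy to inspect, and the same prototypes can be reused on embeddings from another dataset without further training.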

Who Needs to Know This

AI engineers and researchers can use this research to improve hate speech detection models. Product managers can apply the findings to strengthen content moderation systems.

Key Insight

💡 Existing benchmarks mainly cover explicit hate, so detecting implicit or indirect hate requires representations that capture what those benchmarks overlook

Full Article

Abstract:
arXiv:2511.06391v3. Optimization of offensive content moderation models for different types of hateful messages is typically achieved through continued pre-training or fine-tuning on new hate speech benchmarks. However, existing benchmarks mainly address explicit hate toward protected groups and often overlook implicit or indirect hate, such as demeaning comparisons, calls for exclusion or violence, and subtle discriminatory language that still causes harm.
