Failure of contextual invariance in gender inference with large language models

📰 ArXiv cs.AI

Large language models' gender inference outputs are unstable under contextually equivalent formulations of a task

Published 25 Mar 2026
Action Steps
  1. Evaluate large language models on controlled pronoun selection tasks to assess contextual invariance
  2. Analyze model outputs for systematic shifts induced by minimal discourse context changes
  3. Investigate correlations between model outputs and cultural gender stereotypes
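Steps 1 and 2 above can be sketched as a minimal contextual-invariance probe: present the same underlying pronoun-selection task under several equivalent discourse formulations and check whether the model's choice shifts. Everything here is illustrative; `infer_pronoun`, the templates, and the role list are hypothetical stand-ins, not the paper's actual protocol.

```python
# Sketch of a contextual-invariance probe for pronoun selection.
# `infer_pronoun` is a hypothetical placeholder for a real LLM call.

# Minimal discourse variants of the same underlying task: each template
# is contextually equivalent, so a stable model should choose the same
# pronoun regardless of which formulation it sees.
TEMPLATES = [
    "The {role} finished the report. Then ___ went home.",
    "After finishing the report, the {role} left; ___ went home.",
]

# Hypothetical role list; in practice one would pick roles with known
# stereotype associations to support step 3 (stereotype correlation).
ROLES = ["nurse", "engineer", "teacher", "mechanic"]

def infer_pronoun(prompt: str) -> str:
    """Hypothetical model query; replace with a real LLM call.
    The stub returns a fixed pronoun so the probe runs end to end."""
    return "they"

def probe(roles=ROLES, templates=TEMPLATES):
    """For each role, collect the set of pronouns chosen across the
    equivalent formulations. A set with more than one element flags a
    contextual-invariance failure for that role."""
    results = {}
    for role in roles:
        results[role] = {infer_pronoun(t.format(role=role)) for t in templates}
    return results

if __name__ == "__main__":
    unstable = {r: c for r, c in probe().items() if len(c) > 1}
    print("roles with unstable pronoun choice:", sorted(unstable))
```

With a real model plugged in, the per-role pronoun sets feed step 3 directly: comparing which roles drift toward gendered pronouns against known cultural stereotypes.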
Who Needs to Know This

AI engineers and ML researchers benefit from understanding this limitation of large language models in gender inference tasks, since output instability under equivalent phrasings directly affects the development of fair, unbiased AI systems

Key Insight

💡 Large language models' outputs are not stable under contextually equivalent formulations of a task, which can perpetuate cultural gender stereotypes

Share This
💡 LLMs' gender inference outputs can shift significantly with minimal context changes