Interpretable Coreference Resolution Evaluation Using Explicit Semantics
arXiv cs.AI
arXiv:2605.10627v1 (announce type: cross)

Abstract: Coreference resolution is typically evaluated using aggregate statistical metrics such as CoNLL-F1, which measure structural overlap between predicted and gold clusters. While widely used, these metrics offer limited diagnostic insights, penalizing errors without revealing whether a system struggles with specific semantic categories, such as people, locations, or events, and making it difficult to interpret model capabilities or derive actionable…
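
To make "structural overlap between predicted and gold clusters" concrete, here is a minimal sketch of B-cubed (B³), one of the three metrics conventionally averaged into CoNLL-F1 (alongside MUC and CEAF). It is not the paper's method. The function name `b_cubed`, the mention ids, and the toy clusters are assumptions made for illustration, and the sketch assumes that gold and predicted clusters cover exactly the same mentions.

```python
def b_cubed(gold, pred):
    """Return B-cubed (precision, recall, F1).

    gold, pred: lists of sets of mention ids.
    Assumption: both partitions cover exactly the same mentions.
    """
    # Map every mention to the cluster that contains it.
    gold_of = {m: cluster for cluster in gold for m in cluster}
    pred_of = {m: cluster for cluster in pred for m in cluster}
    mentions = list(gold_of)

    # Per-mention overlap between its gold cluster and its predicted cluster.
    overlap = {m: len(gold_of[m] & pred_of[m]) for m in mentions}

    precision = sum(overlap[m] / len(pred_of[m]) for m in mentions) / len(mentions)
    recall = sum(overlap[m] / len(gold_of[m]) for m in mentions) / len(mentions)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (hypothetical mentions): "c" is placed in the wrong cluster.
gold = [{"a", "b", "c"}, {"d", "e"}]
pred = [{"a", "b"}, {"c", "d", "e"}]
p, r, f = b_cubed(gold, pred)
print(round(p, 4), round(r, 4), round(f, 4))
```

The result is a single triple of numbers for the whole document. Nothing in it indicates whether the misplaced mention was a person, a location, or an event, which is the diagnostic gap the abstract describes.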