Hodoscope: Unsupervised Monitoring for AI Misbehaviors
📰 ArXiv cs.AI
arXiv:2604.11072v1 (announce type: new)
Abstract: Existing approaches to monitoring AI agents rely on supervised evaluation: human-written rules or LLM-based judges that check for known failure modes. However, novel misbehaviors may fall outside predefined categories entirely, and LLM-based judges can be unreliable. To address this, we formulate unsupervised monitoring, drawing an analogy to unsupervised learning. Rather than checking for specific misbehaviors, an unsupervised monitor assists human