Three Models of RLHF Annotation: Extension, Evidence, and Authority

arXiv cs.AI

arXiv:2604.25895v1 (cross-listed)

Abstract: Preference-based alignment methods, most prominently Reinforcement Learning from Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators […]

Published 29 Apr 2026