What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

📰 ArXiv cs.AI

arXiv:2510.26202v2

Abstract: Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse

Published 14 Apr 2026