What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
📰 ArXiv cs.AI
arXiv:2510.26202v2. Abstract: Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback? (WIMHF), a method to explain feedback data using sparse