Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
📰 ArXiv cs.AI
Researchers introduce the Token Mapping Perturbation Attack (TOMPA), a new paradigm for attacking reward models in reinforcement learning from human feedback (RLHF).
Action Steps
- Understand the concept of reward hacking and its impact on reinforcement learning from human feedback
- Recognize the limitations of existing attacks that operate within the semantic space
- Apply the Token Mapping Perturbation Attack (TOMPA) framework to perform adversarial optimization in the token space
- Analyze the effectiveness of TOMPA in exploiting RM biases and develop countermeasures to improve model robustness
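The token-space idea behind the steps above can be sketched as a toy example. Note that everything here is an assumption for illustration: the `toy_reward_model` is a stand-in with a deliberately planted bias, and the greedy token-substitution loop is only one plausible form of "adversarial optimization in the token space", not the paper's actual TOMPA algorithm. The key contrast with semantic-space attacks is that candidate edits are raw token-id swaps, not meaning-preserving paraphrases.

```python
# Hypothetical sketch of a token-space attack on a reward model (RM).
# NOT the paper's TOMPA method: toy RM, assumed greedy search.
import random

def toy_reward_model(token_ids):
    """Stand-in RM scoring a token-id sequence. A real RM is a learned
    network; this toy has an exploitable bias toward ids divisible by 7."""
    return sum(1.0 for t in token_ids if t % 7 == 0) / len(token_ids)

def token_space_attack(token_ids, vocab_size=50_000, steps=200, seed=0):
    """Greedily perturb individual token ids to inflate the RM score,
    ignoring what the tokens decode to (a token-space move, not a paraphrase)."""
    rng = random.Random(seed)
    best = list(token_ids)
    best_score = toy_reward_model(best)
    for _ in range(steps):
        cand = list(best)
        pos = rng.randrange(len(cand))
        cand[pos] = rng.randrange(vocab_size)  # raw token-id substitution
        score = toy_reward_model(cand)
        if score > best_score:                 # keep only improving perturbations
            best, best_score = cand, score
    return best, best_score

original = [101, 42, 933, 15, 2048, 7]
adversarial, adv_score = token_space_attack(original)
print(toy_reward_model(original), adv_score)
```

Because the search never accepts a worsening swap, the adversarial score is at least the original one; the resulting sequence may be semantic nonsense, which is exactly the failure mode a token-space attack exposes and a semantic-space attack cannot.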
Who Needs to Know This
AI researchers and engineers working on reinforcement learning and reward models can use TOMPA to probe and harden their models. ML researchers and security experts can apply the same insights to build more robust RL systems.
Key Insight
💡 TOMPA takes a fundamentally different approach to attacking reward models: it operates in the token space rather than the semantic space, so its perturbations need not preserve meaning at all.
Share This
🚨 Introducing TOMPA: a new attack paradigm for reward models in RLHF #AI #RLHF #RewardModels
DeepCamp AI