Efficient Process Reward Modeling via Contrastive Mutual Information

📰 ArXiv cs.AI

arXiv:2604.10660v1 (cross-listed)

Abstract: Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign a reward score to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources.
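
The MC baseline the abstract refers to is commonly implemented by rolling out several continuations from each step prefix and scoring the step by the fraction of rollouts that reach a correct final answer. Below is a minimal sketch of that idea only; `sample_completions`, `is_correct`, and the function names are illustrative stand-ins, not the paper's code or method.

```python
import random
from typing import Callable, List

def mc_step_reward(
    prefix_steps: List[str],
    sample_completions: Callable[[List[str], int], List[str]],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 8,
) -> float:
    """Score a reasoning prefix by the fraction of MC rollouts that
    continue from it and reach a correct final answer."""
    completions = sample_completions(prefix_steps, num_rollouts)
    return sum(is_correct(c) for c in completions) / num_rollouts

def annotate_trajectory(
    steps: List[str],
    sample_completions: Callable[[List[str], int], List[str]],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 8,
) -> List[float]:
    """Assign an MC-estimated reward to every step prefix of a CoT trace."""
    return [
        mc_step_reward(steps[: i + 1], sample_completions, is_correct, num_rollouts)
        for i in range(len(steps))
    ]

if __name__ == "__main__":
    # Toy demo: a dummy sampler whose completions are "correct" 60% of the time.
    rng = random.Random(0)

    def dummy_sampler(prefix: List[str], n: int) -> List[str]:
        return ["42" if rng.random() < 0.6 else "wrong" for _ in range(n)]

    print(annotate_trajectory(["step 1", "step 2"], dummy_sampler, lambda a: a == "42"))
```

The cost the abstract alludes to is visible in this sketch: annotating a single trajectory takes `num_rollouts` model generations per step, so labelling a large CoT corpus this way quickly becomes expensive.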

Published 14 Apr 2026