The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
📰 ArXiv cs.AI
arXiv:2604.11297v1 Announce Type: cross
Abstract: Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical