SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

📰 ArXiv cs.AI

arXiv:2604.07791v2 (Announce Type: replace)

Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment

Published 14 Apr 2026
