WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

📰 arXiv cs.AI

arXiv:2604.10988v1

Abstract: Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline -- Plan, Generate, Refine, and Validate -- that produces interactive, se

Published 14 Apr 2026