Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

📰 ArXiv cs.AI

arXiv:2506.03610v3. Abstract: Large Language Model (LLM) agents are reshaping the game industry by enabling more intelligent and human-preferable characters. Yet current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets to adapt pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a benchmark for

Published 16 Apr 2026
