PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

📰 ArXiv cs.AI

arXiv:2604.08987v1

Abstract: As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operational

Published 13 Apr 2026
