GPT-5.1 scored 26%. Gemini 3 Flash scored 74%. Same prompt, same tools.
📰 Dev.to · ThomasP
In the previous article, I explained how we built the evaluation infrastructure for our AI agent: a...