Using Code Evaluators in Phoenix
Key Takeaways
Code evaluators in Phoenix run custom scoring logic, written in Python or TypeScript, inside sandboxed execution environments.
Original Description
In this walkthrough, Mikyo from the Phoenix open source team introduces code evaluators with sandboxed execution — now natively supported in Arize Phoenix.
Code evaluators let you write custom logic in Python or TypeScript to score your model outputs, no LLM-as-a-judge required (unless you want one). To run that code safely, Phoenix ships with two flavors of sandboxes:
Local sandboxes — WebAssembly and Deno, running directly on Phoenix with no network or third-party module access. Great for lightweight checks.
Hosted sandboxes — day-one support for E2B, Daytona, Vercel, and Modal, with network access and third-party libraries for more elaborate evaluation strategies.
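As an illustration of a lightweight check that suits a local sandbox, here is a minimal sketch of a no-emoji check that uses only the Python standard library, so it needs no network access or third-party modules. The function name `no_emoji`, the returned dictionary shape (`score`, `label`, `explanation`), and the chosen Unicode ranges are assumptions made for this example, not the Phoenix evaluator API; the ranges are a rough approximation of what counts as an emoji.

```python
import re

# Rough approximation of emoji code points (an assumption for this sketch;
# it also matches some plain symbols such as check marks).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols used for flags
    "]"
)


def no_emoji(output: str) -> dict:
    """Score 1.0 when the text contains no emoji, 0.0 otherwise.

    The return shape is hypothetical, chosen only for illustration.
    """
    found = EMOJI_PATTERN.findall(output)
    return {
        "score": 0.0 if found else 1.0,
        "label": "has_emoji" if found else "no_emoji",
        "explanation": f"found {len(found)} emoji character(s)",
    }
```

Calling `no_emoji("Mix the flour and water.")` gives a score of 1.0, while a recipe step containing a pizza emoji gives 0.0.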
Using a recipe-generation dataset as a running example, Mikyo walks through five evaluation patterns you can build with code evaluators:
Regex-based checks (a no-emoji evaluator running on WebAssembly)
Cosine similarity against a reference, using OpenAI embeddings inside a Daytona sandbox
Pairwise LLM-as-a-judge with position shuffling to reduce ordering bias
Composite evaluators that combine multiple weighted criteria (e.g., deliciousness + clarity) into a single score
LLM juries that aggregate judgments from multiple model providers (Anthropic + OpenAI) to get more balanced verdicts
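The position-shuffling idea in the pairwise pattern can be sketched as follows. This is a minimal example under stated assumptions: the judge is passed in as a plain callable (hypothetical signature: prompt, answer shown as "A", answer shown as "B", returning "A" or "B"), so no particular LLM client or Phoenix API is implied. The key step is mapping the judge's positional verdict back to the candidate after the order has been randomized.

```python
import random
from typing import Callable, Optional

# Hypothetical judge signature for this sketch: (prompt, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_with_shuffle(
    prompt: str,
    candidate: str,
    reference: str,
    judge: Judge,
    rng: Optional[random.Random] = None,
) -> dict:
    """Compare candidate against reference with a randomized presentation order."""
    rng = rng or random.Random()
    # Flip a coin to decide which answer the judge sees in position "A".
    candidate_first = rng.random() < 0.5
    if candidate_first:
        first, second = candidate, reference
    else:
        first, second = reference, candidate

    verdict = judge(prompt, first, second)
    if verdict not in ("A", "B"):
        raise ValueError(f"judge returned unexpected verdict: {verdict!r}")

    # Translate the positional verdict back to "did the candidate win?".
    candidate_won = (verdict == "A") == candidate_first
    return {
        "score": 1.0 if candidate_won else 0.0,
        "label": "candidate" if candidate_won else "reference",
        "candidate_position": "A" if candidate_first else "B",
    }
```

With this arrangement, a judge that always prefers whichever answer appears first no longer systematically favors one side: over many examples the candidate lands in position "A" about half the time, so ordering bias averages out instead of inflating the score.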
Each evaluator is configured directly in the Phoenix UI, with sandbox providers, environment variables, and dependencies managed through sandbox configurations.
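The scoring arithmetic behind the cosine similarity, composite, and jury patterns listed above comes down to a few small functions. The sketch below is illustrative only: the function names, the majority-vote rule for the jury, and the input shapes are assumptions, and the embeddings and per-criterion or per-provider judgments are taken as already computed (in practice they would come from model calls made inside a hosted sandbox).

```python
import math
from collections import Counter
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_a * norm_b)


def weighted_composite(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-criterion scores, e.g. deliciousness and clarity."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def jury_verdict(votes: Dict[str, str]) -> dict:
    """Majority vote across jurors, keyed by provider name (assumed rule)."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes.values())
    top_label, top_count = counts.most_common(1)[0]
    # Report a tie explicitly instead of silently picking one label.
    tied = sum(1 for c in counts.values() if c == top_count) > 1
    return {
        "label": "tie" if tied else top_label,
        "agreement": top_count / len(votes),
    }
```

For example, weighting deliciousness twice as heavily as clarity amounts to `weighted_composite({"deliciousness": 0.8, "clarity": 0.6}, {"deliciousness": 2, "clarity": 1})`. With only two jurors, a split vote is reported as a tie; an odd number of jurors or a tie-breaking rule would avoid that case.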
Try it out in Phoenix and let us know what you build.
🔗 Phoenix: https://phoenix.arize.com
📖 Docs: https://docs.arize.com/phoenix