Using Code Evaluators in Phoenix

Arize AI · Intermediate · 📰 AI News & Updates · 1mo ago

Skills: AI Pair Programming (80%)

Key Takeaways

Code evaluators in Phoenix score model outputs with custom Python or TypeScript logic, and that logic runs inside a local or hosted sandbox.

Original Description

In this walkthrough, Mikyo from the Phoenix open source team introduces code evaluators with sandboxed execution, now natively supported in Arize Phoenix. Code evaluators let you write custom logic in Python or TypeScript to score your model outputs, no LLM-as-a-judge required (unless you want one).

To run that code safely, Phoenix ships with two flavors of sandboxes:

- Local sandboxes: WebAssembly and Deno, running directly on Phoenix with no network or third-party module access. Great for lightweight checks.
- Hosted sandboxes: day-one support for E2B, Daytona, Vercel, and Modal, with network access and third-party libraries for more elaborate evaluation strategies.

Using a recipe-generation dataset as a running example, Mikyo walks through five evaluation patterns you can build with code evaluators:

- Regex-based checks (a no-emoji evaluator running on WebAssembly)
- Cosine similarity against a reference, using OpenAI embeddings inside a Daytona sandbox
- Pairwise LLM-as-a-judge with position shuffling to reduce ordering bias
- Composite evaluators that combine multiple weighted criteria (e.g., deliciousness + clarity) into a single score
- LLM juries that aggregate judgments from multiple model providers (Anthropic + OpenAI) to get more balanced verdicts

Each evaluator is configured directly in the Phoenix UI, with sandbox providers, environment variables, and dependencies managed through sandbox configurations. Try it out in Phoenix and let us know what you build.

Phoenix: https://phoenix.arize.com
Docs: https://docs.arize.com/phoenix

#LLMEvaluation #AIObservability #Phoenix #ArizeAI
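The five patterns named in the description can be sketched as plain Python functions. This is a minimal illustration under stated assumptions, not Phoenix's evaluator API: the description does not show the entry-point signature Phoenix expects, so every function name and signature below is hypothetical. To keep the sketch runnable without network access, embeddings are passed in as plain lists of floats rather than fetched from OpenAI, and the LLM judge is an injected callable rather than a real model call.

```python
import math
import random
import re
from collections import Counter
from typing import Callable, Dict, Sequence

# Rough emoji ranges (an assumption; not an exhaustive Unicode emoji matcher).
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)


def no_emoji(output: str) -> float:
    """Regex check: 1.0 if no character falls in the ranges above, else 0.0."""
    return 0.0 if EMOJI_PATTERN.search(output) else 1.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical judge interface: given two texts, return "first" or "second".
Judge = Callable[[str, str], str]


def pairwise_preference(
    candidate: str, reference: str, judge: Judge, rng: random.Random
) -> float:
    """Pairwise judging with position shuffling.

    The candidate is shown first or second at random, and the verdict is
    mapped back, so a judge that favors one slot cannot favor one side.
    """
    candidate_first = rng.random() < 0.5
    first, second = (
        (candidate, reference) if candidate_first else (reference, candidate)
    )
    verdict = judge(first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    candidate_won = (verdict == "first") == candidate_first
    return 1.0 if candidate_won else 0.0


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-criterion scores, normalized by total weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * scores[name] for name in weights) / total


def jury_verdict(votes: Sequence[str]) -> str:
    """Majority label across jurors; ties go to the label seen first."""
    if not votes:
        raise ValueError("at least one vote is required")
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Stand-in judge that prefers the longer text; a real one would call an LLM.
    def prefer_longer(a: str, b: str) -> str:
        return "first" if len(a) >= len(b) else "second"

    recipe = "Whisk two eggs, add salt, and cook on low heat for three minutes."
    print(no_emoji(recipe))
    print(cosine_similarity([0.1, 0.9], [0.2, 0.8]))
    print(pairwise_preference(recipe, "Cook eggs.", prefer_longer, random.Random(0)))
    print(composite_score({"deliciousness": 0.8, "clarity": 0.5},
                          {"deliciousness": 2.0, "clarity": 1.0}))
    print(jury_verdict(["pass", "fail", "pass"]))
```

Of these, only the regex check and the composite arithmetic are pure functions with no external calls, which is what makes them candidates for a local sandbox without network access. The embedding, pairwise, and jury patterns call model providers, so per the description they would need a hosted sandbox with API keys supplied through environment variables.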


