📰 Dev.to · Marcus Chen

24 articles

The 2am call that dropped before the user finished talking, and the week I spent finding out why my tracer never saw it

Dev.to · Marcus Chen 2d ago

The call came in at 2am. Not a page, an actual support recording, flagged by a customer who said our...

Position bias in LLM-as-judge flipped 18% of our verdicts

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 1w ago

TL;DR: Position bias in LLM-as-judge means the model favors whichever answer it reads first. We...

The 1.4 Seconds That Weren't on Any Span

Dev.to · Marcus Chen 1w ago

On the morning of June 3rd, a customer on a live call sat through 1.4 seconds of dead air after she...

The Retry That Booked Mrs. Alvarez Twice

Dev.to · Marcus Chen 🤖 AI Agents & Automation ⚡ AI Lesson 1w ago

Week one of the pilot, our voice agent booked appointments for a dental group. Forty operatories,...

Bootstrap confidence intervals for your LLM eval metrics

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 1w ago

TL;DR: A single eval number hides its own uncertainty. Eval confidence intervals from bootstrap...

Benchmarking 5 LLM providers on one eval set, no SDK per vendor

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 1w ago

TL;DR: We run a 1,200-case eval suite for enterprise agent automation at Nexus Labs. Comparing models...

temperature=0 didn't make our LLM evals reproducible

Dev.to · Marcus Chen ⚡ AI Lesson 1w ago

TL;DR: We set temperature=0 and seed=42 and still got different eval scores on the same 800-prompt...

Harvesting a regression test set from gateway logs with a plugin

Dev.to · Marcus Chen 📐 ML Fundamentals ⚡ AI Lesson 1w ago

TL;DR: Our eval sets went stale because a human wrote the test cases by hand once and never updated...

Perplexity held flat after INT4. Task accuracy dropped 7 points.

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 2w ago

TL;DR: We quantized a fine-tuned 14B agent model to INT4 with GPTQ. Perplexity moved 0.04. We almost...

A 9-point eval gain vanished when we deduped train against test

Dev.to · Marcus Chen 2w ago

TL;DR: We fine-tuned an 8B model for an enterprise ticket-routing task and saw accuracy jump from 71%...

Our voice agent passed every test and still woke me up at 3am

Dev.to · Marcus Chen 3w ago

Replaying real call transcripts as your test set is a trap. The failures come from the...

The 4-layer voice-agent latency stack, traced with OTel spans

Dev.to · Marcus Chen 3w ago

How I instrument ASR, LLM, TTS, and the client with OpenTelemetry, and which number in...

Provider drift broke our regression evals. We pinned versions through Bifrost.

Dev.to · Marcus Chen 1mo ago

TL;DR: Our nightly agent regression suite dropped 4 points on a tool-calling metric with zero code or...

Aggregate eval scores hid a 14-point regression in one user segment

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 1mo ago

TL;DR: Our agent eval suite reported 87% pass rate before and after a fine-tune. The aggregate didn't...

Shadow-testing a fine-tuned 8B against gpt-4o-mini through Bifrost

Dev.to · Marcus Chen 1mo ago

TL;DR: We fine-tuned a Llama 3.1 8B for invoice line-item extraction. Before flipping production...

Continuous batching wrecked our p99 latency. Here's the trace.

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 1mo ago

TL;DR: We turned on vLLM continuous batching for a throughput win and watched p99 latency 8x in the...

Virtual keys per tenant: ditching our custom LLM billing layer

Dev.to · Marcus Chen 1mo ago

TL;DR: We had 11,247 lines of Python middleware handling per-tenant LLM cost attribution, rate...

LLM-as-judge variance broke our DPO training signal for 3 weeks

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 1mo ago

TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run....

Token-level eval harness for tool-calling agents: what we wired up

Dev.to · Marcus Chen 🤖 AI Agents & Automation ⚡ AI Lesson 1mo ago

TL;DR: We replaced our "did the agent finish the task" pass/fail eval with a token-level harness that...

Prefix caching in vLLM under multi-tenant agent traffic

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 1mo ago

TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT drop...

We Audited Our Agent Tool-Call Traces. Half Our Eval Data Was Garbage.

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 1mo ago

TL;DR: We pulled 41,000 production agent traces at Nexus Labs to build a fine-tuning dataset. After a...

Why Your LLM Eval Harness Is Lying to You (And How to Fix It)

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 1mo ago

TL;DR: Most eval harnesses I see in production are measuring the wrong thing. They report 87% pass...

Measuring AI Gateway Failover: 30 Days of Production Data

Dev.to · Marcus Chen 🤖 AI Agents & Automation ⚡ AI Lesson 1mo ago

TL;DR: We measured failover latency across three AI gateways (Bifrost, LiteLLM, Portkey) during 30...

What Gemma 4's multi-token prediction head actually means for your eval pipeline

Dev.to · Marcus Chen 🧠 Large Language Models ⚡ AI Lesson 2mo ago

Gemma 4 dropped with a multi-token prediction (MTP) head and immediately every benchmark thread on...