📰 Dev.to · Marcus Chen

15 articles · Updated every 3 hours

Aggregate eval scores hid a 14-point regression in one user segment
Dev.to · Marcus Chen 2w ago
TL;DR: Our agent eval suite reported 87% pass rate before and after a fine-tune. The aggregate didn't...
Shadow-testing a fine-tuned 8B against gpt-4o-mini through Bifrost
Dev.to · Marcus Chen 2w ago
TL;DR: We fine-tuned a Llama 3.1 8B for invoice line-item extraction. Before flipping production...
Continuous batching wrecked our p99 latency. Here's the trace.
Dev.to · Marcus Chen 2w ago
TL;DR: We turned on vLLM continuous batching for a throughput win and watched p99 latency 8x in the...
Virtual keys per tenant: ditching our custom LLM billing layer
Dev.to · Marcus Chen 2w ago
TL;DR: We had 11,247 lines of Python middleware handling per-tenant LLM cost attribution, rate...
LLM-as-judge variance broke our DPO training signal for 3 weeks
Dev.to · Marcus Chen 3w ago
TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run....
Token-level eval harness for tool-calling agents: what we wired up
Dev.to · Marcus Chen 3w ago
TL;DR: We replaced our "did the agent finish the task" pass/fail eval with a token-level harness that...
Prefix caching in vLLM under multi-tenant agent traffic
Dev.to · Marcus Chen 3w ago
TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT drop...
We Audited Our Agent Tool-Call Traces. Half Our Eval Data Was Garbage.
Dev.to · Marcus Chen 3w ago
TL;DR: We pulled 41,000 production agent traces at Nexus Labs to build a fine-tuning dataset. After a...
Why Your LLM Eval Harness Is Lying to You (And How to Fix It)
Dev.to · Marcus Chen 3w ago
TL;DR: Most eval harnesses I see in production are measuring the wrong thing. They report 87% pass...
Measuring AI Gateway Failover: 30 Days of Production Data
Dev.to · Marcus Chen 3w ago
TL;DR: We measured failover latency across three AI gateways (Bifrost, LiteLLM, Portkey) during 30...
What Gemma 4's multi-token prediction head actually means for your eval pipeline
Dev.to · Marcus Chen 2mo ago
Gemma 4 dropped with a multi-token prediction (MTP) head and immediately every benchmark thread on...