Building a RAG Evaluation Harness That Actually Catches Problems

📰 Dev.to · Shiva Shrestha

I shipped a RAG chatbot without any measurement, then built a proper eval harness. Hit@1 went from 60% to 80%, the hallucination rate dropped from 41% to 28%, and two metrics still fail. Here's the whole story.

Published 5 May 2026
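
For readers unfamiliar with the Hit@1 figure quoted in the summary, here is a minimal sketch of how a Hit@k retrieval metric is commonly computed. It is not the author's harness: the function name `hit_at_k`, the input shapes, and the toy document IDs are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Sequence


def hit_at_k(
    ranked_ids: Sequence[Sequence[str]],
    relevant_ids: Sequence[set[str]],
    k: int = 1,
) -> float:
    """Fraction of queries whose top-k retrieved IDs contain a relevant ID.

    ranked_ids[i]   -- retrieved document IDs for query i, best first (hypothetical shape)
    relevant_ids[i] -- set of IDs labelled relevant for query i (hypothetical shape)
    """
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("ranked_ids and relevant_ids must have the same length")
    if not ranked_ids:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(doc_id in relevant for doc_id in ranked[:k])
    )
    return hits / len(ranked_ids)


# Toy data, invented for this example: 5 queries, 4 of which rank a relevant doc first.
toy_ranked = [["d1", "d9"], ["d2", "d3"], ["d7", "d4"], ["d5", "d6"], ["d8", "d2"]]
toy_relevant = [{"d1"}, {"d2"}, {"d4"}, {"d5"}, {"d8"}]

score_at_1 = hit_at_k(toy_ranked, toy_relevant, k=1)
score_at_2 = hit_at_k(toy_ranked, toy_relevant, k=2)
```

With this toy data, `score_at_1` covers four of the five queries; raising `k` to 2 also counts the third query, whose relevant document sits in second place.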