
Application-Centric Evals: Stop Playing Whack-a-Mole

How to ship something people trust, come back to, and pay for. Inspired by Hamel Husain and Shreya Shankar's course, which I highly recommend.

Ryan Brandt
AI Agent Consultant & Founder of Vunda AI
7 min read


TL;DR

  • Analyze → Measure → Improve is the flywheel. Live in "Analyze" longer than feels comfortable
  • Close the Three Gulfs: Comprehension (know your data), Specification (say exactly what good looks like), Generalization (make it stick in the wild)
  • Your prompt is a spec. Bad spec in, nonsense out. Start strong so your error analysis surfaces real issues, not self-inflicted wounds
  • Taste is the moat. Models will not infer your product's vibe or constraints. You have to encode it
  • LLM judges are useful, but a sloppy judge is a random number generator. Validate alignment before you trust the metric

Why Evals Matter (And Why They Feel So Painful)

If you build LLM products for a living, you already know the cycle:

  1. Ship a change
  2. Metrics dip or a customer yells
  3. Bash in a random fix
  4. Repeat

It's like a bad game of whack-a-mole: expensive, demoralizing, and played completely in the dark. When a failure pops up, you have no idea what the root cause is. Is it the prompt? The retrieval step? The specific model version? The way you chained two tools together?

Without data, you're forced to guess. You pick one lever out of dozens—"let's try tweaking the prompt"—and push a fix. But you're flying blind. You don't know if it solved the original problem, and you have no idea if it just created three new, more subtle bugs elsewhere. This is the cycle of pain.

The antidote is systematic evaluation that is tied to your application’s real definition of quality. Not “does it solve GSM8K” quality. Your quality.

[Figure: The whack-a-mole cycle without proper evals: ship a change, metrics dip or a customer yells, bash a random fix, new issues emerge. Expensive, demoralizing, never-ending.]

When someone says "we have evals", ask: can you name the top three failure modes in production right now, and their prevalence? If the answer is no, you don't really have evals; you're playing whack-a-mole.

The Eval Flywheel

[Figure: The Eval Flywheel: 1. Analyze (stare at traces, label, cluster; output is a taxonomy of failure modes), 2. Measure (convert insights to signals; output is a dashboard, no more "feels better"), 3. Improve (pick levers by impact: prompts, models, RAG, fine-tuning), then loop back to Analyze.]

1. Analyze (Superpower Phase)

You stare at traces. Lots of them. You label, cluster, and articulate failure modes in plain language. You resist the urge to outsource or auto-judge too early because the gold is in your own eyes-on-glass time.

Concrete output: A taxonomy of failure modes, each with examples of pass and fail. You stop when new data stops revealing new modes.
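A taxonomy does not need tooling to start; a plain data structure is enough. Here is a minimal sketch, with hypothetical failure-mode names and trace IDs:

```python
from dataclasses import dataclass, field

@dataclass
class FailureMode:
    """One open-coded failure mode with hand-labeled example traces."""
    name: str
    description: str
    pass_examples: list = field(default_factory=list)
    fail_examples: list = field(default_factory=list)

# Hypothetical taxonomy built up while reading traces
taxonomy = [
    FailureMode(
        name="missing_plan_mention",
        description="Welcome email never names the plan the user subscribed to.",
        fail_examples=["trace_0042", "trace_0107"],
        pass_examples=["trace_0013"],
    ),
    FailureMode(
        name="invalid_json",
        description="Output cannot be parsed as the required JSON object.",
        fail_examples=["trace_0088"],
    ),
]
```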

2. Measure

Now you scale. Convert qualitative nuggets into quantitative signals. Maybe that is an LLM judge, maybe regex, maybe a human-in-the-loop sampler. The bar: if the judge disagrees with you a lot, fix the judge before you fix the model.

Concrete output: A dashboard that tells you how often each failure mode shows up. No more “it feels better”.
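As a sketch of what "measure" can mean in practice, here is a minimal prevalence counter over judged traces. The trace format and failure-mode names are assumptions, not a specific framework:

```python
from collections import Counter

def failure_dashboard(judged_traces):
    """judged_traces: [{"trace_id": "t1", "failures": ["missing_plan_mention"]}, ...]
    Returns the fraction of traces exhibiting each failure mode."""
    counts = Counter()
    for trace in judged_traces:
        counts.update(set(trace["failures"]))  # count each mode at most once per trace
    total = len(judged_traces)
    return {mode: count / total for mode, count in counts.most_common()}

rates = failure_dashboard([
    {"trace_id": "t1", "failures": ["missing_plan_mention"]},
    {"trace_id": "t2", "failures": []},
    {"trace_id": "t3", "failures": ["missing_plan_mention", "invalid_json"]},
])
print(rates)  # {'missing_plan_mention': 0.666..., 'invalid_json': 0.333...}
```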

3. Improve

Only now do you reach for the toolbox: prompt tweaks, model swaps, RAG changes, agent decomposition, fine-tuning, cost optimizations. You pick levers based on prevalence × impact. Then you loop back to Analyze to see what broke.
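A back-of-the-envelope version of that prioritization, with made-up prevalence numbers and severity scores:

```python
# prevalence comes from the dashboard; severity (1-5) is your own judgment call
failure_modes = {
    "missing_plan_mention": {"prevalence": 0.34, "severity": 4},
    "invalid_json": {"prevalence": 0.08, "severity": 5},
    "too_formal_tone": {"prevalence": 0.51, "severity": 2},
}

ranked = sorted(
    failure_modes.items(),
    key=lambda item: item[1]["prevalence"] * item[1]["severity"],
    reverse=True,
)
for name, scores in ranked:
    print(name, round(scores["prevalence"] * scores["severity"], 2))
# missing_plan_mention 1.36, too_formal_tone 1.02, invalid_json 0.4
```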

The Three Gulfs Model: Why This Is Hard

To ship something users trust, you have to cross three gaps. Different hats, different muscles.

[Figure: The Three Gulfs Model: Comprehension (do you know what users ask?), Specification (can you define "good"?), Generalization (will it work in the wild?). Different hats, different muscles: data science → product → engineering.]
  1. Gulf of Comprehension - Do you actually know what users are asking and what the system outputs across the long tail? You cannot read 10k chats, so you need sampling, clustering, and smart surfacing. Surprise is the signal you missed something.

  2. Gulf of Specification - Humans are vague by default. “Give me an easy recipe” is useless. What is “easy”? Ingredient count? Total cook time? Skill floor? Until you pin those down as do’s and don’ts, the model will improvise in ways you don't enjoy.

  3. Gulf of Generalization - LLMs do not generalize like humans. Passing A and B does not imply passing C. You need guardrails, decomposition, and continuous checks so “works on my prompt” does not turn into “oops in prod”.

If you feel whiplash switching hats between data science, product, and engineering, that is normal. There is no unicorn degree that teaches all three. The muscle is the context switch.
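For the Gulf of Comprehension, the cheapest way to start is rough clustering over a sample of user messages. A minimal sketch with scikit-learn and TF-IDF (swap in real embeddings if you have them; the example messages are made up):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample pulled from your trace store
user_messages = [
    "give me an easy recipe for dinner",
    "easy weeknight pasta please",
    "cancel my subscription",
    "how do I cancel my plan",
    "what's included in the pro plan",
]

vectors = TfidfVectorizer().fit_transform(user_messages)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

# Eyeball each cluster; surprise is the signal you missed something
for cluster_id in range(3):
    print(cluster_id, [m for m, l in zip(user_messages, labels) if l == cluster_id])
```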

Prompts Are Specs, Not Poetry

[Figure: From vague request ("Write a welcome email") to clear spec (role and objective, user context, numbered instructions, JSON output format). The result is measurable: "output missing user_plan value" instead of "the model isn't creative enough".]

A "good enough" initial prompt saves you weeks. Consider including:

  1. Role + Objective: “You are X. Your goal is Y for user Z.”
  2. Specific Guidelines: Ordered steps, constraints, hard don’ts.
  3. Required Format: JSON schema, sections, keys. If you can parse it, you can measure it.
  4. Relevant Context: User profile, retrieved docs, prior turns.
  5. Examples (when needed): Show one or two good outputs. Not 20. Keep train vs test clean.

For example, turning a vague request into a concrete spec looks like this:

  • Vague Request: Write a welcome email.
  • Identified Failure (from Analysis): Emails are too generic and don't mention the user's specific plan.
  • The Fix (A Better Spec):
You are an expert onboarding specialist for a SaaS company.
Your goal is to write a friendly, concise welcome email to a new user.

**User Details:**
- Name: {{user_name}}
- Plan: {{user_plan}}

**Instructions:**
1. Address the user by their first name.
2. Thank them for signing up.
3. Explicitly mention the **{{user_plan}}** they subscribed to.
4. Highlight one key benefit of their specific plan.
5. End with a clear call-to-action to log in and get started.

**Output Format:**
Return a single JSON object with the following keys:
- "subject": A string for the email subject line.
- "body": A string for the email body.

This simple change moves the failure from a vague "the model isn't creative enough" to a specific, measurable "the output didn't contain the user_plan value."
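Because the spec demands a parseable format, those requirements become cheap code checks. A minimal sketch (the function and check names here are mine, not a standard):

```python
import json

def check_welcome_email(raw_output, user_plan):
    """Turn the spec's requirements into pass/fail signals for one trace."""
    checks = {"valid_json": False, "has_subject_and_body": False, "mentions_plan": False}
    try:
        email = json.loads(raw_output)
    except json.JSONDecodeError:
        return checks
    if not isinstance(email, dict):
        return checks
    checks["valid_json"] = True
    checks["has_subject_and_body"] = {"subject", "body"} <= set(email)
    checks["mentions_plan"] = user_plan.lower() in email.get("body", "").lower()
    return checks

print(check_welcome_email(
    '{"subject": "Welcome!", "body": "Thanks for joining the Pro plan."}', "Pro"
))  # {'valid_json': True, 'has_subject_and_body': True, 'mentions_plan': True}
```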

Skip chain-of-thought boilerplate unless you are on an ancient model. Modern reasoning models already “think”.

The point is not perfection. It is to avoid wasting error analysis cycles on “it did not return JSON” or “it forgot to ask about allergies” when you never told it to.

Taste Is The Moat

Anyone can stand up a generic agent. No one can copy your taste in what great looks like for your product unless you spell it out. Specs are how you inject that taste. As models improve, generic specs can shrink, but your taste layer stays.

LLM Judges: Friends With Caveats

LLM-on-LLM evals are powerful. They are also easy to get wrong:

  • Judge prompt misaligned with your expectations = garbage metrics
  • Leakage between few-shot examples and test data = overfit judge
  • No calibration vs human labels = delusion

Treat judges like any other model: prompt them well, validate, monitor drift.
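Validation can be as simple as comparing the judge to your own labels on a held-out set. A minimal sketch with made-up labels:

```python
# Human and judge labels for the same validation traces: True = failure present
human = [True, False, True, True, False, False, True, False, False, True]
judge = [True, False, False, True, False, True, True, False, False, True]

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
recall = sum(h and j for h, j in zip(human, judge)) / sum(human)

print(f"raw agreement: {agreement:.0%}")                        # 80%
print(f"judge catches {recall:.0%} of human-labeled failures")  # 80%
# For a chance-corrected view, sklearn.metrics.cohen_kappa_score(human, judge) works too.
```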

Synthetic Data: Use It, Do Not Marry It

Bootstrapping with synthetic traces is great when you lack logs. Still, ground yourself in real user data ASAP. Synthetic distributions lie in subtle ways.

How To Actually Start (A Mini SOP)

  1. Draft the system prompt for your current app. Role, objective, do/don’t, format. Ship it to staging.
  2. Collect 50–100 traces. If you have none, generate synthetic but sanity-check.
  3. Open-code the failures. Write down every unique failure mode you see. Merge duplicates. Stop when saturation hits.
  4. Prioritize by frequency × severity.
  5. Write judge prompts or simple checks for top modes (see the sketch after this list). Validate on a labeled subset.
  6. Fix high ROI issues. Rinse and repeat.
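For step 5, a judge can start as nothing more than a tight prompt with a binary verdict. A minimal sketch; `call_llm` is a placeholder for however you invoke your model, and the failure-mode definition is illustrative:

```python
JUDGE_PROMPT = """You are grading one output of a welcome-email generator.

Failure mode: missing_plan_mention
Definition: the email body never explicitly names the plan the user subscribed to.

User plan: {user_plan}
Email body:
{email_body}

Answer with exactly one word: PASS or FAIL."""

def judge_plan_mention(email_body, user_plan, call_llm):
    """call_llm(prompt) -> str is assumed; wire in your own client."""
    prompt = JUDGE_PROMPT.format(user_plan=user_plan, email_body=email_body)
    verdict = call_llm(prompt).strip().upper()
    return verdict == "PASS"
# Validate this judge against your human labels (step 5) before trusting its numbers.
```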

Closing

If your current eval process feels like spreadsheets, scattered Slack threads, and “seems fine”, you are leaving money, trust, and sanity on the table. Put structure around your evals with an application-first lens. Obsess over error analysis. Encode your taste. Measure what matters. Then improve with purpose.

Do that, and you stop playing whack-a-mole and start compounding.

If you want help building out your own eval system or want to see examples of judge prompts and failure taxonomies, reach out. I'm always happy to share what I've learned.
