
Application-Centric Evals: Stop Playing Whack-a-Mole

How to ship something people trust, come back to, and pay for. Inspired by Hamel Husain and Shreya Shankar's course, which I highly recommend.

Ryan Brandt
AI Agent Consultant & Founder of Vunda AI
7 min read


TL;DR

  • Analyze → Measure → Improve is the flywheel. Live in "Analyze" longer than feels comfortable
  • Close the Three Gulfs: Comprehension (know your data), Specification (say exactly what good looks like), Generalization (make it stick in the wild)
  • Your prompt is a spec. Bad spec in, nonsense out. Start strong so your error analysis surfaces real issues, not self-inflicted wounds
  • Taste is the moat. Models will not infer your product's vibe or constraints. You have to encode it
  • LLM judges are useful, but a sloppy judge is a random number generator. Validate alignment before you trust the metric

Why Evals Matter (And Why They Feel So Painful)

If you build LLM products for a living, you already know the cycle:

  1. Ship a change
  2. Metrics dip or a customer yells
  3. Bash in a random fix
  4. Repeat

It's like a bad game of whack-a-mole: expensive, demoralizing, and played completely in the dark. When a failure pops up, you have no idea what the root cause is. Is it the prompt? The retrieval step? The specific model version? The way you chained two tools together?

Without data, you're forced to guess. You pick one lever out of dozens—"let's try tweaking the prompt"—and push a fix. But you're flying blind. You don't know if it solved the original problem, and you have no idea if it just created three new, more subtle bugs elsewhere. This is the cycle of pain.

The antidote is systematic evaluation that is tied to your application’s real definition of quality. Not “does it solve GSM8K” quality. Your quality.

[Figure: The whack-a-mole cycle without proper evals: ship a change, metrics dip or a customer yells, bash a random fix, new issues emerge. Expensive, demoralizing, never-ending.]

When someone says "we have evals", ask: can you name the top three failure modes in production right now, and their prevalence? If the answer is no, you don't really have evals; you're playing whack-a-mole.

The Eval Flywheel

[Figure: The Eval Flywheel: 1. Analyze (stare at traces, label, cluster; output is a taxonomy of failure modes), 2. Measure (convert insights to signals; output is a dashboard, no more "feels better"), 3. Improve (pick levers by impact: prompts, models, RAG, fine-tuning), then loop back to Analyze.]

1. Analyze (Superpower Phase)

You stare at traces. Lots of them. You label, cluster, and articulate failure modes in plain language. You resist the urge to outsource or auto-judge too early because the gold is in your own eyes-on-glass time.

Concrete output: A taxonomy of failure modes, each with examples of pass and fail. You stop when new data stops revealing new modes.
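A taxonomy does not need tooling to start; a plain data structure is enough. Here is a minimal sketch, with hypothetical failure-mode names and trace IDs:

```python
from dataclasses import dataclass, field

@dataclass
class FailureMode:
    """One open-coded failure mode with hand-labeled example traces."""
    name: str
    description: str
    pass_examples: list = field(default_factory=list)
    fail_examples: list = field(default_factory=list)

# Hypothetical taxonomy built up while reading traces
taxonomy = [
    FailureMode(
        name="missing_plan_mention",
        description="Welcome email never names the plan the user subscribed to.",
        fail_examples=["trace_0042", "trace_0107"],
        pass_examples=["trace_0013"],
    ),
    FailureMode(
        name="invalid_json",
        description="Output cannot be parsed as the required JSON object.",
        fail_examples=["trace_0088"],
    ),
]
```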

2. Measure

Now you scale. Convert qualitative nuggets into quantitative signals. Maybe that is an LLM judge, maybe regex, maybe a human-in-the-loop sampler. The bar: if the judge disagrees with you a lot, fix the judge before you fix the model.

Concrete output: A dashboard that tells you how often each failure mode shows up. No more “it feels better”.
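As a sketch of what "measure" can mean in practice, here is a minimal prevalence counter over judged traces. The trace format and failure-mode names are assumptions, not a specific framework:

```python
from collections import Counter

def failure_dashboard(judged_traces):
    """judged_traces: [{"trace_id": "t1", "failures": ["missing_plan_mention"]}, ...]
    Returns the fraction of traces exhibiting each failure mode."""
    counts = Counter()
    for trace in judged_traces:
        counts.update(set(trace["failures"]))  # count each mode at most once per trace
    total = len(judged_traces)
    return {mode: count / total for mode, count in counts.most_common()}

rates = failure_dashboard([
    {"trace_id": "t1", "failures": ["missing_plan_mention"]},
    {"trace_id": "t2", "failures": []},
    {"trace_id": "t3", "failures": ["missing_plan_mention", "invalid_json"]},
])
print(rates)  # {'missing_plan_mention': 0.666..., 'invalid_json': 0.333...}
```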

3. Improve

Only now do you reach for the toolbox: prompt tweaks, model swaps, RAG changes, agent decomposition, fine-tuning, cost optimizations. You pick levers based on prevalence × impact. Then you loop back to Analyze to see what broke.
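A back-of-the-envelope version of that prioritization, with made-up prevalence numbers and severity scores:

```python
# prevalence comes from the dashboard; severity (1-5) is your own judgment call
failure_modes = {
    "missing_plan_mention": {"prevalence": 0.34, "severity": 4},
    "invalid_json": {"prevalence": 0.08, "severity": 5},
    "too_formal_tone": {"prevalence": 0.51, "severity": 2},
}

ranked = sorted(
    failure_modes.items(),
    key=lambda item: item[1]["prevalence"] * item[1]["severity"],
    reverse=True,
)
for name, scores in ranked:
    print(name, round(scores["prevalence"] * scores["severity"], 2))
# missing_plan_mention 1.36, too_formal_tone 1.02, invalid_json 0.4
```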

The Three Gulfs Model: Why This Is Hard

To ship something users trust, you have to cross three gaps. Different hats, different muscles.

[Figure: The Three Gulfs Model: Comprehension (do you know what users ask?), Specification (can you define "good"?), Generalization (will it work in the wild?). Different hats, different muscles: data science → product → engineering.]
  1. Gulf of Comprehension - Do you actually know what users are asking and what the system outputs across the long tail? You cannot read 10k chats, so you need sampling, clustering, and smart surfacing. Surprise is the signal you missed something.

  2. Gulf of Specification - Humans are vague by default. “Give me an easy recipe” is useless. What is “easy”? Ingredient count? Total cook time? Skill floor? Until you pin those down as do’s and don’ts, the model will improvise in ways you don't enjoy.

  3. Gulf of Generalization - LLMs do not generalize like humans. Passing A and B does not imply passing C. You need guardrails, decomposition, and continuous checks so “works on my prompt” does not turn into “oops in prod”.

If you feel whiplash switching hats between data science, product, and engineering, that is normal. There is no unicorn degree that teaches all three. The muscle is the context switch.
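For the Gulf of Comprehension, the cheapest way to start is rough clustering over a sample of user messages. A minimal sketch with scikit-learn and TF-IDF (swap in real embeddings if you have them; the example messages are made up):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample pulled from your trace store
user_messages = [
    "give me an easy recipe for dinner",
    "easy weeknight pasta please",
    "cancel my subscription",
    "how do I cancel my plan",
    "what's included in the pro plan",
]

vectors = TfidfVectorizer().fit_transform(user_messages)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

# Eyeball each cluster; surprise is the signal you missed something
for cluster_id in range(3):
    print(cluster_id, [m for m, l in zip(user_messages, labels) if l == cluster_id])
```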

Prompts Are Specs, Not Poetry

[Figure: From vague request ("Write a welcome email") to clear spec (role and objective, user context, numbered instructions, JSON output format). The result is measurable: "output missing user_plan value" instead of "the model isn't creative enough".]

A "good enough" initial prompt saves you weeks. Consider including:

  1. Role + Objective: “You are X. Your goal is Y for user Z.”
  2. Specific Guidelines: Ordered steps, constraints, hard don’ts.
  3. Required Format: JSON schema, sections, keys. If you can parse it, you can measure it.
  4. Relevant Context: User profile, retrieved docs, prior turns.
  5. Examples (when needed): Show one or two good outputs. Not 20. Keep train vs test clean.

For example, turning a vague request into a concrete spec looks like this:

  • Vague Request: Write a welcome email.
  • Identified Failure (from Analysis): Emails are too generic and don't mention the user's specific plan.
  • The Fix (A Better Spec):
You are an expert onboarding specialist for a SaaS company.
Your goal is to write a friendly, concise welcome email to a new user.

**User Details:**
- Name: {{user_name}}
- Plan: {{user_plan}}

**Instructions:**
1. Address the user by their first name.
2. Thank them for signing up.
3. Explicitly mention the **{{user_plan}}** they subscribed to.
4. Highlight one key benefit of their specific plan.
5. End with a clear call-to-action to log in and get started.

**Output Format:**
Return a single JSON object with the following keys:
- "subject": A string for the email subject line.
- "body": A string for the email body.

This simple change moves the failure from a vague "the model isn't creative enough" to a specific, measurable "the output didn't contain the user_plan value."
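Because the spec demands a parseable format, those requirements become cheap code checks. A minimal sketch (the function and check names here are mine, not a standard):

```python
import json

def check_welcome_email(raw_output, user_plan):
    """Turn the spec's requirements into pass/fail signals for one trace."""
    checks = {"valid_json": False, "has_subject_and_body": False, "mentions_plan": False}
    try:
        email = json.loads(raw_output)
    except json.JSONDecodeError:
        return checks
    if not isinstance(email, dict):
        return checks
    checks["valid_json"] = True
    checks["has_subject_and_body"] = {"subject", "body"} <= set(email)
    checks["mentions_plan"] = user_plan.lower() in email.get("body", "").lower()
    return checks

print(check_welcome_email(
    '{"subject": "Welcome!", "body": "Thanks for joining the Pro plan."}', "Pro"
))  # {'valid_json': True, 'has_subject_and_body': True, 'mentions_plan': True}
```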

Skip chain-of-thought boilerplate unless you are on an ancient model. Modern reasoning models already “think”.

The point is not perfection. It is to avoid wasting error analysis cycles on “it did not return JSON” or “it forgot to ask about allergies” when you never told it to.

Taste Is The Moat

Anyone can stand up a generic agent. No one can copy your taste in what great looks like for your product unless you spell it out. Specs are how you inject that taste. As models improve, generic specs can shrink, but your taste layer stays.

LLM Judges: Friends With Caveats

LLM-on-LLM evals are powerful. They are also easy to get wrong:

  • Judge prompt misaligned with your expectations = garbage metrics
  • Leakage between few-shot examples and test data = overfit judge
  • No calibration vs human labels = delusion

Treat judges like any other model: prompt them well, validate, monitor drift.
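Validation can be as simple as comparing the judge to your own labels on a held-out set. A minimal sketch with made-up labels:

```python
# Human and judge labels for the same validation traces: True = failure present
human = [True, False, True, True, False, False, True, False, False, True]
judge = [True, False, False, True, False, True, True, False, False, True]

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
recall = sum(h and j for h, j in zip(human, judge)) / sum(human)

print(f"raw agreement: {agreement:.0%}")                        # 80%
print(f"judge catches {recall:.0%} of human-labeled failures")  # 80%
# For a chance-corrected view, sklearn.metrics.cohen_kappa_score(human, judge) works too.
```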

Synthetic Data: Use It, Do Not Marry It

Bootstrapping with synthetic traces is great when you lack logs. Still, ground yourself in real user data ASAP. Synthetic distributions lie in subtle ways.

How To Actually Start (A Mini SOP)

  1. Draft the system prompt for your current app. Role, objective, do/don’t, format. Ship it to staging.
  2. Collect 50–100 traces. If you have none, generate synthetic but sanity-check.
  3. Open-code the failures. Write down every unique failure mode you see. Merge duplicates. Stop when saturation hits.
  4. Prioritize by frequency × severity.
  5. Write judge prompts or simple checks for top modes (see the sketch after this list). Validate on a labeled subset.
  6. Fix high ROI issues. Rinse and repeat.
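For step 5, a judge can start as nothing more than a tight prompt with a binary verdict. A minimal sketch; `call_llm` is a placeholder for however you invoke your model, and the failure-mode definition is illustrative:

```python
JUDGE_PROMPT = """You are grading one output of a welcome-email generator.

Failure mode: missing_plan_mention
Definition: the email body never explicitly names the plan the user subscribed to.

User plan: {user_plan}
Email body:
{email_body}

Answer with exactly one word: PASS or FAIL."""

def judge_plan_mention(email_body, user_plan, call_llm):
    """call_llm(prompt) -> str is assumed; wire in your own client."""
    prompt = JUDGE_PROMPT.format(user_plan=user_plan, email_body=email_body)
    verdict = call_llm(prompt).strip().upper()
    return verdict == "PASS"
# Validate this judge against your human labels (step 5) before trusting its numbers.
```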

Closing

If your current eval process feels like spreadsheets, scattered Slack threads, and “seems fine”, you are leaving money, trust, and sanity on the table. Put structure around your evals with an application-first lens. Obsess over error analysis. Encode your taste. Measure what matters. Then improve with purpose.

Do that, and you stop playing whack-a-mole and start compounding.

If you want help building out your own eval system or want to see examples of judge prompts and failure taxonomies, reach out. I'm always happy to share what I've learned.
