
The Most Valuable Part of Evals Cannot Be Automated

A simple, non-technical guide to fixing AI agents by analyzing what went wrong, measuring the impact, and improving systematically.

Ryan Brandt
AI Agent Consultant & Founder of Vunda AI
8 min read

Inspired by the course led by Hamel Husain and Shreya Shankar. The stories and examples are mine.


The Whole Loop in One Picture

[Figure: The Eval Process — Analyze → Measure → Improve. 1) Capture traces: save complete conversation logs and tool outputs (setup: 1–2 hours). 2) Get ~100 diverse conversations: define dimensions, create test scenarios, use real or synthetic data, stop at saturation (2–4 hours). 3) Open coding: read each trace and write quick notes on what went wrong (30 sec/trace). 4) Axial coding: group similar failures into categories and build a taxonomy (1–2 hours). 5) Measure impact: count frequency and severity, prioritize by impact × frequency (30 min). 6) Improve the system: fix top issues, add evaluators, then loop back (variable). Repeat the cycle. Key outputs: full trace logs, failure taxonomy, priority matrix.]

Analyze → Measure → Improve

  1. Analyze: Look at what your AI did, spot what went wrong, name it.
  2. Measure: Count how often each type of problem shows up.
  3. Improve: Fix the biggest issues, then repeat.

The rest of this post walks through the Analyze step in detail and shows how it connects to Measure and Improve.


Step 1. Capture Complete Traces

Plain English: Save everything that happened in each conversation with the AI so you can replay it later. Tools like LangSmith or Braintrust can be good for this, but vibe coding your own dashboards that interface with the backend systems the agent is built around is the best method.

[Figure: Anatomy of a complete trace — user input, system prompt, tool calls, and the final answer.]

What a trace should include

  • The user's exact message (or voice transcript)
  • The instructions you gave the model (system or developer prompt)
  • Every model reply
  • Every tool call the model made, with inputs and outputs
  • The final answer the user saw
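A minimal sketch of what one trace record could look like if you log it in Python; the field and class names are illustrative assumptions, not any particular tool's schema:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    name: str                 # e.g. "press_dtmf", "lookup_rx" (illustrative names)
    inputs: dict[str, Any]    # arguments the model passed
    output: Any               # what the tool returned

@dataclass
class Trace:
    trace_id: str
    system_prompt: str        # instructions given to the model
    user_messages: list[str]  # exact messages or voice transcripts
    model_replies: list[str]  # every intermediate reply, not just the last one
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_answer: str = ""    # what the user actually saw
```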

Example (mine): While building an agent that called pharmacies for Novo Nordisk, one call “confirmed” a prescription transfer but the pharmacy never received it. Because our trace stored the IVR steps, the DTMF keys pressed, and the tool outputs, we saw the agent pressed 0 instead of 1 and got routed to the wrong menu. Without the full trace we would've been helpless to fix it. 

⚠️ Common mistakes

  • Logging only the final answer
  • Storing tool outputs in a different system so reviewers cannot see them
  • Losing the system prompt or intermediate messages

Have the right tooling

  • Put everything on one screen when you annotate: trace, tool results, and (when relevant) the downstream system state (e.g., "was the tour actually scheduled?")
  • Flag or drop partial/garbled traces (import/logging bugs) so they don't pollute your stats
  • Consider a small "cheat sheet/codebook" field in your tool to keep phrasing consistent across reviewers
  • Watch for UI/modality mismatches (e.g., scheduling by chat vs. just giving a Calendly link). That's a legitimate failure class
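If you build your own annotation dashboard, the record behind each review screen can stay tiny. A sketch with assumed field names, not any existing tool's format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    trace_id: str
    note: str                             # one short open-coding note (Step 3)
    codebook_tag: str = ""                # shared phrasing from the team codebook
    partial_trace: bool = False           # logging/import glitch: exclude from stats
    downstream_ok: Optional[bool] = None  # e.g., "was the tour actually scheduled?"
```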

Step 2. Get About 100 Diverse, Realistic Conversations

If you have production data, use it (with proper privacy). If you don't, create synthetic conversations carefully.

Start with hypotheses & stress tests

  • Use your product (or watch real users) dozens of times adversarially: try the weird, annoying, or expensive cases. Note where it breaks.
  • Those hypotheses guide which dimensions you sample. “Realistic + high‑signal” beats “random + bland.”
  • Don't just ask an LLM for sample queries. It will happily generate 50 near-duplicates.
  • Brainstorm too many dimensions first (30–50). Then keep the 3–5 that actually change behavior.

Dimensions drive diversity

A dimension is an aspect of the request that meaningfully changes how your agent should behave. There can be many. The three below are examples, not a checklist. Your set depends on your product. Put your product hat on and empathize with your users. If you can't list the key dimensions, that's a loud signal you don't understand who you're building for or what they need.

Novo Nordisk pharmacy-calling agent (my project):

  • Intent: check stock, transfer a prescription, fill a prescription, verify insurance info
  • Persona: pharmacist vs pharmacy technician (techs often lack authority or information and need to transfer us)
  • Complexity / Channel noise: clear call vs bad audio, background noise, garbled transcription, asking for data the agent does not have

Other products will have different axes: language, jurisdiction, urgency, safety risk, etc. Brainstorm many, then keep the 3 to 5 that change behavior the most.

How to create useful synthetic data

  1. List values for each chosen dimension (at least three per dimension).
  2. Combine them into tuples (Intent × Persona × Complexity, for example).
  3. Turn each tuple into a realistic prompt or call scenario.
  4. Run the bot and log the traces.
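A minimal sketch of steps 1–3; the dimension values and prompt wording here are illustrative stand-ins, not the project's actual scenarios:

```python
import itertools

# 1. List values for each chosen dimension
dimensions = {
    "intent": ["check stock", "transfer a prescription", "fill a prescription"],
    "persona": ["pharmacist", "pharmacy technician"],
    "complexity": ["clear call", "noisy line", "asks for data the agent lacks"],
}

# 2. Combine them into tuples
tuples = list(itertools.product(*dimensions.values()))

# 3. Turn each tuple into a realistic call scenario
def to_scenario(intent: str, persona: str, complexity: str) -> str:
    return (
        f"Simulate a call where the agent tries to {intent}. "
        f"The person answering is a {persona}. Condition: {complexity}."
    )

scenarios = [to_scenario(*t) for t in tuples]
print(len(scenarios), "scenarios")  # 3 x 2 x 3 = 18; add values until you approach ~100
```

Each generated scenario then gets run against the bot (step 4) and its trace logged like any real conversation.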

Why ~100? The goal is to continue until you stop seeing brand new failure types. That plateau is called theoretical saturation. Stopping early usually means you missed costly edge cases.


Step 3. Open Coding: First Pass Notes

Read each trace and write one short note about what went wrong. Keep it fast.

Rules

  • Stop at the first upstream failure. Once it goes wrong, everything downstream is noise.
  • Use concrete language that you can later group. You're building raw material for a taxonomy.
  • Don't fix it yet! Now is the time to label, not fix. 
  • Don't automate this (yet). Even GPT‑4.1 missed obvious errors in the course's exercises; manual review is the "secret weapon."
  • User mistakes ≠ model failures. Only note typos/garble if the model mishandled them.
  • Logging glitches are their own issue. Mark and remove/fix partial traces rather than labeling them as model errors.

Novo Nordisk examples:

  • "Did not pass IVR. Pressed 0, needed to press 1. Routed to general store, never reached pharmacy"
  • "Got stuck in IVR loop, no escape command triggered"
  • "Misheard Rx number due to noise, asked for refill on wrong medication"
  • "Pharmacist asked for information we do not have (patient DOB) and conversation stalled"

Bad notes

  • Anything vague: "Confusing conversation", "Improve prompt", "Weird"
  • Typos from the user. Ignore them unless the model mishandled the typo in a meaningful way.

Remember: you will build categories from these notes. If a note is fuzzy, you hurt yourself later.

Aim for 10 to 30 seconds per trace once you find your rhythm.


Step 4. Axial Coding: Turn Notes Into Categories

[Figure: Failure taxonomy hierarchy — raw notes → clusters → named categories.]

This is where you build the taxonomy from your open codes. Group similar failures and name the buckets clearly.

How to do it

  • Dump all notes into a sheet or whiteboard
  • Cluster by similarity
  • Name each cluster so it suggests an action later ("Reschedule tool missing" beats "Scheduling weirdness").

From taxonomy to evaluators

Right after you lock your buckets, sketch how you’ll measure them automatically (regex checks, tool-call audits, LLM graders, heuristics). Axial coding should hand a spec to whoever builds evaluators.

Novo Nordisk buckets we ended with:

  1. Failed to pass IVR (wrong key, wrong path, stuck loop)
  2. Repeated a question after the pharmacist already answered
  3. Asked for the wrong medication or misread numbers/letters
  4. Spoke numbers in a way humans could not parse (read full digits instead of one at a time)
  5. Used the wrong tool or no tool at all (e.g., finished the IVR but never called the passed-IVR state tool to switch to the pharmacist-specific prompt)

You're going to have a long tail of issues, and that's ok! A few categories usually explain most pain.
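One lightweight way to hand that spec to whoever builds the evaluators is to pair each bucket with a proposed check. The mapping below is my own illustrative sketch, not a formal format:

```python
# Bucket -> how we might measure it automatically (sketch, not finished evaluators)
evaluator_spec = {
    "failed_ivr": "tool-call audit: compare DTMF presses against the chain's key map",
    "repeated_question": "LLM grader: did the agent re-ask something already answered?",
    "wrong_medication": "deterministic: Rx number in transcript must match the request payload",
    "unparseable_numbers": "regex: digits must be spoken one at a time in agent speech",
    "wrong_or_missing_tool": "tool-call audit: passed-IVR state tool must fire after the IVR ends",
}
```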


Step 5. Measure What Matters

Now count. How many traces fall in each bucket? How severe are they?

Simple spreadsheet columns

  • Trace ID
  • Failure category
  • Severity (1–3) and business impact (cost/churn/risk score)
  • Frequency counter (auto-computed)
  • Short note / exemplar quote

Prioritize by Severity × Frequency (and cost if you can estimate it).

This gives you a heat map for prioritization.
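A back-of-the-envelope version of that prioritization, assuming the spreadsheet exports to rows with the columns above (the column names and sample values are assumptions):

```python
from collections import Counter

# Each row: one annotated trace exported from the spreadsheet
rows = [
    {"trace_id": "t-001", "category": "failed_ivr", "severity": 3},
    {"trace_id": "t-002", "category": "unparseable_numbers", "severity": 2},
    {"trace_id": "t-003", "category": "failed_ivr", "severity": 3},
]

frequency = Counter(r["category"] for r in rows)
avg_severity = {
    cat: sum(r["severity"] for r in rows if r["category"] == cat) / n
    for cat, n in frequency.items()
}
# Rank buckets by Severity x Frequency
priority = sorted(frequency, key=lambda cat: frequency[cat] * avg_severity[cat], reverse=True)
for cat in priority:
    print(cat, "count:", frequency[cat], "avg severity:", round(avg_severity[cat], 1))
```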


Step 5½. Automate What You Can Evaluate

Before fixing, make the taxonomy actionable and automate deterministically whenever possible:

  • Prefer rules first: regexes, schema/format checks, and tool‑call validations wired into CI. (When we did Novo Nordisk, I used LLM graders, but simple regex + tool‑call checks in CI would have caught many issues just as effectively.)
  • Use LLM graders only when you can't easily write a deterministic check, and always spot‑check them, they have blind spots that aren't always obvious.
  • Add all evaluators to CI or nightly jobs so regressions pop immediately.
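As a concrete sketch of the "rules first" idea, here are two pytest-style checks; the digit rule and the passed_ivr tool name are assumptions for illustration, not the project's actual evaluators:

```python
import re

def digits_read_one_at_a_time(speech: str) -> bool:
    # Fail if the agent speaks a long digit run ("704555") instead of "7 0 4 5 5 5".
    return re.search(r"\d{4,}", speech) is None

def passed_ivr_tool_fired(tool_names: list[str]) -> bool:
    # Audit the tool-call list for the (assumed) prompt-switch tool after the IVR.
    return "passed_ivr" in tool_names

def test_rx_numbers_are_spelled_out():
    assert digits_read_one_at_a_time("Your Rx number is 7 0 4 5 5 5")
    assert not digits_read_one_at_a_time("Your Rx number is 704555")

def test_prompt_switch_after_ivr():
    assert passed_ivr_tool_fired(["press_dtmf", "passed_ivr", "lookup_rx"])
```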

Step 6. Improve The System

With priorities in hand, fix systematically.

Novo Nordisk fixes:

  • Built separate IVR prompt snippets and key maps for each major chain (CVS, Walgreens, Rite Aid). We charted every branch so the agent knew exactly what to press and say. Skipping IVR was expensive, so this paid off fast.
  • Added explicit wording rules for pharmacists: read Rx and phone numbers one digit and one letter at a time.
  • Created intent checks for "transfer", "stock check", "fill" so the agent asked for the right thing up front.
  • Added a fallback: if agent requests info we do not have, immediately ask to be transferred to the pharmacist or to speak with someone who can access it.
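As an illustration of the digit-by-digit wording rule, a tiny helper like this (hypothetical, not the production code) can format identifiers before they reach the voice layer:

```python
def spell_out(identifier: str) -> str:
    """Render an Rx or phone number one character at a time for text-to-speech."""
    return ", ".join(identifier.upper())

print(spell_out("rx7045"))  # "R, X, 7, 0, 4, 5"
```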

Then we reran the loop, re-measured, and kept iterating.


Quick Reference Checklist

  1. Log full traces in one place
  2. List product-specific dimensions. If you cannot, revisit your user understanding
  3. Generate about 100 realistic conversations across those dimensions
  4. Open code each trace with one clear, concrete failure note
  5. Cluster notes into a taxonomy (axial coding)
  6. Measure frequency and severity
  7. Fix top issues, add evaluators, loop back

Stick this on your wall. It works.


Credit Where It’s Due

This process comes from the course I'm taking with Hamel Husain and Shreya Shankar. They deserve the credit for the framework. I did the Novo Nordisk work before taking their course, and was delighted that some of the things I developed to aid the eval process were also suggested by them. Still, their process is more refined than mine was, so this post also functions as a post-mortem on how that project could have been improved.

If you want help applying this to your own AI product, just reach out!
