Your AI is broken. You just don't know how broken yet.
You ship a customer service bot. Metrics look good. Users complain anyway.
Customer: "I ordered a blue widget but got red. Can I return it?"
Bot: "Absolutely! Return started. Label coming soon."
Customer: "Great! When will my blue widget ship?"
Bot: "I don't see a blue widget order. Want to place one?"
Customer: "WHAT? I just said I ordered it last week!"
Bot: "I understand you'd like to purchase a blue widget..."
[Customer rage-quits]
Where did it break? Turn 2? Turn 3? All of them?
Each response seems... reasonable? But the conversation is a trainwreck.
You have 200 similar threads. Reading them all will take three days, and you'll miss half the issues anyway.
There's a better way.
The Framework That Actually Works
The process has four steps:
(1) Analyze - understand what's failing by working with domain experts to identify patterns.
(2) Measure - build automated evaluators that catch specific failure modes.
(3) Improve - make targeted fixes and re-run evals to prove they worked.
(4) Repeat - integrate into CI/CD so quality never regresses.
The key insight: You can't measure what you don't understand.
Most teams jump straight to building automated evaluators and miss application-specific failures. Metrics look good. Users stay frustrated.
Let me show you the process that actually works.
Day 0: Define Your Dimensions
Before touching any traces, organize the problem space with dimensions - axes of variation where your AI is most likely to break.
The goal: actively try to break your system by targeting its weakest points.
Don't know where it will fail? Use your product yourself. Spend 2-3 hours actually using the AI like a real user would. The pain points will become obvious. If you don't have intuition about failure modes, you're not ready to evaluate yet.
For our hypothetical SupportBot (a customer service AI), I'd use these three dimensions:
1. Issue Type: Order status inquiry, return/refund request, product question, account issue, complaint/escalation.
2. Customer Context: New customer (first purchase), repeat customer, high-value customer (over $1000 lifetime), angry customer (complaint language).
3. Conversation Complexity: Single-turn (one question, one answer), multi-turn clarification (2-3 exchanges), complex negotiation (4+ turns), context-dependent (references earlier turns).
These aren't arbitrary. Each combination targets a specific way the system breaks:
- Return requests + multi-turn → the bot loses track of the original intent
- Angry customers + context-dependent → the bot misses emotional cues
- Order status + repeat customers → the bot fails to use order history correctly
You're building adversarial test cases on purpose. If your dimensions don't make you nervous about your system, they're not aggressive enough.
Create 20 Test Scenarios (By Hand)
Generate specific test cases from dimension combinations. Focus on scenarios where you expect the system to fail.
The goal is to stress-test your AI, not to prove it works.
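A minimal sketch of how I turn the Day 0 dimensions into a scenario list (the values mirror the dimensions above; sampling 20 combinations just matches the hand-written count):

import itertools
import random

# Dimension values from Day 0 (abbreviated for the sketch)
issue_types = ["order status", "return/refund", "product question", "account issue", "complaint"]
customer_contexts = ["new", "repeat", "high-value", "angry"]
complexities = ["single-turn", "multi-turn clarification", "complex negotiation", "context-dependent"]

# Every combination is a candidate; sample 20 to write out by hand
all_combos = list(itertools.product(issue_types, customer_contexts, complexities))
random.seed(7)
for i, (issue, context, complexity) in enumerate(random.sample(all_combos, 20), start=1):
    print(f"Scenario {i}: ({issue}, {context}, {complexity})")

The combinations are only prompts for your own writing. The conversations themselves still get written by hand.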
Scenario 7: (Return request, Repeat customer, Context-dependent)
Hypothesis: Bot will forget the original issue after 3+ turns of clarification
Test conversation:
Customer: "I need to return my order"
Bot: "I'd be happy to help! Can you provide your order number?"
Customer: "I don't have it handy, I ordered it last Tuesday"
Bot: "I see you have two orders from last Tuesday.
One for a blue widget ($49.99) and one for red socks ($12.99).
Which would you like to return?"
Customer: "The widget"
Bot: "Great! I can help you place an order for a widget..."
Expected failure: Bot loses "return" intent after product identification.
Critical lesson: Don't use an LLM to simulate customer responses. AI-generated conversations are bland and generic. They miss the weird edge cases real humans create.
Instead, give team members character sheets to roleplay:
ANGRY_CUSTOMER persona:
- Uses frustrated language ("This is ridiculous", "Are you even listening?")
- Expects bot to remember previous statements
- Will escalate if bot repeats itself
- Response pattern: Gets progressively shorter/terser
Real humans create authentic friction that reveals real bugs.
Day 1: Open Coding (4 Hours with Domain Expert)
This is where the magic happens. Sit with your Head of Support and review traces together.
Use production traces if they exist. Real user interactions are the gold standard - they're representative of actual use and contain the weird edge cases you'd never think to simulate. Synthetic test cases are a fallback if you have no production data yet.
Morning (2 hours): Prep
- Extract 50 recent conversations from production
- Quick LLM pass to categorize them for coverage check
- Select 25-30 diverse conversations
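The "quick LLM pass" can be a single categorization call per conversation. A sketch, assuming an OpenAI-style client and a `conversations` list holding the 50 extracted transcripts (both are assumptions; swap in whatever stack you actually use):

from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def categorize(transcript: str) -> str:
    """One coarse issue-type label per conversation, only used to check coverage before sampling."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any small, cheap model works for this
        messages=[
            {"role": "system", "content": "Label this support conversation with one issue type: "
                                          "order_status, return_refund, product_question, account, complaint."},
            {"role": "user", "content": transcript},
        ],
    )
    return (response.choices[0].message.content or "").strip()

# coverage = Counter(categorize(t) for t in conversations)  # `conversations`: your 50 extracted transcripts
# print(coverage)  # make sure the 25-30 you select span all the issue types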
Afternoon (2 hours): Open Coding
Review each conversation. The domain expert writes freeform observations. No categories. No checkboxes. Just "what went wrong?"
Example - Conversation 8 of 30:
Customer: "Where's my order? It's been 2 weeks."
Bot: "I'd be happy to check on that! Can you provide your order number?"
Customer: "I don't have it. Can you look it up by my email?"
Bot: "Sure! I found 3 orders under your email. Which one?"
Customer: "The one from 2 weeks ago, obviously."
Bot: "Your order from 2 weeks ago (Order #7723) shows delivered 5 days ago. The tracking shows it was left on your front porch."
Customer: "I never got it. Someone must have stolen it."
Bot: "I'm sorry to hear that! You can file a claim with the shipping carrier..."
Notes: Bot gave correct info but missed that the customer needs a REPLACEMENT, not instructions for filing claims. It should proactively offer a replacement when delivery is confirmed but the customer says they never received the package.
By conversation 23, your domain expert says: "This is the same pattern as conversations 9, 14, and 19."
That's your signal to stop. When you stop seeing new patterns (usually around 25 conversations), you've hit saturation.
Day 2: Axial Coding (Find the Pattern)
Take all freeform observations and group them.
I print them out. Spread on table. Make piles:
Pile 1 → Intent Drift
Lost track of return intent | Forgot customer wanted refund | Switched from 'exchange' to 'new order'
Failure Mode: Intent drift across turns (28% of failures)
Pile 2 → Missing Actions
Gave tracking info but didn't offer replacement | Provided return window but didn't start return
Failure Mode: Information without action (22%)
Pile 3 → Memory Issues
Asked for order number customer already provided | Re-asked shipping address from turn 2
Failure Mode: Short-term memory failure (18%)
For SupportBot, the complete taxonomy:
- Intent drift across turns (28%)
- Information without action (22%)
- Short-term memory failure (18%)
- Escalation trigger miss (15%)
- Product knowledge gaps (10%)
- Tone mismatch (7%)
Now you know exactly what's broken.
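Once the piles exist, the percentages are just counting. A tiny sketch (the tags are the failure modes above; assigning a tag to each note is the manual step you just did):

from collections import Counter

# One failure-mode tag per open-coding observation, assigned during axial coding
axial_codes = [
    "intent_drift", "info_without_action", "intent_drift", "memory_failure",
    # ... one entry per observation
]

counts = Counter(axial_codes)
total = sum(counts.values())
for mode, n in counts.most_common():
    print(f"{mode}: {n / total:.0%} of failures")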
Day 3: Full Labeling Session
Your domain expert labels all conversations (200+) using the taxonomy.
Build a simple UI:
☐ Intent drift
☑ Information without action
☐ Memory failure
☐ Escalation miss
☐ Product knowledge
☐ Tone mismatch
Point of first failure: ○ Turn 1 ○ Turn 2 ● Turn 3 ○ N/A
Conversation outcome: ○ Resolved ● Unresolved
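A minimal sketch of that UI in Streamlit (my choice for the sketch, not a requirement; the conversations.json file and its fields are assumptions):

import json
import streamlit as st

FAILURE_MODES = [
    "Intent drift", "Information without action", "Memory failure",
    "Escalation miss", "Product knowledge", "Tone mismatch",
]

conversations = json.load(open("conversations.json"))  # assumed: list of {"id", "transcript"}
idx = st.number_input("Conversation #", min_value=0, max_value=len(conversations) - 1, value=0)
st.text_area("Transcript", conversations[idx]["transcript"], height=300)

labels = {mode: st.checkbox(mode) for mode in FAILURE_MODES}
first_failure = st.radio("Point of first failure", ["Turn 1", "Turn 2", "Turn 3", "N/A"])
outcome = st.radio("Conversation outcome", ["Resolved", "Unresolved"])

if st.button("Save"):
    record = {"id": conversations[idx]["id"], "labels": labels,
              "first_failure": first_failure, "outcome": outcome}
    with open("labels.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    st.success("Saved")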
8 hours later: 203 labeled conversations.
Now you have data: 58% resolution rate overall. The top 3 failures account for 68% of issues. Intent drift happens in conversations over 3 turns. Information without action correlates with returns and refunds.
This quantification tells you what to fix and in what order.
Days 4-5: Build LLM-as-Judge Evaluators
Automate evaluation for the top two failure modes.
Evaluator 1: Intent Drift Detector
You are evaluating whether a support bot maintained the customer's
original intent throughout the conversation.
Rules:
- FAIL if the customer states an intent (return, refund, track order)
and the bot later addresses a different intent without explicit
customer redirect
- PASS if bot stays on task or only switches when customer explicitly
changes topic
Examples:
[FAIL: Customer wants return, bot switches to new order]
[PASS: Customer wants return, bot completes return flow]
[PASS: Customer changes mind from return to exchange, bot adapts]
Answer only: PASS or FAIL
Validation (Critical): Split labeled data into train (80%), dev (10%), and test (10%). Run on dev set, measure TPR and TNR. Iterate on prompt until metrics are acceptable. Final test: TPR=0.89, TNR=0.86.
The key: Never evaluate on data you used for prompt engineering. That's like grading yourself on the practice exam.
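A sketch of the validation step, assuming `judge(conversation)` wraps the prompt above and the human labels come from the Day 3 session:

def tpr_tnr(dev_set, judge):
    """dev_set: list of (conversation, human_label) pairs, labels are 'PASS' or 'FAIL'."""
    tp = tn = fp = fn = 0
    for conversation, human_label in dev_set:
        predicted = judge(conversation)  # 'PASS' or 'FAIL' from the LLM judge
        if human_label == "FAIL":        # a real failure counts as a "positive"
            if predicted == "FAIL":
                tp += 1
            else:
                fn += 1
        else:
            if predicted == "PASS":
                tn += 1
            else:
                fp += 1
    return tp / (tp + fn), tn / (tn + fp)

# tpr, tnr = tpr_tnr(dev_set, intent_drift_judge)  # iterate on the prompt until both are acceptable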
When to Use Deterministic vs. LLM Evaluators
Not every evaluation needs an LLM. Choose the right tool for the job:
Use deterministic checks when criteria are objective:
- ✓ Response length (chars/tokens)
- ✓ JSON schema validation
- ✓ Required fields present
- ✓ Regex pattern matching
- ✓ Exact string matching
- ✓ Response time / latency
Advantages: Fast, cheap, fully repeatable, no variance
Use LLM judges when criteria need judgment:
- ✓ Intent preservation
- ✓ Tone appropriateness
- ✓ Factual accuracy (when facts vary)
- ✓ Semantic similarity
- ✓ Helpfulness / completeness
- ✓ Context awareness
Trade-offs: Slower, costs money, needs validation (TPR/TNR)
Use deterministic checks first (fast, cheap failures), then LLM evaluation for nuanced criteria. For example: Check response has required fields (deterministic) → Check if answer is helpful (LLM judge).
Real example from SupportBot:
- ✅ Deterministic: Does response include order number when customer asks for order status?
- ✅ Deterministic: Is response under 500 characters?
- ✅ LLM-judge: Did bot maintain the customer's original intent across the conversation?
- ✅ LLM-judge: Is the tone appropriate for an angry customer?
Start with deterministic when possible. Add LLM judges only for criteria humans can judge but code cannot.
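A sketch of that ordering for SupportBot (the order-number regex and the `llm_judge` callable are assumptions for illustration):

import re

def evaluate_response(customer_message: str, bot_response: str, llm_judge) -> dict:
    results = {}

    # Deterministic checks first: cheap, fast, no variance
    results["under_500_chars"] = len(bot_response) <= 500
    if "order status" in customer_message.lower():
        # Assumes order numbers look like "#7723" in this system
        results["includes_order_number"] = bool(re.search(r"#\d+", bot_response))

    # Only pay for the LLM judge if the cheap checks pass
    if all(results.values()):
        results["intent_preserved"] = llm_judge(customer_message, bot_response) == "PASS"

    return results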
Days 6-7: The Fix
You know what's broken. Now fix it.
1. Intent Tracking
# Add explicit intent tracking
conversation_state = {
    "primary_intent": None,   # Set on turn 1 (e.g. "return")
    "confirmed_actions": [],  # Log confirmed actions (return started, refund issued, ...)
}

# Before each response: if the topic has drifted away from the original intent,
# flag it and ask the customer to confirm before switching
if current_topic != conversation_state["primary_intent"]:
    ...  # flag drift, ask for confirmation
2. Action Prompting
Updated system prompt:
"After providing information, ALWAYS suggest next action:
- If order lost: Offer replacement or refund
- If return window open: Start return process
- If product question: Add to cart or compare"
3. Memory Enhancement
# Track entities the customer has already mentioned
mentioned_entities = {
    "order_number": None,
    "product": None,
    "shipping_address": None,
}

# Never re-ask for info already provided
if mentioned_entities["order_number"] is None:
    ...  # only ask for the order number if we don't already have it
Days 8-9: Re-Evaluation
Replay the original 50 failure conversations through the fixed system.
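A sketch of the replay harness (`run_supportbot` and `evaluate_conversation` are placeholders for whatever your stack exposes; replaying recorded customer turns against a changed bot is an approximation, since real customers would react to the new replies):

import json

def replay(conversations, run_supportbot, evaluate_conversation):
    """Re-run each recorded customer side through the fixed bot, then re-evaluate."""
    results = []
    for convo in conversations:
        customer_turns = [t["text"] for t in convo["turns"] if t["role"] == "customer"]
        new_transcript = run_supportbot(customer_turns)        # fixed system under test
        results.append(evaluate_conversation(new_transcript))  # deterministic + LLM-judge evals
    return results

# results = replay(json.load(open("failure_conversations.json")), run_supportbot, evaluate_conversation)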
Before:
- 58% resolution rate
- Intent drift: 28%
- Info without action: 22%
- Memory failure: 18%
After:
- 79% resolution rate (+21 points!)
- Intent drift: 9% (-19)
- Info without action: 8% (-14)
- Memory failure: 6% (-12)
Domain expert spot-checks 30 conversations: "Way better. Bot actually completes tasks now instead of just chatting."
Day 10: CI Integration
Add to your CI pipeline:
# .github/workflows/eval.yml
- name: Run SupportBot Evaluators
  run: |
    python eval/run_evaluators.py \
      --min-resolution-rate 0.75 \
      --min-tpr 0.85 \
      --min-tnr 0.82
Now every prompt change gets evaluated. If metrics regress, build fails.
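The script behind it can be as simple as thresholds plus a non-zero exit code. A sketch (the metric computation is stubbed out; the flags match the workflow above):

# eval/run_evaluators.py (sketch)
import argparse
import sys

def compute_metrics() -> dict:
    # Assumed: run the deterministic checks and LLM judges over the eval set and
    # return {"resolution_rate": ..., "tpr": ..., "tnr": ...}
    raise NotImplementedError("plug in your eval suite here")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-resolution-rate", type=float, required=True)
    parser.add_argument("--min-tpr", type=float, required=True)
    parser.add_argument("--min-tnr", type=float, required=True)
    args = parser.parse_args()

    metrics = compute_metrics()
    failures = []
    if metrics["resolution_rate"] < args.min_resolution_rate:
        failures.append("resolution rate")
    if metrics["tpr"] < args.min_tpr:
        failures.append("TPR")
    if metrics["tnr"] < args.min_tnr:
        failures.append("TNR")

    if failures:
        print(f"Regression in: {', '.join(failures)}")
        sys.exit(1)  # non-zero exit fails the CI job

    print("All evaluators passed")

if __name__ == "__main__":
    main()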
The Real Lesson: Process Over Tools
What makes this work? The systematic process, not any particular tool or model.
This works for any LLM application:
- Code generation assistants
- Document Q&A systems
- Content moderation
- Data extraction pipelines
- Whatever you're building
The 10-day structure is portable. The specific dimensions change. The process doesn't.
From "It Feels Broken" to "We Fixed These 3 Things"
The Transformation
Before: "SupportBot is unreliable. Sometimes it works, sometimes it doesn't. Not sure why."
After: "SupportBot had a 28% intent drift rate in multi-turn conversations, now reduced to 9%. Resolution rate improved from 58% to 79%. Regression tests prevent backsliding."
The second statement leads to action, budget, and trust.
The first leads to "let's try a different model" (which won't fix the real issues).
Start Here
Don't try to implement all 10 days at once. Start with two hours: sit down with a domain expert and open-code 20 real conversations.
If those 2 hours reveal patterns you didn't know about, keep going.
If they don't, your AI probably isn't complex enough to need this yet.
Remember: You can't fix what you can't measure. You can't measure what you don't understand. Stop guessing. Start measuring.
Disclaimer: This is a pedagogical exercise. All scenarios, companies, and examples are fictional. I'm under NDA for my actual consulting work. If this sounds like your company, it's pure coincidence.
Ready to Build Production AI Agents?
Let's discuss how AI agents can transform your business operations
Book a Strategy Call