Your AI is broken. You just don't know how broken yet.
You ship a customer service bot. Metrics look good. Users complain anyway.
Customer: "I ordered a blue widget but got red. Can I return it?"
Bot: "Absolutely! Return started. Label coming soon."
Customer: "Great! When will my blue widget ship?"
Bot: "I don't see a blue widget order. Want to place one?"
Customer: "WHAT? I just said I ordered it last week!"
Bot: "I understand you'd like to purchase a blue widget..."
[Customer rage-quits]
Where did it break? Turn 2? Turn 3? All of them?
Each response seems... reasonable? But the conversation is a trainwreck.
You have 200 similar threads. Reading them all will take three days, and you'll miss half the issues anyway.
There's a better way.
The Framework That Actually Works
The process has four steps:
(1) Analyze - understand what's failing by working with domain experts to identify patterns.
(2) Measure - build automated evaluators that catch specific failure modes.
(3) Improve - make targeted fixes and re-run evals to prove they worked.
(4) Repeat - integrate into CI/CD so quality never regresses.
The key insight: You can't measure what you don't understand.
Most teams jump straight to building automated evaluators and miss application-specific failures. Metrics look good. Users stay frustrated.
Let me show you the process that actually works.
Day 0: Define Your Dimensions
Before touching any traces, organize the problem space with dimensions - axes of variation where your AI is most likely to break.
The goal: actively try to break your system by targeting its weakest points.
Don't know where it will fail? Use your product yourself. Spend 2-3 hours actually using the AI like a real user would. The pain points will become obvious. If you don't have intuition about failure modes, you're not ready to evaluate yet.
For our hypothetical SupportBot (a customer service AI), I'd use these three dimensions:
1. Issue Type: Order status inquiry, return/refund request, product question, account issue, complaint/escalation.
2. Customer Context: New customer (first purchase), repeat customer, high-value customer (over $1000 lifetime), angry customer (complaint language).
3. Conversation Complexity: Single-turn (one question, one answer), multi-turn clarification (2-3 exchanges), complex negotiation (4+ turns), context-dependent (references earlier turns).
These aren't arbitrary. Each combination targets a specific way the system breaks:
- Return requests + multi-turn → the bot loses track of the original intent
- Angry customers + context-dependent → the bot misses emotional cues
- Order status + repeat customers → the bot fails to use order history correctly
You're building adversarial test cases on purpose. If your dimensions don't make you nervous about your system, they're not aggressive enough.
Create 20 Test Scenarios (By Hand)
Generate specific test cases from dimension combinations. Focus on scenarios where you expect the system to fail.
The goal is to stress-test your AI, not to prove it works.
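A minimal sketch of how I turn the Day 0 dimensions into a scenario list (the values mirror the dimensions above; sampling 20 combinations just matches the hand-written count):

import itertools
import random

# Dimension values from Day 0 (abbreviated for the sketch)
issue_types = ["order status", "return/refund", "product question", "account issue", "complaint"]
customer_contexts = ["new", "repeat", "high-value", "angry"]
complexities = ["single-turn", "multi-turn clarification", "complex negotiation", "context-dependent"]

# Every combination is a candidate; sample 20 to write out by hand
all_combos = list(itertools.product(issue_types, customer_contexts, complexities))
random.seed(7)
for i, (issue, context, complexity) in enumerate(random.sample(all_combos, 20), start=1):
    print(f"Scenario {i}: ({issue}, {context}, {complexity})")

The combinations are only prompts for your own writing. The conversations themselves still get written by hand.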
Scenario 7: (Return request, Repeat customer, Context-dependent)
Hypothesis: Bot will forget the original issue after 3+ turns of clarification
Test conversation:
Customer: "I need to return my order"
Bot: "I'd be happy to help! Can you provide your order number?"
Customer: "I don't have it handy, I ordered it last Tuesday"
Bot: "I see you have two orders from last Tuesday.
One for a blue widget ($49.99) and one for red socks ($12.99).
Which would you like to return?"
Customer: "The widget"
Bot: "Great! I can help you place an order for a widget..."
Expected failure: Bot loses "return" intent after product identification.
Critical lesson: Don't use an LLM to simulate customer responses. AI-generated conversations are bland and generic. They miss the weird edge cases real humans create.
Instead, give team members character sheets to roleplay:
ANGRY_CUSTOMER persona:
- Uses frustrated language ("This is ridiculous", "Are you even listening?")
- Expects bot to remember previous statements
- Will escalate if bot repeats itself
- Response pattern: Gets progressively shorter/terser
Real humans create authentic friction that reveals real bugs.
Day 1: Open Coding (4 Hours with Domain Expert)
This is where the magic happens. Sit with your Head of Support and review traces together.
Use production traces if they exist. Real user interactions are the gold standard - they're representative of actual use and contain the weird edge cases you'd never think to simulate. Synthetic test cases are a fallback if you have no production data yet.
Morning (2 hours): Prep
- Extract 50 recent conversations from production
- Quick LLM pass to categorize them for coverage check
- Select 25-30 diverse conversations
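The "quick LLM pass" can be a single categorization call per conversation. A sketch, assuming an OpenAI-style client and a `conversations` list holding the 50 extracted transcripts (both are assumptions; swap in whatever stack you actually use):

from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def categorize(transcript: str) -> str:
    """One coarse issue-type label per conversation, only used to check coverage before sampling."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any small, cheap model works for this
        messages=[
            {"role": "system", "content": "Label this support conversation with one issue type: "
                                          "order_status, return_refund, product_question, account, complaint."},
            {"role": "user", "content": transcript},
        ],
    )
    return (response.choices[0].message.content or "").strip()

# coverage = Counter(categorize(t) for t in conversations)  # `conversations`: your 50 extracted transcripts
# print(coverage)  # make sure the 25-30 you select span all the issue types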
Afternoon (2 hours): Open Coding
Review each conversation. The domain expert writes freeform observations. No categories. No checkboxes. Just "what went wrong?"
Example - Conversation 8 of 30:
Customer: "Where's my order? It's been 2 weeks."
Bot: "I'd be happy to check on that! Can you provide your order number?"
Customer: "I don't have it. Can you look it up by my email?"
Bot: "Sure! I found 3 orders under your email. Which one?"
Customer: "The one from 2 weeks ago, obviously."
Bot: "Your order from 2 weeks ago (Order #7723) shows delivered 5 days ago. The tracking shows it was left on your front porch."
Customer: "I never got it. Someone must have stolen it."
Bot: "I'm sorry to hear that! You can file a claim with the shipping carrier..."
Notes: Bot gave correct info but missed that the customer needs a REPLACEMENT, not instructions for filing claims. It should proactively offer a replacement when delivery is confirmed but the customer says they never received the package.
By conversation 23, your domain expert says: "This is the same pattern as conversations 9, 14, and 19."
That's your signal to stop. When you stop seeing new patterns (usually around 25 conversations), you've hit saturation.
Day 2: Axial Coding (Find the Pattern)
Take all freeform observations and group them.
I print them out. Spread on table. Make piles:
Pile 1 → Intent Drift
Lost track of return intent | Forgot customer wanted refund | Switched from 'exchange' to 'new order'
Failure Mode: Intent drift across turns (28% of failures)
Pile 2 → Missing Actions
Gave tracking info but didn't offer replacement | Provided return window but didn't start return
Failure Mode: Information without action (22%)
Pile 3 → Memory Issues
Asked for order number customer already provided | Re-asked shipping address from turn 2
Failure Mode: Short-term memory failure (18%)
For SupportBot, the complete taxonomy:
- Intent drift across turns (28%)
- Information without action (22%)
- Short-term memory failure (18%)
- Escalation trigger miss (15%)
- Product knowledge gaps (10%)
- Tone mismatch (7%)
Now you know exactly what's broken.
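Once the piles exist, the percentages are just counting. A tiny sketch (the tags are the failure modes above; assigning a tag to each note is the manual step you just did):

from collections import Counter

# One failure-mode tag per open-coding observation, assigned during axial coding
axial_codes = [
    "intent_drift", "info_without_action", "intent_drift", "memory_failure",
    # ... one entry per observation
]

counts = Counter(axial_codes)
total = sum(counts.values())
for mode, n in counts.most_common():
    print(f"{mode}: {n / total:.0%} of failures")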
Day 3: Full Labeling Session
Your domain expert labels all conversations (200+) using the taxonomy.
Build a simple UI:
☐ Intent drift
☑ Information without action
☐ Memory failure
☐ Escalation miss
☐ Product knowledge
☐ Tone mismatch
Point of first failure: ○ Turn 1 ○ Turn 2 ● Turn 3 ○ N/A
Conversation outcome: ○ Resolved ● Unresolved
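A minimal sketch of that UI in Streamlit (my choice for the sketch, not a requirement; the conversations.json file and its fields are assumptions):

import json
import streamlit as st

FAILURE_MODES = [
    "Intent drift", "Information without action", "Memory failure",
    "Escalation miss", "Product knowledge", "Tone mismatch",
]

conversations = json.load(open("conversations.json"))  # assumed: list of {"id", "transcript"}
idx = st.number_input("Conversation #", min_value=0, max_value=len(conversations) - 1, value=0)
st.text_area("Transcript", conversations[idx]["transcript"], height=300)

labels = {mode: st.checkbox(mode) for mode in FAILURE_MODES}
first_failure = st.radio("Point of first failure", ["Turn 1", "Turn 2", "Turn 3", "N/A"])
outcome = st.radio("Conversation outcome", ["Resolved", "Unresolved"])

if st.button("Save"):
    record = {"id": conversations[idx]["id"], "labels": labels,
              "first_failure": first_failure, "outcome": outcome}
    with open("labels.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    st.success("Saved")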
8 hours later: 203 labeled conversations.
Now you have data: 58% resolution rate overall. The top 3 failures account for 68% of issues. Intent drift happens in conversations over 3 turns. Information without action correlates with returns and refunds.
This quantification tells you what to fix and in what order.
Days 4-5: Build LLM-as-Judge Evaluators
Automate evaluation for the top two failure modes.
Evaluator 1: Intent Drift Detector
You are evaluating whether a support bot maintained the customer's
original intent throughout the conversation.
Rules:
- FAIL if the customer states an intent (return, refund, track order)
and the bot later addresses a different intent without explicit
customer redirect
- PASS if bot stays on task or only switches when customer explicitly
changes topic
Examples:
[FAIL: Customer wants return, bot switches to new order]
[PASS: Customer wants return, bot completes return flow]
[PASS: Customer changes mind from return to exchange, bot adapts]
Answer only: PASS or FAIL
Validation (Critical): Split labeled data into train (80%), dev (10%), and test (10%). Run on dev set, measure TPR and TNR. Iterate on prompt until metrics are acceptable. Final test: TPR=0.89, TNR=0.86.
The key: Never evaluate on data you used for prompt engineering. That's like grading yourself on the practice exam.
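A sketch of the validation step, assuming `judge(conversation)` wraps the prompt above and the human labels come from the Day 3 session:

def tpr_tnr(dev_set, judge):
    """dev_set: list of (conversation, human_label) pairs, labels are 'PASS' or 'FAIL'."""
    tp = tn = fp = fn = 0
    for conversation, human_label in dev_set:
        predicted = judge(conversation)  # 'PASS' or 'FAIL' from the LLM judge
        if human_label == "FAIL":        # a real failure counts as a "positive"
            if predicted == "FAIL":
                tp += 1
            else:
                fn += 1
        else:
            if predicted == "PASS":
                tn += 1
            else:
                fp += 1
    return tp / (tp + fn), tn / (tn + fp)

# tpr, tnr = tpr_tnr(dev_set, intent_drift_judge)  # iterate on the prompt until both are acceptable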
When to Use Deterministic vs. LLM Evaluators
Not every evaluation needs an LLM. Choose the right tool for the job:
Use deterministic checks when criteria are objective:
- ✓ Response length (chars/tokens)
- ✓ JSON schema validation
- ✓ Required fields present
- ✓ Regex pattern matching
- ✓ Exact string matching
- ✓ Response time / latency
Advantages: Fast, cheap, fully repeatable, no variance
Use LLM judges when criteria need judgment:
- ✓ Intent preservation
- ✓ Tone appropriateness
- ✓ Factual accuracy (when facts vary)
- ✓ Semantic similarity
- ✓ Helpfulness / completeness
- ✓ Context awareness
Trade-offs: Slower, costs money, needs validation (TPR/TNR)
Use deterministic checks first (fast, cheap failures), then LLM evaluation for nuanced criteria. For example: Check response has required fields (deterministic) → Check if answer is helpful (LLM judge).
Real example from SupportBot:
- ✅ Deterministic: Does response include order number when customer asks for order status?
- ✅ Deterministic: Is response under 500 characters?
- ✅ LLM-judge: Did bot maintain the customer's original intent across the conversation?
- ✅ LLM-judge: Is the tone appropriate for an angry customer?
Start with deterministic when possible. Add LLM judges only for criteria humans can judge but code cannot.
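A sketch of that ordering for SupportBot (the order-number regex and the `llm_judge` callable are assumptions for illustration):

import re

def evaluate_response(customer_message: str, bot_response: str, llm_judge) -> dict:
    results = {}

    # Deterministic checks first: cheap, fast, no variance
    results["under_500_chars"] = len(bot_response) <= 500
    if "order status" in customer_message.lower():
        # Assumes order numbers look like "#7723" in this system
        results["includes_order_number"] = bool(re.search(r"#\d+", bot_response))

    # Only pay for the LLM judge if the cheap checks pass
    if all(results.values()):
        results["intent_preserved"] = llm_judge(customer_message, bot_response) == "PASS"

    return results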
Days 6-7: The Fix
You know what's broken. Now fix it.
1. Intent Tracking
# Add explicit intent tracking
conversation_state = {
    "primary_intent": None,   # Set on turn 1 (e.g. "return")
    "confirmed_actions": [],  # Log confirmed actions (return started, refund issued, ...)
}

# Before each response: if the topic has drifted away from the original intent,
# flag it and ask the customer to confirm before switching
if current_topic != conversation_state["primary_intent"]:
    ...  # flag drift, ask for confirmation
2. Action Prompting
Updated system prompt:
"After providing information, ALWAYS suggest next action:
- If order lost: Offer replacement or refund
- If return window open: Start return process
- If product question: Add to cart or compare"
3. Memory Enhancement
# Track entities the customer has already mentioned
mentioned_entities = {
    "order_number": None,
    "product": None,
    "shipping_address": None,
}

# Never re-ask for info already provided
if mentioned_entities["order_number"] is None:
    ...  # only ask for the order number if we don't already have it
Days 8-9: Re-Evaluation
Replay the original 50 failure conversations through the fixed system.
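A sketch of the replay harness (`run_supportbot` and `evaluate_conversation` are placeholders for whatever your stack exposes; replaying recorded customer turns against a changed bot is an approximation, since real customers would react to the new replies):

import json

def replay(conversations, run_supportbot, evaluate_conversation):
    """Re-run each recorded customer side through the fixed bot, then re-evaluate."""
    results = []
    for convo in conversations:
        customer_turns = [t["text"] for t in convo["turns"] if t["role"] == "customer"]
        new_transcript = run_supportbot(customer_turns)        # fixed system under test
        results.append(evaluate_conversation(new_transcript))  # deterministic + LLM-judge evals
    return results

# results = replay(json.load(open("failure_conversations.json")), run_supportbot, evaluate_conversation)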
Before:
- 58% resolution rate
- Intent drift: 28%
- Info without action: 22%
- Memory failure: 18%
After:
- 79% resolution rate (+21 points!)
- Intent drift: 9% (-19)
- Info without action: 8% (-14)
- Memory failure: 6% (-12)
Domain expert spot-checks 30 conversations: "Way better. Bot actually completes tasks now instead of just chatting."
Day 10: CI Integration
Add to your CI pipeline:
# .github/workflows/eval.yml
- name: Run SupportBot Evaluators
  run: |
    python eval/run_evaluators.py \
      --min-resolution-rate 0.75 \
      --min-tpr 0.85 \
      --min-tnr 0.82
Now every prompt change gets evaluated. If metrics regress, build fails.
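The script behind it can be as simple as thresholds plus a non-zero exit code. A sketch (the metric computation is stubbed out; the flags match the workflow above):

# eval/run_evaluators.py (sketch)
import argparse
import sys

def compute_metrics() -> dict:
    # Assumed: run the deterministic checks and LLM judges over the eval set and
    # return {"resolution_rate": ..., "tpr": ..., "tnr": ...}
    raise NotImplementedError("plug in your eval suite here")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-resolution-rate", type=float, required=True)
    parser.add_argument("--min-tpr", type=float, required=True)
    parser.add_argument("--min-tnr", type=float, required=True)
    args = parser.parse_args()

    metrics = compute_metrics()
    failures = []
    if metrics["resolution_rate"] < args.min_resolution_rate:
        failures.append("resolution rate")
    if metrics["tpr"] < args.min_tpr:
        failures.append("TPR")
    if metrics["tnr"] < args.min_tnr:
        failures.append("TNR")

    if failures:
        print(f"Regression in: {', '.join(failures)}")
        sys.exit(1)  # non-zero exit fails the CI job

    print("All evaluators passed")

if __name__ == "__main__":
    main()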
The Real Lesson: Process Over Tools
What makes this work? The systematic process, not any particular tool or model.
This works for any LLM application:
- Code generation assistants
- Document Q&A systems
- Content moderation
- Data extraction pipelines
- Whatever you're building
The 10-day structure is portable. The specific dimensions change. The process doesn't.
From "It Feels Broken" to "We Fixed These 3 Things"
The Transformation
Before: "SupportBot is unreliable. Sometimes it works, sometimes it doesn't. Not sure why."
After: "SupportBot had a 28% intent drift rate in multi-turn conversations, now reduced to 9%. Resolution rate improved from 58% to 79%. Regression tests prevent backsliding."
The second statement leads to action, budget, and trust.
The first leads to "let's try a different model" (which won't fix the real issues).
Start Here
Don't try to implement all 10 days at once. Start with two hours: sit down with a domain expert and open-code 20 real conversations.
If those 2 hours reveal patterns you didn't know about, keep going.
If they don't, your AI probably isn't complex enough to need this yet.
Remember: You can't fix what you can't measure. You can't measure what you don't understand. Stop guessing. Start measuring.
Disclaimer: This is a pedagogical exercise. All scenarios, companies, and examples are fictional. I'm under NDA for my actual consulting work. If this sounds like your company, it's pure coincidence.
Ready to Build Production AI Agents?
Let's discuss how AI agents can transform your business operations
Book a Strategy Call