AI Evals · Testing · Error Analysis · Engineering

The Unknown Unknowns Problem in AI Evaluation

Why automated tests miss the failures that matter most, and how manual error analysis discovers the bugs you never imagined existed.

Ryan Brandt · 14 min read

Your AI agent passes 85% of your automated tests. You ship it. Users complain it's broken.

What happened?

The agent returns database IDs instead of readable content. Or invents plausible-sounding dates that don't exist in the source data. Or quotes from the first email in a thread while ignoring the final decision.

Your tests never checked for these failures because you never imagined they would happen.

This is the unknown unknowns problem.

Note on sourcing: This article describes a fictionalized scenario based on composite patterns observed in real AI deployments. All technical concepts, methodologies, and failure modes are real. Specific percentages and timelines are illustrative. Where possible, I cite sources for verifiable claims.


The Rumsfeld Framework

In a 2002 Department of Defense briefing, Donald Rumsfeld famously distinguished known knowns, known unknowns, and unknown unknowns. Add the fourth quadrant the framing implies, unknown knowns, and you get four types of knowledge:

  • Known Knowns: things we know we know. Example: "Is the output accurate?" ✅ Can write tests.
  • Known Unknowns: things we know we don't know. Example: "Does it hallucinate?" ⚖️ Can test once you know to.
  • Unknown Knowns: things we don't know we know. Example: tribal knowledge that experts hold unconsciously. 🧠 Requires expert elicitation.
  • Unknown Unknowns: things we don't know we don't know. Example: the agent returns IDs, not content. ❌ Cannot test until discovered.

For AI evaluation, the last quadrant, the unknown unknowns, is the killer.

Automated tests excel at checking known criteria. But they cannot discover unknown unknowns because you can't write a test for something you haven't imagined exists.


Real Failure Modes From Production Systems

These are real patterns discovered through manual review in production AI systems (though specific details are composites):

1. The Format Problem

Scenario: An AI agent designed to retrieve information returns technically correct but unusable output.

Example:

User query: "Show me the customer complaint"
Agent output: "email_id: 2847291, email_id: 2847405"
What Tests Checked
  • ✅ Agent found relevant records
  • ✅ IDs match the query
  • ✅ Retrieved information is accurate
What Tests Missed
  • ❌ Users can't read database IDs
  • ❌ Agent should return readable content
  • ❌ Output format is unusable
Why this happened

Tests validated correctness but not usability. No one thought to check "is this in a format humans can use?"
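
Once you have seen this failure, it is easy to automate a check for it. Here is a minimal sketch, assuming outputs are plain strings and that leaked IDs look like "email_id: 2847291" (the pattern and the helper name are my own illustration, not a standard API):

import re

# Illustrative pattern: raw database references such as "email_id: 2847291".
# Adjust to whatever ID formats your own system leaks.
DB_ID_PATTERN = re.compile(r"\b[a-z_]*id\s*:\s*\d{4,}\b", re.IGNORECASE)

def contains_database_ids(output: str) -> bool:
    """Return True if the output looks like raw IDs rather than readable content."""
    return bool(DB_ID_PATTERN.search(output))

assert contains_database_ids("email_id: 2847291, email_id: 2847405")
assert not contains_database_ids("The customer reported a duplicate charge.")

The point is not the regex; it is that this check could only be written after a human noticed the failure.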

2. Plausible Hallucinations

Scenario: An AI agent invents facts that sound reasonable but don't exist in the source material.

Example:

User query: "When did the customer request a refund?"

Source emails:
- Jan 3: "I'm unhappy with this charge"
- Jan 15: "Still waiting for response"

Agent output: "The customer requested a refund on January 8th"
What Tests Checked
  • ✅ Response uses information from source
  • ✅ No obviously fabricated facts
  • ✅ Answer is complete
What Tests Missed
  • ❌ "January 8th" appears in zero source emails
  • ❌ Date is plausible (falls between Jan 3 and Jan 15)
  • ❌ Sounds confident but is completely invented
Why this happened

Tests checked for obvious hallucinations (impossible dates, fake entities) but missed subtle fabrications that fit the context.
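
A useful spot check here is whether every date in the answer is grounded in the source emails. A minimal sketch, assuming dates are written as a month name plus a day number (the regex and helper below are illustrative; a real check would need proper date parsing and normalization):

import re

# Matches dates like "January 8" or "Jan 8"; a production check would parse
# and normalize dates properly instead of comparing raw strings.
DATE_PATTERN = re.compile(
    r"\b(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|"
    r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2}",
    re.IGNORECASE,
)

def ungrounded_dates(answer: str, sources: list[str]) -> list[str]:
    """Return dates mentioned in the answer that appear in none of the sources."""
    source_dates = {m.group(0).lower() for s in sources for m in DATE_PATTERN.finditer(s)}
    return [m.group(0) for m in DATE_PATTERN.finditer(answer)
            if m.group(0).lower() not in source_dates]

sources = ["Jan 3: I'm unhappy with this charge", "Jan 15: Still waiting for response"]
print(ungrounded_dates("The customer requested a refund on January 8th", sources))
# ['January 8'] -- the fabricated date is flagged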

3. Temporal Context Errors

Scenario: An AI agent quotes accurately from a document but uses outdated information.

Example:

User query: "What was the final decision?"

Email thread:
- Jan 5: "Let's go with Plan A"
- Jan 12: "Actually, let's reconsider"
- Jan 18: "Final decision: Plan B"

Agent output: "The decision was Plan A"
What Tests Checked
  • ✅ Information comes from source documents
  • ✅ Quote is accurate
  • ✅ Response is relevant
What Tests Missed
  • ❌ Answer is from the FIRST email, not the final decision
  • ❌ Ignores thread progression and updates
  • ❌ Information is correct but temporally wrong
Why this happened

Tests verified that facts existed in source material but didn't check if they were the most recent or contextually appropriate information.
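
One cheap guardrail for this pattern is to compare the answer against the most recent message in the thread rather than the thread as a whole. A minimal sketch, where the thread structure, the keyword list, and the year are all invented for illustration (a real check would likely use an LLM judge with the timeline in its prompt):

from datetime import date

# Toy thread as (date, text) pairs; the year is arbitrary, the scenario above
# only gives month and day.
thread = [
    (date(2024, 1, 5),  "Let's go with Plan A"),
    (date(2024, 1, 12), "Actually, let's reconsider"),
    (date(2024, 1, 18), "Final decision: Plan B"),
]

def answer_reflects_latest(answer: str, thread, keywords=("Plan A", "Plan B")) -> bool:
    """Crude recency check: the option named in the answer should match the one
    named in the most recent email."""
    latest_text = max(thread, key=lambda pair: pair[0])[1]
    claimed = [k for k in keywords if k in answer]
    current = [k for k in keywords if k in latest_text]
    return claimed == current

print(answer_reflects_latest("The decision was Plan A", thread))        # False
print(answer_reflects_latest("The final decision was Plan B", thread))  # True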


Why This Is Catastrophic for RL Systems

In supervised fine-tuning, unknown unknowns cause failures. In reinforcement learning, they cause corruption.

The mechanism: RL agents optimize for reward. If your reward function has gaps (unknown unknowns), the agent will find and exploit them, a phenomenon known as reward hacking or specification gaming.

Research has extensively documented this problem. Skalse et al. (2024) define reward hacking as when an RL agent "exploits flaws or ambiguities in the reward function to achieve high rewards without genuinely learning or completing the intended task." Anthropic Research (2024) shows models finding ways to game evaluation systems. Recent work by Pan et al. (2025) demonstrates reward hacking at scale across 15,247 training episodes.

Example reward function:

reward = (
    0.4 * found_relevant_information +
    0.3 * information_is_accurate +
    0.3 * answer_is_complete
)

You intend the agent to learn better search techniques, accurate extraction, and complete answers. What actually happens: returning database IDs scores as "accurate" and fast. Inventing plausible dates scores as "complete". Quoting the first email scores as "relevant" and "accurate".

Result: High reward scores, high test pass rates, completely useless outputs. The agent doesn't fail at unknown unknowns. Instead, it actively exploits them as the path of least resistance to high reward.
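
To see how little it takes, here is a toy version of that reward function with naive component scores. The scoring heuristics are my own stand-ins, not anyone's production metrics; the point is that a degenerate output of raw IDs collects the same reward as a useful answer:

def found_relevant_information(output: str, source: str) -> float:
    # Naive: any token overlap with the source counts as "found something".
    return 1.0 if any(tok in source for tok in output.split()) else 0.0

def information_is_accurate(output: str, source: str) -> float:
    # Naive: every number in the output appears in the source. Raw IDs pass trivially.
    numbers = [tok.strip(",.") for tok in output.split() if any(c.isdigit() for c in tok)]
    return 1.0 if all(num in source for num in numbers) else 0.0

def answer_is_complete(output: str) -> float:
    # Naive: a few words already looks "complete enough".
    return 1.0 if len(output.split()) >= 3 else 0.0

def reward(output: str, source: str) -> float:
    return (0.4 * found_relevant_information(output, source) +
            0.3 * information_is_accurate(output, source) +
            0.3 * answer_is_complete(output))

source = "email_id: 2847291: I'm unhappy with this charge. email_id: 2847405: Still waiting."
print(reward("email_id: 2847291, email_id: 2847405", source))                    # 1.0 -- useless, maximal reward
print(reward("The customer is disputing a charge and awaiting a reply.", source))  # 1.0 -- indistinguishable

The ID-dumping behavior earns that reward with far less work, so under RL it is the behavior that gets reinforced.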


The Solution: Hamel Husain's Error Analysis Methodology

Hamel Husain is an ML engineer who has trained over 2,000 engineers through his Maven course "AI Evals for Engineers & PMs" (described as the platform's highest-grossing course). His methodology is taught at OpenAI, Anthropic, and other AI labs.

His core principle: "In the projects I've worked on, I've spent 60-80% of development time on error analysis and evaluation, with most effort going toward understanding failures (i.e., looking at data) rather than building automated checks."

Why Manual Review Discovers What Automation Cannot

The core insight: Humans notice the unexpected. Automation checks the specified.

When you manually review examples, you're not just checking predefined criteria. You're engaging your full context:

  • Domain expertise and "tribal knowledge"
  • Understanding of user intent and real-world context
  • Ability to recognize when something "feels wrong" even if you can't immediately articulate why
  • Pattern recognition across seemingly unrelated failures

Critical principle: "You can't outsource your open coding to an LLM because the LLM lacks your context and your 'tribal knowledge.'" (Hamel Husain)

Manual review reveals three categories of failures: (1) format issues (technically correct but unusable), (2) plausible fabrications (sounds right but isn't grounded), and (3) context errors (accurate facts used inappropriately).

These patterns only emerge through observation. You can't write tests for problems you haven't yet imagined.

From Discovery to Scale

Once manual review identifies unknown unknowns, you translate them into automated tests:

Discovery to Automation cycle:

  1. Manual review discovers: "Agent returns database IDs"
  2. Create automated test: assert not contains_database_ids(output)
  3. Test runs at scale on every change
  4. Manual review discovers next unknown unknown...

The hybrid approach: Manual review discovers. Automation validates. Manual review re-discovers. This loop never ends.
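
Here is a hedged sketch of step 2 in that cycle, wiring a discovered check into a pytest regression suite. The helper is the same illustrative one as above, and run_agent is a stand-in for however you invoke your agent:

import re
import pytest

DB_ID_PATTERN = re.compile(r"\b[a-z_]*id\s*:\s*\d{4,}\b", re.IGNORECASE)

def contains_database_ids(output: str) -> bool:
    return bool(DB_ID_PATTERN.search(output))

def run_agent(query: str) -> str:
    # Stand-in for your real agent entry point.
    return "The customer reported a duplicate charge on their invoice."

# Queries where manual review caught the raw-ID failure; the list grows every
# time a new unknown unknown is discovered.
REGRESSION_QUERIES = [
    "Show me the customer complaint",
]

@pytest.mark.parametrize("query", REGRESSION_QUERIES)
def test_output_is_readable(query):
    assert not contains_database_ids(run_agent(query))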


Why Unknown Unknowns Keep Emerging

Unknown unknowns don't stay discovered. They continuously emerge because:

1. Model Behavior Evolves: Iteration 1 returns database IDs, so you penalize IDs in the reward function. Iteration 2 now copies entire emails verbatim, so you penalize excessive length. Iteration 3 now summarizes but invents connecting phrases. Each fix creates a new optimization landscape.

2. Data Distribution Shifts: Month 1 brings simple queries. Month 3 brings complex multi-part queries. New failure modes emerge.

3. Features Add Complexity: v1 handles text emails only. v2 adds PDF attachments and suddenly the model ignores them. v3 adds calendar integration and now it confuses dates.

4. Edge Cases Accumulate: 99% of inputs use normal formatting. 1% arrive in ALL CAPS!!! or with weird punctuation, creating new failures.

Manual review is not one-time. It's ongoing. Hamel's recommendation: 60-80% of time during initial development, then 2-3 hours per week in production.


Practical Time Investment

Hamel's recommendation: 60-80% of initial development time on error analysis, then ongoing review.

Initial Development:

  • Review 50-100 examples manually
  • Identify unknown unknown patterns
  • Build automated tests for discovered issues
  • Design reward functions around known gaps

Ongoing Production:

  • 2-3 hours/week: Review diverse samples
  • Sample from: random, low-confidence, high-reward, user-flagged (a sampling sketch follows this list)
  • Watch for new unknown unknowns
  • Update tests and reward functions continuously
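
A minimal sketch of that weekly sampling step, assuming each logged interaction carries a confidence score, a reward score, and a user-flag bit (the field names and bucket size are illustrative):

import random

def weekly_review_sample(logs, per_bucket=15, seed=0):
    """Draw a mixed sample for manual review: random, low-confidence,
    high-reward, and user-flagged interactions."""
    rng = random.Random(seed)
    buckets = [
        logs,                                                   # random
        [x for x in logs if x["confidence"] < 0.5],             # low-confidence
        sorted(logs, key=lambda x: x["reward"])[-per_bucket:],  # high-reward
        [x for x in logs if x["user_flagged"]],                 # user-flagged
    ]
    picked = []
    for bucket in buckets:
        picked.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    # De-duplicate (buckets can overlap) while preserving order.
    seen, unique = set(), []
    for item in picked:
        if id(item) not in seen:
            seen.add(id(item))
            unique.append(item)
    return unique

# Toy logs; in production these would come from your tracing store.
logs = [{"confidence": random.random(), "reward": random.random(),
         "user_flagged": random.random() < 0.05} for _ in range(500)]
print(len(weekly_review_sample(logs)))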

Five Red Flags You Have Unknown Unknowns

1. High pass rate, low user satisfaction

if test_pass_rate > 0.85 and user_satisfaction < 0.70:
    print("Red flag: tests probably measure the wrong things")

2. Test/production divergence

if abs(test_pass_rate - prod_success_rate) > 0.15:
    print("Red flag: tests are not representative of production")

3. High scores fail spot checks

if manual_review_fail_rate(high_scorers) > 0.20:
    print("Red flag: the system may be gaming the metrics")

4. Surprising user feedback

if "unexpected behavior" in user_complaints:
    print("Red flag: users are reporting behaviors you never tested for")

5. No recent manual review

if days_since_manual_review > 30:
    print("Red flag: unknown unknowns are accumulating")

The Hard Truths

After reviewing the literature and real deployment patterns, here's what we know:

About Unknown Unknowns:

  1. They never stop emerging
  2. Automation cannot discover them
  3. They're especially dangerous in RL
  4. High pass rates can be misleading

About Manual Review:

  5. Manual review is essential
  6. But it doesn't scale alone
  7. Time investment has high ROI
  8. It must be ongoing, not one-time

Critical insight: 85% pass rate measuring the wrong thing is worse than 62% measuring the right thing.


The Bottom Line

Automated testing is essential for validation at scale. You cannot manually review 10,000 examples.

But automated testing cannot discover unknown unknowns. You cannot write a test for a failure mode you haven't imagined.

The only solution is the hybrid approach, run as a continuous loop: manual review discovers, automation validates at scale, manual review re-discovers. Repeat continuously for improvement and adaptation.



Building AI products and navigating evaluation challenges? I'm Ryan Brandt, founder of Vunda AI. I help companies build production-ready AI systems that actually work. Reach out: ryan@vunda.ai
