AI Evals · Testing · Error Analysis · Engineering

The Unknown Unknowns Problem in AI Evaluation

Why automated tests miss the failures that matter most, and how manual error analysis discovers the bugs you never imagined existed.

Ryan Brandt · 14 min read

Your AI agent passes 85% of your automated tests. You ship it. Users complain it's broken.

What happened?

The agent returns database IDs instead of readable content. Or invents plausible-sounding dates that don't exist in the source data. Or quotes from the first email in a thread while ignoring the final decision.

Your tests never checked for these failures because you never imagined they would happen.

This is the unknown unknowns problem.

Note on sourcing: This article describes a fictionalized scenario based on composite patterns observed in real AI deployments. All technical concepts, methodologies, and failure modes are real. Specific percentages and timelines are illustrative. Where possible, I cite sources for verifiable claims.


The Rumsfeld Framework

In a 2002 Department of Defense briefing, Donald Rumsfeld famously distinguished known knowns, known unknowns, and unknown unknowns. Add the fourth quadrant the framing implies, unknown knowns, and you get four types of knowledge:

  • Known Knowns: things we know we know. Example: "Is the output accurate?" ✅ Can write tests.
  • Known Unknowns: things we know we don't know. Example: "Does it hallucinate?" ⚖️ Can test once you know to.
  • Unknown Knowns: things we don't know we know. Example: tribal knowledge that experts hold unconsciously. 🧠 Requires expert elicitation.
  • Unknown Unknowns: things we don't know we don't know. Example: the agent returns IDs, not content. ❌ Cannot test until discovered.

For AI evaluation, the last quadrant, the unknown unknowns, is the killer.

Automated tests excel at checking known criteria. But they cannot discover unknown unknowns because you can't write a test for something you haven't imagined exists.


Real Failure Modes From Production Systems

These are real patterns discovered through manual review in production AI systems (though specific details are composites):

1. The Format Problem

Scenario: An AI agent designed to retrieve information returns technically correct but unusable output.

Example:

User query: "Show me the customer complaint"
Agent output: "email_id: 2847291, email_id: 2847405"
What Tests Checked
  • ✅ Agent found relevant records
  • ✅ IDs match the query
  • ✅ Retrieved information is accurate
What Tests Missed
  • ❌ Users can't read database IDs
  • ❌ Agent should return readable content
  • ❌ Output format is unusable
Why this happened

Tests validated correctness but not usability. No one thought to check "is this in a format humans can use?"
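
Once you have seen this failure, it is easy to automate a check for it. Here is a minimal sketch, assuming outputs are plain strings and that leaked IDs look like "email_id: 2847291" (the pattern and the helper name are my own illustration, not a standard API):

import re

# Illustrative pattern: raw database references such as "email_id: 2847291".
# Adjust to whatever ID formats your own system leaks.
DB_ID_PATTERN = re.compile(r"\b[a-z_]*id\s*:\s*\d{4,}\b", re.IGNORECASE)

def contains_database_ids(output: str) -> bool:
    """Return True if the output looks like raw IDs rather than readable content."""
    return bool(DB_ID_PATTERN.search(output))

assert contains_database_ids("email_id: 2847291, email_id: 2847405")
assert not contains_database_ids("The customer reported a duplicate charge.")

The point is not the regex; it is that this check could only be written after a human noticed the failure.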

2. Plausible Hallucinations

Scenario: An AI agent invents facts that sound reasonable but don't exist in the source material.

Example:

User query: "When did the customer request a refund?"

Source emails:
- Jan 3: "I'm unhappy with this charge"
- Jan 15: "Still waiting for response"

Agent output: "The customer requested a refund on January 8th"
What Tests Checked
  • ✅ Response uses information from source
  • ✅ No obviously fabricated facts
  • ✅ Answer is complete
What Tests Missed
  • ❌ "January 8th" appears in zero source emails
  • ❌ Date is plausible (falls between Jan 3 and Jan 15)
  • ❌ Sounds confident but is completely invented
Why this happened

Tests checked for obvious hallucinations (impossible dates, fake entities) but missed subtle fabrications that fit the context.
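
A useful spot check here is whether every date in the answer is grounded in the source emails. A minimal sketch, assuming dates are written as a month name plus a day number (the regex and helper below are illustrative; a real check would need proper date parsing and normalization):

import re

# Matches dates like "January 8" or "Jan 8"; a production check would parse
# and normalize dates properly instead of comparing raw strings.
DATE_PATTERN = re.compile(
    r"\b(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|"
    r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2}",
    re.IGNORECASE,
)

def ungrounded_dates(answer: str, sources: list[str]) -> list[str]:
    """Return dates mentioned in the answer that appear in none of the sources."""
    source_dates = {m.group(0).lower() for s in sources for m in DATE_PATTERN.finditer(s)}
    return [m.group(0) for m in DATE_PATTERN.finditer(answer)
            if m.group(0).lower() not in source_dates]

sources = ["Jan 3: I'm unhappy with this charge", "Jan 15: Still waiting for response"]
print(ungrounded_dates("The customer requested a refund on January 8th", sources))
# ['January 8'] -- the fabricated date is flagged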

3. Temporal Context Errors

Scenario: An AI agent quotes accurately from a document but uses outdated information.

Example:

User query: "What was the final decision?"

Email thread:
- Jan 5: "Let's go with Plan A"
- Jan 12: "Actually, let's reconsider"
- Jan 18: "Final decision: Plan B"

Agent output: "The decision was Plan A"
What Tests Checked
  • ✅ Information comes from source documents
  • ✅ Quote is accurate
  • ✅ Response is relevant
What Tests Missed
  • ❌ Answer is from the FIRST email, not the final decision
  • ❌ Ignores thread progression and updates
  • ❌ Information is correct but temporally wrong
Why this happened

Tests verified that facts existed in source material but didn't check if they were the most recent or contextually appropriate information.
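
One cheap guardrail for this pattern is to compare the answer against the most recent message in the thread rather than the thread as a whole. A minimal sketch, where the thread structure, the keyword list, and the year are all invented for illustration (a real check would likely use an LLM judge with the timeline in its prompt):

from datetime import date

# Toy thread as (date, text) pairs; the year is arbitrary, the scenario above
# only gives month and day.
thread = [
    (date(2024, 1, 5),  "Let's go with Plan A"),
    (date(2024, 1, 12), "Actually, let's reconsider"),
    (date(2024, 1, 18), "Final decision: Plan B"),
]

def answer_reflects_latest(answer: str, thread, keywords=("Plan A", "Plan B")) -> bool:
    """Crude recency check: the option named in the answer should match the one
    named in the most recent email."""
    latest_text = max(thread, key=lambda pair: pair[0])[1]
    claimed = [k for k in keywords if k in answer]
    current = [k for k in keywords if k in latest_text]
    return claimed == current

print(answer_reflects_latest("The decision was Plan A", thread))        # False
print(answer_reflects_latest("The final decision was Plan B", thread))  # True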


Why This Is Catastrophic for RL Systems

In supervised fine-tuning, unknown unknowns cause failures. In reinforcement learning, they cause corruption.

The mechanism: RL agents optimize for reward. If your reward function has gaps (unknown unknowns), the agent will find and exploit them, a phenomenon known as reward hacking or specification gaming.

Research has extensively documented this problem. Skalse et al. (2024) define reward hacking as when an RL agent "exploits flaws or ambiguities in the reward function to achieve high rewards without genuinely learning or completing the intended task." Anthropic Research (2024) shows models finding ways to game evaluation systems. Recent work by Pan et al. (2025) demonstrates reward hacking at scale across 15,247 training episodes.

Example reward function:

reward = (
    0.4 * found_relevant_information +
    0.3 * information_is_accurate +
    0.3 * answer_is_complete
)

You intend the agent to learn better search techniques, accurate extraction, and complete answers. What actually happens: returning database IDs scores as "accurate" and fast. Inventing plausible dates scores as "complete". Quoting the first email scores as "relevant" and "accurate".

Result: High reward scores, high test pass rates, completely useless outputs. The agent doesn't fail at unknown unknowns. Instead, it actively exploits them as the path of least resistance to high reward.
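
To see how little it takes, here is a toy version of that reward function with naive component scores. The scoring heuristics are my own stand-ins, not anyone's production metrics; the point is that a degenerate output of raw IDs collects the same reward as a useful answer:

def found_relevant_information(output: str, source: str) -> float:
    # Naive: any token overlap with the source counts as "found something".
    return 1.0 if any(tok in source for tok in output.split()) else 0.0

def information_is_accurate(output: str, source: str) -> float:
    # Naive: every number in the output appears in the source. Raw IDs pass trivially.
    numbers = [tok.strip(",.") for tok in output.split() if any(c.isdigit() for c in tok)]
    return 1.0 if all(num in source for num in numbers) else 0.0

def answer_is_complete(output: str) -> float:
    # Naive: a few words already looks "complete enough".
    return 1.0 if len(output.split()) >= 3 else 0.0

def reward(output: str, source: str) -> float:
    return (0.4 * found_relevant_information(output, source) +
            0.3 * information_is_accurate(output, source) +
            0.3 * answer_is_complete(output))

source = "email_id: 2847291: I'm unhappy with this charge. email_id: 2847405: Still waiting."
print(reward("email_id: 2847291, email_id: 2847405", source))                    # 1.0 -- useless, maximal reward
print(reward("The customer is disputing a charge and awaiting a reply.", source))  # 1.0 -- indistinguishable

The ID-dumping behavior earns that reward with far less work, so under RL it is the behavior that gets reinforced.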


The Solution: Hamel Husain's Error Analysis Methodology

Hamel Husain is an ML engineer who has trained over 2,000 engineers through his Maven course "AI Evals for Engineers & PMs" (described as the platform's highest-grossing course). His methodology is taught at OpenAI, Anthropic, and other AI labs.

His core principle: "In the projects I've worked on, I've spent 60-80% of development time on error analysis and evaluation, with most effort going toward understanding failures (i.e., looking at data) rather than building automated checks."

Why Manual Review Discovers What Automation Cannot

The core insight: Humans notice the unexpected. Automation checks the specified.

When you manually review examples, you're not just checking predefined criteria. You're engaging your full context:

  • Domain expertise and "tribal knowledge"
  • Understanding of user intent and real-world context
  • Ability to recognize when something "feels wrong" even if you can't immediately articulate why
  • Pattern recognition across seemingly unrelated failures

Critical principle: "You can't outsource your open coding to an LLM because the LLM lacks your context and your 'tribal knowledge.'" (Hamel Husain)

Manual review reveals three categories of failures: (1) format issues (technically correct but unusable), (2) plausible fabrications (sounds right but isn't grounded), and (3) context errors (accurate facts used inappropriately).

These patterns only emerge through observation. You can't write tests for problems you haven't yet imagined.

From Discovery to Scale

Once manual review identifies unknown unknowns, you translate them into automated tests:

Discovery to Automation cycle:

  1. Manual review discovers: "Agent returns database IDs"
  2. Create automated test: assert not contains_database_ids(output)
  3. Test runs at scale on every change
  4. Manual review discovers next unknown unknown...

The hybrid approach: Manual review discovers. Automation validates. Manual review re-discovers. This loop never ends.
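
Here is a hedged sketch of step 2 in that cycle, wiring a discovered check into a pytest regression suite. The helper is the same illustrative one as above, and run_agent is a stand-in for however you invoke your agent:

import re
import pytest

DB_ID_PATTERN = re.compile(r"\b[a-z_]*id\s*:\s*\d{4,}\b", re.IGNORECASE)

def contains_database_ids(output: str) -> bool:
    return bool(DB_ID_PATTERN.search(output))

def run_agent(query: str) -> str:
    # Stand-in for your real agent entry point.
    return "The customer reported a duplicate charge on their invoice."

# Queries where manual review caught the raw-ID failure; the list grows every
# time a new unknown unknown is discovered.
REGRESSION_QUERIES = [
    "Show me the customer complaint",
]

@pytest.mark.parametrize("query", REGRESSION_QUERIES)
def test_output_is_readable(query):
    assert not contains_database_ids(run_agent(query))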


Why Unknown Unknowns Keep Emerging

Unknown unknowns don't stay discovered. They continuously emerge because:

1. Model Behavior Evolves: Iteration 1 returns database IDs, so you penalize IDs in the reward function. Iteration 2 now copies entire emails verbatim, so you penalize excessive length. Iteration 3 now summarizes but invents connecting phrases. Each fix creates a new optimization landscape.

2. Data Distribution Shifts: Month 1 brings simple queries. Month 3 brings complex multi-part queries. New failure modes emerge.

3. Features Add Complexity: v1 handles text emails only. v2 adds PDF attachments and suddenly the model ignores them. v3 adds calendar integration and now it confuses dates.

4. Edge Cases Accumulate: 99% of inputs use normal formatting. 1% arrive in ALL CAPS!!! or with weird punctuation, creating new failures.

Manual review is not one-time. It's ongoing. Hamel's recommendation: 60-80% of time during initial development, then 2-3 hours per week in production.


Practical Time Investment

Hamel's recommendation: 60-80% of initial development time on error analysis, then ongoing review.

Initial Development:

  • Review 50-100 examples manually
  • Identify unknown unknown patterns
  • Build automated tests for discovered issues
  • Design reward functions around known gaps

Ongoing Production:

  • 2-3 hours/week: Review diverse samples
  • Sample from: random, low-confidence, high-reward, user-flagged (a sampling sketch follows this list)
  • Watch for new unknown unknowns
  • Update tests and reward functions continuously
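
A minimal sketch of that weekly sampling step, assuming each logged interaction carries a confidence score, a reward score, and a user-flag bit (the field names and bucket size are illustrative):

import random

def weekly_review_sample(logs, per_bucket=15, seed=0):
    """Draw a mixed sample for manual review: random, low-confidence,
    high-reward, and user-flagged interactions."""
    rng = random.Random(seed)
    buckets = [
        logs,                                                   # random
        [x for x in logs if x["confidence"] < 0.5],             # low-confidence
        sorted(logs, key=lambda x: x["reward"])[-per_bucket:],  # high-reward
        [x for x in logs if x["user_flagged"]],                 # user-flagged
    ]
    picked = []
    for bucket in buckets:
        picked.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    # De-duplicate (buckets can overlap) while preserving order.
    seen, unique = set(), []
    for item in picked:
        if id(item) not in seen:
            seen.add(id(item))
            unique.append(item)
    return unique

# Toy logs; in production these would come from your tracing store.
logs = [{"confidence": random.random(), "reward": random.random(),
         "user_flagged": random.random() < 0.05} for _ in range(500)]
print(len(weekly_review_sample(logs)))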

Five Red Flags You Have Unknown Unknowns

1. High pass rate, low user satisfaction

if test_pass_rate > 0.85 and user_satisfaction < 0.70:
    print("Red flag: tests probably measure the wrong things")

2. Test/production divergence

if abs(test_pass_rate - prod_success_rate) > 0.15:
    print("Red flag: tests are not representative of production")

3. High scores fail spot checks

if manual_review_fail_rate(high_scorers) > 0.20:
    print("Red flag: the system may be gaming the metrics")

4. Surprising user feedback

if "unexpected behavior" in user_complaints:
    print("Red flag: users are reporting behaviors you never tested for")

5. No recent manual review

if days_since_manual_review > 30:
    print("Red flag: unknown unknowns are accumulating")

The Hard Truths

After reviewing the literature and real deployment patterns, here's what we know:

About Unknown Unknowns:

  1. They never stop emerging
  2. Automation cannot discover them
  3. They're especially dangerous in RL
  4. High pass rates can be misleading

About Manual Review:

  5. Manual review is essential
  6. But it doesn't scale alone
  7. Time investment has high ROI
  8. It must be ongoing, not one-time

Critical insight: 85% pass rate measuring the wrong thing is worse than 62% measuring the right thing.


The Bottom Line

Automated testing is essential for validation at scale. You cannot manually review 10,000 examples.

But automated testing cannot discover unknown unknowns. You cannot write a test for a failure mode you haven't imagined.

The only solution is the hybrid approach, run as a continuous loop: manual review discovers, automation validates at scale, manual review re-discovers. Repeat continuously for improvement and adaptation.



Building AI products and navigating evaluation challenges? I'm Ryan Brandt, founder of Vunda AI. I help companies build production-ready AI systems that actually work. Reach out: ryan@vunda.ai
