AI · Machine Learning · Reasoning · Efficiency · Research

The $500 AI That Just Beat Gemini at Abstract Reasoning

Samsung's 7-million-parameter model outperforms giants on ARC-AGI-2. As the lead contributor to that benchmark, I'll explain why this matters and what it means for the future of AI.

Ryan Brandt · 7 min read

On October 6, 2025, a Samsung researcher named Alexia Jolicoeur-Martineau quietly dropped a paper on arXiv that made me do a double-take.

The paper introduces TRM (Tiny Recursive Model), a 7-million parameter model that scores 7.8% on ARC-AGI-2. For context, Gemini 2.5 Pro scores 4.9%.

Full disclosure: I was the lead contributor to the ARC-AGI 2 testing repository, the benchmark TRM is crushing. When I saw these results, I immediately pulled up the paper. What Samsung accomplished here is genuinely remarkable.

Here's the wild part: TRM cost under $500 to train (estimated based on reported compute requirements) and uses 95,000x fewer parameters than models like DeepSeek-R1 (7M vs 671B parameters). It trained in two days on four GPUs.

Different approaches serve different purposes. Scaling has worked incredibly well. GPT-4, Claude, and Gemini are extraordinary achievements. But TRM shows there's another path worth exploring for certain problem classes, and the implications are fascinating.


The Setup: What TRM Actually Does

Let me start with what TRM is not: a general-purpose language model. It won't write you an essay or have a conversation. It doesn't replace GPT-4 or Claude. It doesn't have world knowledge.

What TRM is: A specialized reasoning system designed for abstract problem-solving tasks like puzzles, mazes, and logical reasoning challenges.

The approach is elegantly simple. Instead of scaling up parameters, TRM uses iterative refinement: (1) make an initial guess, (2) critique yourself and ask "how can I improve this?", (3) refine based on reflection, and (4) repeat up to 16 times, getting better each round.

Think about how you solve a Sudoku puzzle. You don't immediately know every answer. You place a few digits, notice patterns, catch mistakes, and gradually converge on the solution.

That's TRM. A tiny 2-layer neural network that recursively refines its reasoning.
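
To make that loop concrete, here's a minimal sketch of the refinement pattern in PyTorch. This is my own illustration, not the paper's exact architecture (the real TRM also carries a latent reasoning state and trains with deep supervision), and every name in it is invented for the example.

```python
import torch
import torch.nn as nn

class TinyRefiner(nn.Module):
    """A small two-layer network that proposes an update to the current answer."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, dim),  # sees the problem and the current answer
            nn.ReLU(),
            nn.Linear(dim, dim),      # proposes a refined answer embedding
        )

    def forward(self, problem: torch.Tensor, answer: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([problem, answer], dim=-1))

def solve(problem: torch.Tensor, model: TinyRefiner, n_steps: int = 16) -> torch.Tensor:
    """Start from a blank guess and refine it repeatedly, TRM-style."""
    answer = torch.zeros_like(problem)            # (1) initial guess
    for _ in range(n_steps):                      # (4) repeat up to 16 times
        answer = answer + model(problem, answer)  # (2)+(3) critique and refine
    return answer

model = TinyRefiner(dim=128)
refined = solve(torch.randn(1, 128), model)
print(refined.shape)  # torch.Size([1, 128])
```

The point is the shape of the computation: a tiny network applied over and over, spending compute on repetition instead of on parameters.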


Why ARC-AGI-2 Is the Perfect Test

When we built ARC-AGI 2, we had one goal: create a benchmark that tests reasoning, not memorization.

Most benchmarks eventually get "solved" by bigger models that learn patterns in the training data. ARC-AGI is different. It tests:

  • Fluid intelligence: reasoning about completely novel problems
  • Pattern recognition: identifying abstract visual patterns
  • Rule application: applying discovered rules to new situations
  • Generalization: transferring reasoning without memorization
  • Novel problems: handling completely unfamiliar challenges

These are tasks where humans excel but most AI struggles. I've watched billion-dollar models get single-digit scores. State-of-the-art systems miss patterns a five-year-old would spot.
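
To make that concrete, here's a toy task in the public ARC-AGI JSON format (grids of color codes 0 through 9, a few demonstration pairs, and a test input). The specific puzzle and its "reverse each row" rule are invented for illustration; real ARC-AGI-2 tasks are far harder and never reuse a memorizable rule.

```python
# A made-up toy task in the public ARC-AGI JSON format: grids of color codes 0-9,
# a few train input/output pairs, and a test input whose rule must be inferred.
toy_task = {
    "train": [
        {"input": [[1, 0, 0]], "output": [[0, 0, 1]]},
        {"input": [[2, 3, 0]], "output": [[0, 3, 2]]},
    ],
    "test": [
        {"input": [[5, 0, 7]]},  # solver must infer the rule and answer [[7, 0, 5]]
    ],
}

def apply_inferred_rule(grid):
    """The rule a human infers from the train pairs: reverse each row."""
    return [row[::-1] for row in grid]

print(apply_inferred_rule(toy_task["test"][0]["input"]))  # [[7, 0, 5]]
```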

So when I saw a 7-million-parameter model outperforming Gemini 2.5 Pro and o3-mini-high on the benchmark I helped create? Yeah. That got my attention.


The Results (And Why They Matter)

TRM crushed the benchmarks:

  • 87.4% on Sudoku-Extreme (previous best was 55%)
  • 85.3% on Maze-Hard
  • 44.6% on ARC-AGI-1 (competitive with large models)
  • 7.8% on ARC-AGI-2, versus Gemini 2.5 Pro's 4.9% and o3-mini-high's 3.0%

Sources: Benchmark scores from Jolicoeur-Martineau (2025), "Less is More: Recursive Reasoning with Tiny Networks" (arXiv:2510.04871). Comparison model scores reported in the paper's Table 1 based on public evaluations. Note: At paper publication (Oct 2025), Claude had not yet released ARC-AGI-2 scores; Claude Opus 4 later scored 8.6%.

Having spent months working on ARC-AGI 2, I can tell you: these scores are significant.

The benchmark resists memorization. Every problem requires genuine reasoning: understanding abstract rules and applying them to novel situations. When frontier models trained on internet-scale data struggle to break 5%, the issue isn't training quality; it's that brute-force pattern matching simply doesn't solve these problems.

TRM's approach (iterative refinement over many steps) seems particularly well-suited for this type of reasoning.


What This Actually Means

1. Different Problems, Different Tools

Different architectures excel at different things. Small models aren't inherently better than large ones.

Large language models bring massive knowledge bases, pattern matching at scale, broad capabilities, conversational fluency, and internet-scale training data. Tiny recursive models bring focused reasoning, iterative refinement, task specialization, computational efficiency, and learned problem-solving processes.

Both are valuable. Both have their place.

GPT-4 can write code, explain quantum physics, and have nuanced conversations. TRM solves logic puzzles really well. They're complementary, not competitors.

2. Efficiency Opens New Possibilities

The cost and compute requirements are genuinely remarkable. Training TRM costs under $500, takes 2 days on 4 GPUs, and is accessible to any research lab. Deployment is equally lightweight: 7M parameters, fast inference, and actually feasible for edge devices.
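
Some back-of-the-envelope math (mine, not the paper's) shows why a 7M-parameter model is edge-friendly:

```python
# Back-of-the-envelope weight-storage math for a 7M-parameter model
# (dense weights, no extra quantization tricks assumed).
PARAMS = 7_000_000

for dtype, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    megabytes = PARAMS * bytes_per_param / 1024 / 1024
    print(f"{dtype}: ~{megabytes:.1f} MB of weights")

# fp32 ~26.7 MB, fp16 ~13.4 MB, int8 ~6.7 MB: small enough to ship inside a
# mobile app, versus hundreds of gigabytes for a 671B-parameter model.
```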

This opens doors that were previously closed:

New Possibilities

University researchers can experiment without massive budgets
Startups can build specialized reasoning systems affordably
Edge devices could run sophisticated reasoning locally
International researchers without datacenter access can contribute

3. Algorithm Innovation Still Matters

Scaling has worked brilliantly. From GPT-2 to GPT-4, we've seen incredible improvements from larger models trained on more data.

But TRM reminds us that algorithmic innovation matters too.

Instead of asking "how much bigger can we scale?", Samsung asked "what if we think differently about the architecture?" Both questions are valuable. Both drive progress.

What's exciting: we're exploring multiple paths simultaneously. Some problems might be best solved by massive LLMs. Others might benefit from smaller, specialized, iterative systems. Many will want hybrid approaches.

4. Hybrid Systems Are the Future

Here's what really gets me excited: What if we combine both paradigms?

LLMs bring knowledge retrieval, language understanding, and broad capabilities. TRM brings iterative refinement, focused reasoning, and efficiency. The hybrid potential: use LLMs for knowledge and TRM for reasoning. Best of both worlds: complementary strengths, not competition.

Imagine a system where an LLM provides world knowledge and language understanding, then hands off reasoning tasks to a TRM-style iterative refinement process. That's genuinely interesting.
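
As a rough sketch of that hand-off, here's how the orchestration could look. Both helper functions are hypothetical placeholders I made up, not any existing API:

```python
# Hypothetical orchestration sketch: neither function is a real API.
def llm_describe_task(raw_problem: str) -> dict:
    """Placeholder for an LLM call that parses the problem and adds context."""
    return {"puzzle": raw_problem, "hints": ["look for symmetry"]}

def trm_refine(task: dict, n_steps: int = 16) -> str:
    """Placeholder for a TRM-style solver that iteratively refines an answer."""
    answer = "initial guess"
    for step in range(n_steps):
        answer = f"refinement {step + 1} for {task['puzzle']!r}"
    return answer

def hybrid_solve(raw_problem: str) -> str:
    """LLM supplies knowledge and language; TRM runs the reasoning loop."""
    return trm_refine(llm_describe_task(raw_problem))

print(hybrid_solve("rotate the pattern and fill the missing cell"))
```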


The Limitations (Let's Be Real)

TRM excels at abstract reasoning puzzles. It won't replace your ChatGPT subscription. The model is designed for logic puzzles, not general conversation, writing, coding assistance, or knowledge retrieval.

From the paper: "Not all choices are guaranteed to be optimal on every dataset." The researchers themselves call for more investigation into when and why this approach works.

This is a proof-of-concept showing a different architectural approach can work well for certain problem classes, not a universal solution.


What Happens Next?

Samsung open-sourced the code. Right now, researchers worldwide are asking: Does this approach transfer to other reasoning domains? What if you give TRM 70M parameters instead of 7M? Can we combine TRM-style reasoning with LLM knowledge? What other tasks benefit from iterative refinement?

As someone who's spent considerable time thinking about how to test AI reasoning, I'm most excited about hybrid approaches. The future probably looks like combining different architectural innovations to solve different aspects of intelligence, not choosing between small models and large models.


The Bigger Picture

Here's what keeps me up at night after reading this paper: We're still early in understanding what architectures work best for what problems.

The encouraging takeaway: Scaling has worked incredibly well and will continue to improve. But there's still room for algorithmic innovation on specific problem classes. We can explore multiple approaches simultaneously. The field is healthier with diversity of ideas.

Working on ARC-AGI 2, I've watched teams throw progressively larger models at the benchmark. Some improvements came from scale. But watching TRM take a fundamentally different approach and succeed?

That's the kind of innovation that pushes the field forward.


The Real Lesson

Samsung's TRM shows us there's more than one path to progress.

What TRM Teaches Us

Algorithmic innovation still has room to run
Different architectures can excel at different problem types
Efficiency matters, especially for deployment and accessibility
We should explore multiple approaches, not bet everything on one paradigm

The future of AI probably looks like:

  • Large, powerful foundation models (continued scaling)
  • Specialized reasoning systems (TRM-style approaches)
  • Hybrid architectures (combining both)
  • Task-specific optimizations (knowing when to use what)

And that's exciting! The code is open source. Training costs $500. Multiple paradigms can coexist and learn from each other. We're still figuring this out.

More paths to explore means more innovation. And that's good for everyone.


Primary Sources:

Jolicoeur-Martineau, A. (2025). "Less is More: Recursive Reasoning with Tiny Networks." arXiv:2510.04871 (TRM paper; comparison scores from its Table 1).

Building AI products and navigating these architectural decisions? I'm Ryan Brandt, founder of Vunda AI. I help companies build production-ready AI systems that actually work. Reach out: ryan@vunda.ai
