"Theseus Fighting the Centaur Bianor", by Antoine-Louis Barye, 1867

A year shaped by synthetic data, dramatically uneven performance, and reliability issues

One of the reasons why I write is reflection. Looking over 2025’s work, there are consistent themes among the mess that help me understand the velocity of AI, its momentum and direction. I’m not going to polish this too much (if you want to dive in, check out the linked posts), but this exercise is quite clarifying to me.

Here’s the tl;dr:

  1. Immediate AI risk comes from people over-estimating AI capabilities.
  2. Reliability and trust are the barriers preventing wide adoption.
  3. Evaluations remain underutilized.
  4. Synthetic data unlocked AI capabilities, but shapes its nature.
  5. There is a growing AI perception gap between quantitative users and qualitative users.
  6. AI leaders are letting others define the story of AI.

Immediate AI risk comes from people over-estimating AI capabilities.

There are many risks we should be conscious of, but the downsides that are biting us now come from people believing in AI capabilities or sentience that isn’t there. “I don’t worry about superintelligent AGIs taking over the world. I worry about bots convincing people they’re having an emotional connection when they’re not.” This can be tied to teen suicides, senior scams, propagandist bots, and more. The natural language interface is wonderful for its flexibility and accessibility, but it exploits our evolutionary tendency to recognize humans where there are none.

This danger is made more pronounced by our current human-in-the-loop design pattern. We’re asking laypeople to evaluate AI capabilities in fields they do not understand. Too often I hear, “Chatbots know everything, but they make mistakes when it comes to things I know.”

Posts:


Reliability and trust are the barriers preventing wide adoption.

As we saw above, people can easily spot issues with AI when it’s working in their domain. Sure, we’ve come a long way this year, but these gains have mostly come from Intern-style applications. We keep the humans in the loop because humans are excellent at spotting and fixing issues the 10% of the time models flail.

But when that figure is higher than ~10% or so (these are finger-in-the-air numbers), people simply avoid the AI. Agents, especially custom enterprise ones, have a reliability problem that hinders the development of the field. Teams that successfully ship agents do so by dialing back their complexity: chat interfaces, short tasks.

But we should consider reliability a means to an end, and that end is trust.

Trust is complex. It’s dependent on the task being done, the risk associated with the task, the UI that presents the task, and how the agent contextualizes the decisions it produces. Reliability can be measured at the model level, but trust has to be assessed end-to-end: from the model, to the application, to the user.

Frustratingly, there are few good ways to measure trust in the AI era. We can do user interviews (and I know teams that do), but these are slow. UX research always has been, but its pace feels especially sluggish in the context of AI-powered development. Many teams hack around this by “vibe shipping” – making changes to their app, pushing to production, running a few queries, then repeating – basically doing the UX research by themselves, on themselves.

Everyone else should look to delegation. “Forget the benchmarks – the best way to track AI’s capabilities is to watch which decisions experts delegate to AI.”

Posts:


Evaluations remain underutilized.

At first I wrote, “under-appreciated.” But I think teams get why evaluations are valuable. The problem is most teams still don’t build them.

They get the benefits:

The real power of a custom eval isn’t just in model selection – it’s in the compound benefits it delivers over time. Each new model can be evaluated in hours, not weeks. Each prompt engineering technique can be tested systematically. And perhaps most importantly, your eval grows alongside your understanding of the problem space, becoming an increasingly valuable asset for your AI development.
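As a concrete (if toy) illustration of that workflow, a custom eval can be little more than a saved list of cases and a scoring loop that you re-run against each new model or prompt variant. The cases, the call_model placeholder, and the substring grader below are illustrative assumptions, not any specific provider’s API:

```python
# Minimal custom-eval sketch. `call_model` is a hypothetical stand-in for
# whatever chat-completion client you actually use; cases and grader are toys.
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected: str  # substring a correct answer must contain


CASES = [
    Case("What is the capital of France?", "Paris"),
    Case("Convert 2 hours to minutes.", "120"),
]


def call_model(model: str, prompt: str) -> str:
    """Placeholder: swap in your provider's completion call here."""
    raise NotImplementedError


def grade(answer: str, case: Case) -> bool:
    """Simplest possible grader: case-insensitive substring match."""
    return case.expected.lower() in answer.lower()


def run_eval(model: str) -> float:
    """Score a model (or a new prompt template) against the saved cases."""
    passed = sum(grade(call_model(model, c.prompt), c) for c in CASES)
    return passed / len(CASES)


# Re-run the same suite whenever a new model ships or a prompt changes, e.g.:
# print(run_eval("model-a"), run_eval("model-b"))
```

The suite itself is the asset: every new case you add compounds, and comparing models or prompts becomes a re-run rather than a research project.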

It used to be I had to argue that hand-tuned prompts would become overfit to a model. But OpenAI’s headline model deprecations this year pushed many teams to discover this empirically.

Despite this hiccup, many teams continue to push forward, hand-editing prompts and vibe shipping as they go. Pre-scale, this is likely optimal: the speed of iteration it allows is too valuable to ignore. As a result, many of the teams I talk to who were previously focused on evaluation tooling have pivoted to synthetic data creation or LLM-as-a-Judge services. Our AI capabilities have improved dramatically, but human behavior remains a constraint.

Posts:


Synthetic data unlocked AI capabilities, but shapes its nature.

Investing in synthetic data creation unlocked AI capabilities in 2025. Rephrasing high-quality content into reasoning and agentic chains kept the scaling party alive. Generating new datasets for verifiable tasks (like math and coding) helped AI coding apps evolve from better auto-complete services to async agents in less than a year.

Remember: Claude Code arrived in February.

Synthetic data did this. It provided the material needed for post-training, the mountains of examples necessary to upend an entire industry. But the limits of synthetic data – that it has been focused on quantitative, verifiable tasks – greatly shape our tools and discourse:

Those who use AIs for programming will have a remarkably different view of AI than those who do not. The more your domain overlaps with testable synthetic data and RL, the more you will find AIs useful as an intern. This perception gap will cloud our discussions.

The current solution, being deployed by frontier chatbots, is to treat everything they can as a programming problem. If ChatGPT or Claude can write a quick Python script to answer your question, it will. Context engineering challenges are being reframed as coding tasks: give a model a Python environment and let it explore, search, and read and write files. Yesterday’s harness is today’s environment. In 2024 we called models; today we call systems.
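To make “give a model a Python environment” concrete, here is a minimal sketch of such a loop, under stated assumptions: ask_model is a hypothetical stand-in for a tool-calling chat API (not any particular vendor’s), and the sandboxing is deliberately naive. The model either answers directly or writes a script; the harness runs it and feeds the output back as context.

```python
# Sketch of "yesterday's harness is today's environment": the model gets a
# run_python tool and loops until it produces a final answer.
# `ask_model` is a hypothetical placeholder for a tool-calling chat API.
import subprocess
import sys


def run_python(code: str, timeout: int = 10) -> str:
    """Execute model-written code in a subprocess and return its output."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout + proc.stderr


def ask_model(messages: list[dict]) -> dict:
    """Placeholder: returns either {'tool': 'run_python', 'code': ...}
    or {'answer': ...} from your provider's tool-calling API."""
    raise NotImplementedError


def answer(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = ask_model(messages)
        if "answer" in step:
            return step["answer"]
        # The model chose to compute rather than guess: run its script
        # and append the result to the conversation as new context.
        result = run_python(step["code"])
        messages.append({"role": "tool", "content": result})
    return "No answer within budget."
```

The point of the pattern is that the hard context-engineering work moves out of the prompt and into the environment: the model earns its answer by exploring, not by recalling.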

Scale was all we needed in 2024. Reasoning kept the party going in 2025. Coding will be the lever in 2026.

Posts:


There is a growing AI perception gap between quantitative users and qualitative users.

And this is the trillion-dollar question: can we replicate our coding gains in qualitative fields? Can we generate synthetic data that unlocks better writing? Can we turn PowerPoint creation into a coding exercise? If we give GPT-5.2 a Python notebook, can it write a better poem?

If these things can’t be solved with coding, there will be tremendous opportunity to improve the qualitative performance of models through other means. Doing so, however, will likely require solutions that are opinionated rather than general. Aesthetic performance requires subjective choices, not objective correctness.

But for now, the lopsided nature of today’s models is creating a world where programmers experience a very different AI than most ChatGPT users. The divide in capabilities between a free ChatGPT or Copilot account and Claude Code with Opus 4.5 is vast. Public conversations about AI are deeply unproductive because what you and I are experiencing is lightyears beyond the default experience.

Posts:


AI leaders are letting others define the story of AI.

Compounding this problem is the fact that AI leaders aren’t even attempting to explain how AI works to the masses. I recently wrote:

The AI ecosystem is repeating digital advertising’s critical mistake.

One of the reasons the open online advertising ecosystem fell apart is that it did a terrible job of communicating how it all worked. The benefits of cross targeting were brushed over, because they were hard and complex to explain, and that left the door open for others to make privacy the only story, until it was too late. That created the environment we have now, where most quality media is paywalled and only the giant platforms have sufficient scale for effective targeting.

The AI industry is failing to explain how AI works. People and companies either brush it aside as too complex or oversimplify it with over-promised metaphors (“A PhD in your pocket!”). These same people then get upset when critics keep wringing their hands about hallucinations, financial engineering, power and water consumption, and much more.

AI leaders don’t invest in explanations because AI is hard to explain. Further, they’re incentivized to over-simplify and over-promise. Combine this with the lightning speed of development (even Karpathy feels left behind!) and AI’s jagged intelligence becomes a fault line, threatening to rupture.

Posts: