<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://www.dbreunig.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.dbreunig.com/" rel="alternate" type="text/html" /><updated>2026-03-06T16:35:42-08:00</updated><id>https://www.dbreunig.com/feed.xml</id><title type="html">Drew Breunig</title><subtitle>Writing about AI, geo, culture, media, data, and the ways they interact.</subtitle><author><name>Drew Breunig</name></author><entry><title type="html">Learnings from a No-Code Library: Keeping the Spec Driven Development Triangle in Sync</title><link href="https://www.dbreunig.com/2026/03/04/the-spec-driven-development-triangle.html" rel="alternate" type="text/html" title="Learnings from a No-Code Library: Keeping the Spec Driven Development Triangle in Sync" /><published>2026-03-04T07:06:00-08:00</published><updated>2026-03-04T18:11:17-08:00</updated><id>https://www.dbreunig.com/2026/03/04/the-spec-driven-development-triangle</id><content type="html" xml:base="https://www.dbreunig.com/2026/03/04/the-spec-driven-development-triangle.html"><![CDATA[<p>The following is a write up of a talk I delivered at <a href="https://mlops.community">MLOps Community’s</a> “Coding Agents” conference, on March 3rd. There’s <a href="https://www.youtube.com/watch?v=8TXAlOFkmk0">a video version of the talk available on YouTube</a>.</p>

<p>I share what I learned building a no-code library, why spec-driven development is a feedback loop not a straight line, historical parallels for our current moment, and a PoC tool for keeping specs/tests/code in sync.</p>

<p>Finally, we consider what GitHub should look like in the era of coding agents.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_001.jpg" alt="" /></p>

<p>I was invited here today to talk about a project I launched — a software library with no code — which got a lot of really interesting feedback. I’m going to tell you the whole story, how I got it wrong, explore a bit of historical context, then propose a path forward.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_002.jpg" alt="" /></p>

<p>Last Fall, Opus 4.5 launched and surprised everybody with the quality of the code it was produced and the problems it could solve. Opus 4.5 was good enough that we started to ask some really big questions.</p>

<p>I wondered: if the agents are good enough, why do we need to share code?</p>

<p>Whenever I have a big question like this, one that requires lots of thought, I like to go for a long bike ride. So I did, and while I was riding I came up with the idea to ship a software library with no code.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_003.jpg" alt="" /></p>

<p>And so we have <a href="https://www.dbreunig.com/2026/01/08/a-software-library-with-no-code.html"><strong>whenwords</strong></a>.</p>

<p>Open source, freely licensed. It’s a GitHub repository with a markdown file describing what the library is supposed to do. It’s a library that takes a Unix timestamp and converts it into something human-readable — “about 12 o’clock,” “five hours ago,” things like that.</p>

<p>I also generated about 750 conformance tests in YAML: given this input, I expect this output. And one more file called <code class="language-plaintext highlighter-rouge">install.md</code> — a single paragraph you’d paste into the agent of your choice, with all the instructions for building the code. You’d drop in what language you wanted and where to save it.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_004.jpg" alt="" /></p>

<p><code class="language-plaintext highlighter-rouge">whenwords</code> kicked off a lot of conversation about spec-driven development. It’s something more and more people are thinking about: the idea that if you bring specs, which define the <em>what</em>, <em>why</em>, and sometimes <em>how</em>, and tests, which measure and validate behavior, the code will just flow from that. Give it to an agent, get code out.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_005.jpg" alt="" /></p>

<p><code class="language-plaintext highlighter-rouge">whenwords</code> kind of blew up. <a href="https://x.com/karpathy/status/2015887154132746653">Karpathy was a fan</a>. <code class="language-plaintext highlighter-rouge">whenwords</code> has over 1,000 stars on <a href="https://github.com/dbreunig/whenwords">Github</a>.</p>

<p>What was even crazier was that I started getting normal GitHub interactions. People submitted issues. They submitted pull requests. And the pull requests were <em>good</em>, things like: “In this test, you’re expecting this result, but that violates the rounding rule you detail in the spec. You need to true these up.”</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_006.jpg" alt="" /></p>

<p>But I wasn’t the only one with this idea. Larger teams started shipping larger projects. whenwords was a toy; small, constrained, 750 tests. But then:</p>

<p><strong>Vercel</strong> released <a href="https://github.com/vercel-labs/just-bash">just-bash</a>, a simulated bash environment with an in-memory virtual filesystem, written in TypeScript. Basically re-implementing Bash in TypeScript. They’re running shell scripts against it to verify behavior.</p>

<p><strong>Pydantic</strong> released <a href="https://github.com/pydantic/monty">Monty</a>, a Python interpreter written in Rust. Fast, safe, ideal for agent REPLs and code use. Same approach: a pile of Python tests, throw it at the model, make it pass.</p>

<p><strong>Anthropic</strong> famously threw <a href="https://www.anthropic.com/engineering/building-c-compiler">16 Claudes and $20,000 at a spec suite to build a Rust-based C compiler</a>. It didn’t really work. But it was pretty cool.</p>

<p>I couldn’t stop thinking about Spec Driven Development and how far we might push it.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_007.jpg" alt="" /></p>

<p>I think there’s a few learnings from this first wave of Spec Driven Development.</p>

<p><strong>Tests and specs aren’t free or easy.</strong> All the projects we surveyed used large existing testing libraries from existing projects: the Bash tests, the Python tests, the C tests. Those are the low-hanging fruit. I joked online (and I’m not the first) that pretty soon anyone who wants to protect themselves is going to be like SQLite, where the code is free but you’ve got to pay for the tests. Tests are precious.</p>

<p><strong>Implementation is fast, but not instant.</strong> You go fast at first, but none of these projects are complete. just-bash is still being worked on. Monty is missing JSON and other standard libraries. Anthropic’s C compiler stalled out. It gets hard. It’s not perfect and it’s not free.</p>

<p><strong>As complexity grows, structural choices become more important.</strong> This was especially clear in the Anthropic project. They got incredibly far, down to 1% of failing tests. But every time they fixed a new bug, it broke something else. Systemic changes required systemic thinking, not just local fixes.</p>

<p><strong>Architectures that allow parallel development are incredibly valuable.</strong> What it allows you to do is move fast with multiple agents. And, this is something I haven’t seen explored yet, it allows for open source contribution. Rather than spending $20,000 to build a C compiler, what if you structured it so everyone knows what chunk they can work on? It’s like SETI@home, except I’m not using your engineering expertise. I’m using your Claude Code subscription. Which I think is wild.</p>

<p>But the biggest learning — and the one we’re going to spend the most time on today — is that <strong>sometimes the spec and tests aren’t sufficient.</strong></p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_008.jpg" alt="" /></p>

<p>One of my favorite things to do is look at the PRs and issues for all of these libraries. Even with a great spec — “make it run Python perfectly in Rust, here are all the tests, just make them pass” — there are still 20-comment threads about what the <em>right</em> way to implement something is.</p>

<p>Because no spec is perfect. And this is probably my biggest takeaway today:</p>

<p><strong>Implementing the code helps us improve our spec.</strong></p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_009.jpg" alt="" /></p>

<p>Let’s take a digression. We’re at the Computer History Museum, so let’s go back into history. Specifically the history of code and managing code.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_010.jpg" alt="" /></p>

<p>One of my favorite jokes about AI development is one I stole from Matt Levine, who writes the finance newsletter <a href="https://www.bloomberg.com/account/newsletters/money-stuff">Money Stuff</a>. In it, he has a running joke about crypto people speed-running financial history, from first principles, as they attempt to build new financial infrastructure. <em>We are doing that with software engineering and AI coding.</em></p>

<p>I’m lucky: one of my co-founders, Heather Miller, is a professor at CMU and a programming languages expert. I can call her up, share my theories, and ask: “Heather, tell me this has already been dealt with. Who should I be talking to and what shoud I read?” This time, she said, “Of course it has, Drew,” and introduced me to her office neighbor, Professor Claire Le Goues. Claire then who walked me through the relevant software engineering history, which I’m going to share today because it is incredibly relevant to our current situation.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_011.jpg" alt="" /></p>

<p>In 1963, Margaret Hamilton was writing and managing software effort for NASA’s Apollo missions. She coined the term “software engineering” because, running this giant, complicated project that couldn’t have errors in it, she realized: this is <em>engineering</em>. It’s systems design, we have to worry about errors, we have to worry about unexpected inputs like astronauts pushing the wrong button.</p>

<p>And also: we now have enough code that no one person can hold it in their head. Which is a problem, because then you can’t reason about it effectively. And it gets even worse when a team is working on it.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_012.jpg" alt="" /></p>

<p>By the way: this is <em>her</em> code. This is what she was managing. This is her VS Code.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_013.jpg" alt="" /></p>

<p>And this is her Git.</p>

<p>I’m a dad, which means dad jokes come naturally. So I’m going to retroactively coin Hamilton’s Law: <strong>when you can’t see over your code, you can’t oversee your code.</strong></p>

<p>(Sorry.)</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_014.jpg" alt="" /></p>

<p>After Hamilton dealt with this problem, others realized it was a problem too.</p>

<p>NATO held a conference in Berlin and identified the “<em>Software Crisis</em>”: computer hardware now allowed programs so complex they couldn’t be managed adequately. A single engineer couldn’t hold all the code in their head. If they were going to continue delivering on what software could promise, they needed process.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_015.jpg" alt="" /></p>

<p>Dijkstra popularized this in his 1972 Turing Award lecture. He said:</p>

<blockquote>
  <p>As long as there were no machines, programming was no problem at all. When we had a few weak computers, programming became a mild problem. And now we have gigantic computers, programming has become an equally gigantic problem.</p>
</blockquote>

<p>He said this in 1972. Maybe later, walk around the museum we’re in and look at what he was working with then. Then consider what we’re working with today.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_016.jpg" alt="" /></p>

<p>So after the Software Crisis emerged, we wandered through the desert of <em>processes</em>, searching for one to borrow. We looked at manufacturing engineering. In 1975, Brooks published <em>The Mythical Man-Month</em>. And finally, Waterfall was adopted as a DoD standard. We learned how to engineer complex software. Progress.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_017.jpg" alt="" /></p>

<p>But these things move in cycles. In 2001, we published the Agile Manifesto. Zuckerberg said it’s time to move fast and break things. We embraced Agile, and Agile was finally realized by the cloud and GitHub — which enabled continuous CI/CD and let us offload enough of the error-checking that we didn’t break things <em>too</em> often, even when moving fast.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_018.jpg" alt="" /></p>

<p>Which brings us to today.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_019.jpg" alt="" /></p>

<p>I added this slide right at the last minute, because I logged into Twitter to check something and saw today’s trending news: <em>“AI Coding Boosts Output But Overwhelms Human Reviews.”</em> And it’s paired with that last headline: <em>“OpenAI Codex Leaks Hint at GPT-5.4 Amid Speedy Updates.”</em> So not only is it overwhelming us, it’s accelerating.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_020.jpg" alt="" /></p>

<p>So what do we learn from this history rabbit hole?</p>

<p>Being overwhelmed by the volume of code isn’t a new problem. It’s what <em>birthed</em> software engineering.</p>

<p>The initial Software Crisis was our inability to manage complex codebases new computers allowed. Our current Software Crisis is our inability to manage complex codebases new models allow.</p>

<p>Our problem used to be that we couldn’t hold an entire codebase in our head. Now we can’t even <em>read</em> our entire codebase.</p>

<p>Agentic engineering enables waterfall volume at the cadence of agile. And even that undersells it: it’s waterfall times ~two at the cadence of agile times ~seven.</p>

<p>We keep oscillating, historically, between unhindered velocity and managed process. We could use some process right about now. Perhaps AI can help…</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_021.jpg" alt="" /></p>

<p>I’m not the only one asking this question.</p>

<p>For the last couple of quarters, people have been trying to figure out how to deal with this onslaught of code. The most dramatic example is Gas Town — you’re all familiar with it — an infrastructure for managing a coding process that grew beyond one person’s ability to manage.</p>

<p>But Gas Town just moves the problem. It doesn’t solve it. Steve Yegge even admits this in <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04">the Gas Town blog post</a>:</p>

<blockquote>
  <p>Gas Town is complicated. Not because I wanted it to be, but because I had to keep adding components until it was a self-sustaining machine. And the parts that it now has, well, they look a lot like Kubernetes mated with Temporal and they had a very ugly baby together.</p>
</blockquote>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_022.jpg" alt="" /></p>

<p><strong>If the process is complex, we’re just moving the problem.</strong></p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_023.jpg" alt="" /></p>

<p>So let’s go back to what we defined spec-driven development to be. This idea that it’s an equation: bring specs, maybe add some tests, add an agent, get code out.</p>

<p>I got this wrong. This is the wrong way to think about it. Because this isn’t a one-way equation. It’s a feedback loop. The act of writing code improves the spec, and it improves the tests. Just like software doesn’t really work until it meets the real world, a spec doesn’t really work until it’s implemented.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_024.jpg" alt="" /></p>

<p>So instead of an equation, I propose a triangle. The spec defines what tests need to be written, and what code needs to be written. Tests validate the code. That’s essentially what we had before, just in a different shape.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_025.jpg" alt="" /></p>

<p>But the act of implementing code generates new decisions. Those decisions inform the spec. And when the spec updates, new tests need to be written. And sometimes it’s not new decisions — it’s just dependencies or subtle choices. New code surfaces new behaviors that need to be tested.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_026.jpg" alt="" /></p>

<p>I call this: <strong>the Spec-Driven Development Triangle.</strong></p>

<p>As each node moves forward, our job — and our tooling’s job — is to keep those nodes in sync. That’s the job. If we improve the code, we must improve the spec.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_027.jpg" alt="" /></p>

<p>But keeping the nodes in sync is hard.</p>

<p>Writing tests is hard. Even before agents, we couldn’t write tests. We don’t like writing tests and we’d prefer not to.</p>

<p>Writing specs is hard. They can never be exhaustive, leave room for interpretation, and are written before the software meets the real world. The spec gets written, it gets implemented, it gets released. Is the spec updated? No.</p>

<p>Specs are written at a different cadence than code, in a different medium. If only we had something that could read natural language.</p>

<p>Updating specs and tests feels like overhead, especially when you’re moving fast. And the entire point of using agents is to move fast. Any system we design has to respect that.</p>

<p>Implementation is messy, and often humans and LLMs take shortcuts. Humans say “I’m not going to implement that right now” or “I’ll come back and fix this.” LLMs certainly do this.</p>

<p>And so regular reconciliation of tests, spec, and code is not part of the process.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_028.jpg" alt="" /></p>

<p>But thankfully, there are signals we can work with.</p>

<p>Code changes are tracked by Git, and we can compare them against the spec to find gaps.</p>

<p>Test coverage tools tell us what code is tested — but not whether the tests reflect the spec. It’s not just about covering the code. The tests have to cover the spec.</p>

<p>Updates to the spec — if a product manager logs in and changes something — are also tracked by Git. Is the rest of the system changing with it?</p>

<p>Bug reports and hotfixes that go straight into code or tests need to be captured and rolled into the spec.</p>

<p>And most importantly: implementing the code with an agent generates decisions — from both the humans and the agent. Those decisions exist in the traces. We can look at the traces from our coding agents and find where decisions were made. That’s the signal we need to keep everything in sync.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_029.jpg" alt="" /></p>

<p>So we have tangible things we can analyze. And a goal to aim towards…</p>

<p>One of the the nice thing about having a thought experiment during the era of great coding agents is that you can <em>just try building it</em>. And as you implement it, you improve it.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_030.jpg" alt="" /></p>

<p>This is my tool. I call it <strong>Plumb</strong>, after a plumb bob, because it keeps things true. A plumb bob hangs from a line and helps a carpenter keep things straight. Even better, they used to be held on tripods, which echoes the triangle.</p>

<p>You can install it right now: <code class="language-plaintext highlighter-rouge">pip install plumb-dev</code> or <code class="language-plaintext highlighter-rouge">uv add plumb</code>.</p>

<p>It’s not perfect. It’s a proof of concept. A thought experiment as code. But I’ve been using it, and it’s pretty great.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_031.jpg" alt="" /></p>

<p>Here’s how Plumb works.</p>

<p>Plumb is a command line tool. Every time you’re working with an agent and you run <code class="language-plaintext highlighter-rouge">git commit</code>, it identifies decisions made by evaluating the code diff from the last commit and by reading the agent traces (all the conversations since that last commit). It extracts the decisions, dedupes them, and presents them to you: <em>here are all the decisions you made, do you agree?</em></p>

<p>Once you’ve approved, it updates the spec to reflect those decisions. It runs <code class="language-plaintext highlighter-rouge">sync</code> and reports on coverage gaps between the spec and the tests, and the spec-to-code coverage. Is the code actually reflecting what the spec defines?</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_032.jpg" alt="" /></p>

<p>As it does this, it generates files that become artifacts you can track. My favorite is a big JSONL file of <em>decisions</em>.</p>

<p>Here’s one example: <em>“Should spec updates be batched across all decisions, or run individually for each decision?”</em> My decision — batch them. It says it was made by the user, not the LLM. I have blame. And you can see how we can enrich this over time: tie it to code, to branch, to whether it was informed by the conversation, when it was approved, when it was synced. This is not just the code changes. It’s the intent.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_033.jpg" alt="" /></p>

<p>To set up Plumb in your project: install it, go to your project directory, run <code class="language-plaintext highlighter-rouge">plumb init</code>. It’ll ask you to specify your spec markdown file or folder and show it where your tests are. It creates a <code class="language-plaintext highlighter-rouge">.plumbignore</code> to tell it when to skip decision generation — changing the README, for example, doesn’t need to generate decisions. It creates a <code class="language-plaintext highlighter-rouge">.plumb</code> folder to store state and config. Very similar to <code class="language-plaintext highlighter-rouge">.git</code>.</p>

<p>Most importantly: it adds hooks to Git. When you run <code class="language-plaintext highlighter-rouge">git commit</code>, it extracts the decisions. If there are decisions to review, the commit fails. It exits and tells you to review your decisions and approve, reject, or edit them. That’s what makes this work anywhere: command line, CI pipeline, inside your coding environment. It just works. And that’s a hard requirement.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_034.jpg" alt="" /></p>

<p>The other thing Plumb generates is a breakdown of your spec into individual requirements — the atomic statements that make up what your spec defines. Ambiguous or not, what source file it came from, eventually linked directly to the code. Right now I use a commenting format to link tests back to the requirement they’re testing, so coverage mapping can show which requirements have tests and how many.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_035.jpg" alt="" /></p>

<p>Our aim is link spec to requirements, requirements to code, requirements to tests, decisions to requirements. We’re building a new object graph extending off the code diffs. And eventually — edit the spec, the tests, or the code, pick your poison, and everything else gets brought along.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_036.jpg" alt="" /></p>

<p>Now, as you design this, the interesting design choices start to emerge.</p>

<p><strong>Can’t this just be a skill?</strong> There are already code review skills, superpowers, things like that. Why not just use those?</p>

<p>I don’t think it can be a skill. Whatever tool we end up using for tracking decisions and intent, it cannot live only inside the agent. It needs to run outside. It needs to handle small commits, triggers, anything…even if you never touch the agent.</p>

<p>A skill is a suggestion. A tool needs to be a checkpoint. That commit-fail mode is essential. Otherwise it gets ignored. We’ve all had this happen with Claude Code.</p>

<p>And the system needs to be canonical. It can’t be optional. Agents wander. Validation needs to be more deterministic. When we can use code, we will. This is a validation and verification step. Fuzzy LLM calls are a last resort.</p>

<p>When we do use LLMs — parsing the spec, extracting decisions — we use DSPy. It lets us structure LLM calls with tight inputs and outputs. It lets us optimize, test, and choose which models to route to. Speed matters enormously here. For decision deduplication, I’m routing to GPT, because it’s faster than anything Anthropic offers for that task. And the whole thing has to be simple enough for the developer to hold in their head.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_037.jpg" alt="" /></p>

<p>Of course, there are real limitations.</p>

<p>Plumb only supports pytest. I want it to support any test framework and conformance tests, such as language-agnostic tests like <code class="language-plaintext highlighter-rouge">whenwords</code> used.</p>

<p>Decisions might interrupt your flow on long-running tasks. If I make a quick fix and generate five decisions, I have to sit through a review. That needs to be tunable. Maybe you don’t want it to bother you for lightweight decisions, only surface things that are vague or contradict previous decisions. I suspect this is something that will be dictated by the type of project you’re working on.</p>

<p>Deduplication isn’t perfect. Decision identification is fuzzy and will likely need to be project-specific.</p>

<p>Code reversals on decision rejection aren’t working yet. I’d like it so that when you reject a decision the LLM made, it goes back and undoes it. The reason it’s not implemented is that the flow needs to be right: if you reject from the command line, nothing automatic should happen. If you reject from inside the agent, the agent should act on it.</p>

<p>It needs better tools for managing the spec. Mine has grown long and probably should be sharded into sections. Thankfully, this is something an LLM can and should do. Though, we have to be careful when doing it. Perhaps we can perform dry runs, regenerate requirements from the shards, then confirm they match the original spec…</p>

<p>Plumb should be tunable for “just enough” structure. Can I run with <code class="language-plaintext highlighter-rouge">--dangerously-approve-all-decisions</code>? Sometimes I want to.</p>

<p>And it’s untested on large projects. Hell, it’s untested in general.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_038.jpg" alt="" /></p>

<p>But here’s the fun part: I’ve been testing this by using it to manage the project itself. Using Plumb to build Plumb. And it’s been genuinely useful.</p>

<p>Claude can refer to the spec for implementation understanding without searching the entire codebase. The decision log has proven valuable for answering “why does this code exist?” — I can ask the agent, “is there a decision we made that explains why this is implemented this way?” And it can find it.</p>

<p>It’s code review, but code review where we capture intent. When I hit commit in Claude Code, I get a list of decisions and I step through them. Sometimes I hit one I don’t like and I stop right there. I reject it, go back, redo it. I like that better than pure code review.</p>

<p>It actually spots and controls weird silent LLM behavior. We’ve all let an agent run while we answer email and come back to something insane. Now I get a decision and I can say “don’t do that, let’s roll that back.”</p>

<p>And hacks get documented. I’ve taken shortcuts in this app. Now I know they exist. I can search back for all the shortcuts and then go fix them. The decision log becomes an artifact — not just of code changes, but of intent.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_039.jpg" alt="" /></p>

<p>So let’s take this question further. Say Plumb exists and does exactly what I want. <em>How could GitHub be better with this kind of information?</em></p>

<p>Right now, the main way we interact with code is with Markdown and chat. And GitHub has not changed <em>anything</em> about how we interact with Markdown and text on their site. Could my Markdown diffs have decisions linked to them, so I can see how intent manifests in the code?</p>

<p>I think any version of GitHub that takes the agentic era seriously needs to do four things:</p>

<p><strong>Spec, tests, and code have to be first-class citizens.</strong> Code is already. Tests are close — GitHub Actions gets you there. But Markdown is not. Microsoft is probably leaving a lot of inference revenue on the table by not treating it seriously.</p>

<p><strong>Markdown has to be a first-class citizen.</strong> This is the gap.</p>

<p><strong>We need to see the linkages.</strong> Users need to follow connections between decisions, requirements, code, tests, and spec. Spec-driven development right now is treated as a one-shot thing: write the spec, hit go, you’re done. It’s not. It’s a process. You need to track all of it over time.</p>

<p><strong>Users should be able to ask questions of the system.</strong> Not just read it — query it, to get closer to understanding intent. That’s how you actually understand a codebase that’s too large to read.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_040.jpg" alt="" /></p>

<p>So here are my takeaways from the journey from whenwords to Plumb.</p>

<p><strong>Code implementation clarifies and communicates intent.</strong> I could stop there and walk out of the room. I missed this with <code class="language-plaintext highlighter-rouge">whenwords</code>.</p>

<p><strong>The job is to keep specs, code, and tests in sync as they move forward.</strong> The system for managing that has to stay simple. If it creates developer mental overhead, it just moves the problem somewhere else.</p>

<p><strong>The act of writing code improves the spec and the tests.</strong> Just like software doesn’t truly work until it meets the real world, a spec doesn’t truly work until it’s implemented.</p>

<p><strong>No-code libraries are toys because they are unproven.</strong></p>

<p><strong>Even if you aren’t the one making decisions during implementation, decisions are being made.</strong> We should leverage LLMs to extract and structure those decisions.</p>

<p>And finally: <strong>we’ve been here before.</strong> The answer then was process. The answer now is <em>also</em> process. And just as we leverage cloud compute to enable CI/CD for agile, we should leverage LLMs to build something lightweight enough that we can <em>fit in our heads</em>, <em>doesn’t slow us down</em>, and helps us <em>make sense of our software</em>.</p>

<hr />

<p><img src="/img/sdd_triangle/sdd_triangle_041.jpg" alt="" /></p>

<p>Again: thank you very much to <a href="https://clairelegoues.com">Professor Claire Le Goues</a>, who helpfully walked me through the history of computer science. The history section of this talk is entirely thanks to her. And <a href="https://www.amazon.com/Try-Turning-Off-Again-Weirdness/dp/0691294097/ref=sr_1_1?crid=3LIVIZU338F1J&amp;dib=eyJ2IjoiMSJ9.fyNgXWzlKFZz2ACK5pxUeQMAciv-s4HcET4tuumDVG0.xYw8JSdzVLiL0hYljkAToOc07qkcmpv_Rqs8ePtr7mw&amp;dib_tag=se&amp;keywords=Claire+Le+Goues&amp;qid=1772648451&amp;sprefix=claire+le+goues%2Caps%2C164&amp;sr=8-1">she has a book coming out</a>, aimed at a wider audience, later this year. Do check it out.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>

<style>
img {
    border: 1px solid #ccc;
}
</style>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="coding" /><category term="software engineering" /><category term="spec driven development" /><summary type="html"><![CDATA[The following is a write up of a talk I delivered at MLOps Community’s “Coding Agents” conference, on March 3rd. There’s a video version of the talk available on YouTube.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/sdd_triangle/sdd_triangle_026.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/sdd_triangle/sdd_triangle_026.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">We’re Talking About Terms of Use, But the Issue is Embedded Judgment</title><link href="https://www.dbreunig.com/2026/03/01/the-issue-is-embedded-judgment.html" rel="alternate" type="text/html" title="We’re Talking About Terms of Use, But the Issue is Embedded Judgment" /><published>2026-03-01T09:13:00-08:00</published><updated>2026-03-01T09:48:46-08:00</updated><id>https://www.dbreunig.com/2026/03/01/the-issue-is-embedded-judgment</id><content type="html" xml:base="https://www.dbreunig.com/2026/03/01/the-issue-is-embedded-judgment.html"><![CDATA[<h3 id="the-biggest-buyers-will-want-to-audit-and-influence-post-training">The biggest buyers will want to audit and influence post-training</h3>

<p>Beneath the Anthropic and Department of War fracas, there is a legitimate &amp; essential conversation to be had about how much control any organization has when deeply adopting an AI model they didn’t train.</p>

<p>These are probabilistic systems, with near infinite surface area to test, that are intentionally <em>designed</em>. Models are used to inform and make decisions, and they all have embedded perspectives.</p>

<p><strong>AI is unlike other technology purchases because AI has embedded judgment.</strong></p>

<p>I’m not sure what the answer is here, only that we need to have this discussion (calmly) and that anyone who tells you this isn’t a problem, that their model has an objective God-view-from-nowhere, is selling you something.</p>

<p>Let me be clear: I agree strongly with Anthropic’s usage red lines. I gladly choose Claude myself.</p>

<p>But this conversation is being framed badly around usage. Many are talking about Anthropic’s “terms of service” (notably, both Hegseth and Trump even capitalized the term in their tweets), but I think allowed usage terms are red herring. The issue is embedded judgment.</p>

<p>If I were in military procurement, I would certainly some big questions about what “soul documents” or “<a href="https://www.anthropic.com/news/claude-new-constitution">constitutions</a>” (or similar) are embedded in <em>any</em> model being considered for embedding throughout the armed forces (and <em>all</em> the labs make design choices during post-training).</p>

<p>And clearly this is something Anthropic is already dealing with! This section, from the above blog post, suddenly becomes much more interesting:</p>

<blockquote>
  <p>This constitution is written for our mainline, general-access Claude models. We have some models built for specialized uses that don’t fully fit this constitution; as we continue to develop products for specialized use cases, we will continue to evaluate how to best ensure our models meet the core objectives outlined in this constitution.</p>
</blockquote>

<p>We don’t know if post-training control helped blow up the deal (I tend to believe the issue was about allowed usage, based on the administration’s and Anthropic’s statements, coupled with OpenAI’s announced terms). But I think it’s a safe bet many militaries will insist on influencing and auditing the post-training for their purchased variants.</p>

<p>I wrote back in 2023 that <a href="https://www.dbreunig.com/2023/12/21/a-model-for-every-culture.html">I expect states and cultures to build their own models</a> for related reasons; I wasn’t thinking about defense tech at the time but it certainly amplifies the issues.</p>

<p>One takeaway: this is a strong argument for why the AI race isn’t going to be winner-take-all. Everyone wants a champion to trust.</p>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="policy" /><summary type="html"><![CDATA[The biggest buyers will want to audit and influence post-training]]></summary></entry><entry><title type="html">Two Beliefs About Coding Agents</title><link href="https://www.dbreunig.com/2026/02/25/two-things-i-believe-about-coding-agents.html" rel="alternate" type="text/html" title="Two Beliefs About Coding Agents" /><published>2026-02-25T14:12:00-08:00</published><updated>2026-02-25T15:23:00-08:00</updated><id>https://www.dbreunig.com/2026/02/25/two-things-i-believe-about-coding-agents</id><content type="html" xml:base="https://www.dbreunig.com/2026/02/25/two-things-i-believe-about-coding-agents.html"><![CDATA[<p>There’s a lot of noise about how AI is changing programming these days. It can be a bit overwhelming.</p>

<p>If you hang out on social media, you’ll hear wild claims about people running 12 agents at once, for days. Or people hacking bots together, giving them $10k, and letting them roam the web.</p>

<p>The challenge with all of this is that coding agents <em>really are</em> performing some science fiction feats which were barely imaginable just 12 months ago. But at the same time, the ecosystem is incentivizing the most outlandish claims, so punters keep telling tall tales. Separating the signal from the noise is near impossible.</p>

<p>I’m lucky enough to talk to a range of developers and teams, spanning a variety of company sizes and a broad array of skill sets. From these conversations, two beliefs have emerged and solidified about coding agents and their (current) impact on coding.</p>

<p>Let’s start with belief number one:</p>

<p><strong>Most talented developers do not appreciate the impact of the intuitive knowledge they bring to their coding agent.</strong></p>

<p>We’ve all seen the posts by developer luminaries. They haven’t written code in weeks. They gave a hard problem to Claude Code or Codex and <em>it just worked</em>.</p>

<p>But what we don’t see is their prompts. And having seen <em>many</em> prompts by <em>many</em> types of devs, I would wager their prompts are relatively specific and offer more guidance to the LLM than your average user. And these specifics don’t have to be exhaustive. Even knowing the right terms to use can have enormous impact and activate an entirely different set of weights in the model than someone writing, “the search is broken fix it.”</p>

<p>Skilled programmers, with plenty of experience, don’t even think about how to ask correctly. They just do, intuitively. And things work well. If the agent and dev go through multiple turns, this effect gets even more significant.</p>

<p>I wish we could see more prompts and traces, from a wide range of developers, to better understand the range of code. And, just as interestingly, how hard and long agents have to work to achieve the goal. For now we can just browse public repos on Github, where the range of coding quality is quite broad.</p>

<p>Which brings me to the second belief:</p>

<p><strong>Most work people are sharing are incredible personal tools, but they are not capital-P products.</strong></p>

<p>There’s an app I really like called “<a href="https://streetpass.social">StreetPass</a>.” It’s a browser extension that watches web pages you visit and collects Mastodon accounts it finds, letting you easily follow them if you wish. It’s small and charming. A perfect extension.</p>

<p>Recently, I realized I wanted a version of StreetPass, but for RSS feeds instead of Mastodon accounts. I forked StreetPass, fired up Claude Code, and had <a href="https://github.com/dbreunig/feedpass">a working version quickly</a>. You can use this, but I’m not supporting it. I won’t be pushing it to the App Store or Chrome Web Store. I won’t be building a version that doesn’t leverage <a href="https://feedbin.com">Feedbin</a>. I have no idea if it works on Chrome or Firefox. It’s personal software that I use almost daily.</p>

<p>Most agentic coding projects we see being hyped are like this.</p>

<p>All those things I won’t do, those are the things that would turn my <em>personal software</em> into a <em>Product</em>. And we haven’t even gotten to marketing, support, and more. As we covered when we <a href="https://www.dbreunig.com/2026/02/21/why-is-claude-an-electron-app.html">touched on Claude’s desktop app</a>, the last 10% of product development and support is where the pain is. And that’s still a long road. As they say: <a href="https://www.dbreunig.com/2026/02/06/the-rise-of-spec-driven-development.html">Code today is free, as in puppies</a>.</p>

<p>But I want to be clear about couple things.</p>

<p>First, I know many teams shipping agent written code into products. But they test, support, review, and so much more. But when we make big claims like “coding is solved” or “code is free”, we need to be clear about <em>what</em> we’re talking about building<sup id="fnref:grady"><a href="#fn:grady" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>.</p>

<p>Second, our ability to manifest personal software easily <em>is amazing</em> and powerful. I am continually inspired by the things people build (for example, I loved <a href="https://simonwillison.net/2026/Feb/25/present/">Simon’s presentation software he whipped up for FOO Camp</a>). His presentation app is so tailored to him, in the past the math would never justify the time spent building it to support a market of maybe a dozen. But now he gets his dream!</p>

<p>Similarly, my RSS finder extension is a feature not an app and (sadly) there isn’t a large market for RSS today. But with Claude Code (and open source code to build upon!) I can build just what I wanted in moments.</p>

<hr />

<p>I am sure as our scaffolding and models improve, this stuff will get more accessible and more resilient, but I don’t expect these two beliefs to go away. Providing AI with the right instructions to obtain <em>just</em> what you want, will always be a challenge.</p>

<p>Coding agents amplify existing skills.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:grady">
      <p><a href="https://x.com/Grady_Booch/status/2026736492488568955">Grady Booch</a> has a good post about this today. Things are getting higher level, and changing fast, but engineering remains. <a href="#fnref:grady" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="software development" /><summary type="html"><![CDATA[There’s a lot of noise about how AI is changing programming these days. It can be a bit overwhelming.]]></summary></entry><entry><title type="html">Why is Claude an Electron App?</title><link href="https://www.dbreunig.com/2026/02/21/why-is-claude-an-electron-app.html" rel="alternate" type="text/html" title="Why is Claude an Electron App?" /><published>2026-02-21T10:00:00-08:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2026/02/21/why-is-claude-an-electron-app</id><content type="html" xml:base="https://www.dbreunig.com/2026/02/21/why-is-claude-an-electron-app.html"><![CDATA[<h3 id="if-code-is-free-why-arent-all-apps-native">If code is free, why aren’t all apps native?</h3>

<p>The state of coding agents can be summed up by <a href="https://x.com/dbreunig/status/2024970389156495365?s=46">this fact</a></p>

<blockquote>
  <p>Claude spent $20k on an agent swarm implementing (kinda) a C-compiler in Rust, but desktop Claude is an Electron app.</p>
</blockquote>

<p>If you’re unfamiliar, Electron is a coding framework for building desktop applications using web tech, specifically HTML, CSS, and JS. What’s great about Electron is it allows you to  build one desktop app that supports Windows, Mac, and Linux. Plus it lets developers use existing web app code to get started. It’s great for teams big and small. <a href="https://en.wikipedia.org/wiki/List_of_software_using_Electron?wprov=sfti1">Many apps you probably use every day are built with Electron</a>: Slack, Discord, VS Code, Teams, Notion, and more.</p>

<p>There are downsides though. Electron apps are bloated; each runs its own Chromium engine. The minimum app size is usually a couple hundred megabytes. They are often laggy or unresponsive. They don’t integrate well with OS features.</p>

<p>(These last two issues <em>can</em> be addressed by smart development and OS-specific code, but they rarely are. The benefits of Electron (one codebase, many platforms, it’s just web!) don’t incentivize optimizations outside of HTML/JS/CSS land.)</p>

<p>But these downsides are dramatically outweighed by the ability to build and maintain one app, shipping it everywhere.</p>

<p>But now we have coding agents! <a href="https://www.dbreunig.com/2026/02/06/the-rise-of-spec-driven-development.html">And one thing coding agents are proving to be pretty good at is cross-platform, cross-language implementations given a well-defined spec and test suite</a>.</p>

<p>On the surface, this ability should render Electron’s benefits obsolete! Rather than write one web app and ship it to each platform, we should write <em>one spec and test suite</em> and use coding agents to ship <em>native</em> code to each platform. If this ability is real and adopted, users get snappy, performant, native apps from small, focused teams serving a broad market.</p>

<p>But we’re still leaning on Electron. Even Anthropic, one of the leaders in AI coding tools, who keeps publishing flashy agentic coding achievements, still uses Electron in the Claude desktop app. And it’s slow, buggy, and bloated app.</p>

<p><em>So why are we still using Electron and not embracing the agent-powered, spec driven development future?</em></p>

<p>For one thing, coding agents are <em>really</em> good at the first 90% of dev. But that last bit – nailing down all the edge cases and continuing support once it meets the real world – remains hard, tedious, and requires plenty of agent hand-holding.</p>

<p>Anthropic’s <a href="https://www.anthropic.com/engineering/building-c-compiler">Rust-base C compiler</a> slammed into this wall, after screaming through the bulk of the tests:</p>

<blockquote>
  <p>The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.</p>
</blockquote>

<p>The resulting compiler <em>is</em> impressive, given the time it took to deliver it and the number of people who worked on it, but it is largely unusable. That last mile is <em>hard</em>.</p>

<p>And this gets even worse once a program meets the real world. Messy, unexpected scenarios stack up and development never really ends. Agents make it easier, sure, but hard product decisions become challenged and require human decisions.</p>

<p>Further, with 3 different apps produced (Mac, Windows, and Linux) the surface area for bugs and support increases 3-fold. Sure, there are local quirks with Electron apps, but most of it is mitigated by the common wrapper. Not so with native!</p>

<p>A good test suite and spec <em>could</em> enable the Claude team to ship a Claude desktop app native to each platform. But the resulting overhead of that last 10% of dev and the increased support and maintenance burden will remain.</p>

<p>For now, Electron still makes sense. Coding agents are amazing. But the last mile of dev and the support surface area remains a real concern.</p>

<hr />

<p>Over at <a href="https://news.ycombinator.com">Hacker News</a>, Claude Code’s <a href="https://borischerny.com">Boris Cherney</a> <a href="https://news.ycombinator.com/item?id=47106368">chimes in</a>:</p>

<blockquote>
  <p>Boris from the Claude Code team here.</p>

  <p>Some of the engineers working on the app worked on Electron back in the day, so preferred building non-natively. It’s also a nice way to share code so we’re guaranteed that features across web and desktop have the same look and feel. Finally, Claude is great at it.</p>

  <p>That said, engineering is all about tradeoffs and this may change in the future!</p>
</blockquote>

<p>There we go: developer familiarity and simpler maintainability across multiple platforms is worth the “tradeoffs”. We have incredible coding agents that are great at transpilation, but there remain costs that outweigh the costs of shipping a non-native app.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="software development" /><category term="spec driven development" /><summary type="html"><![CDATA[If code is free, why aren’t all apps native?]]></summary></entry><entry><title type="html">How System Prompts Define Agent Behavior</title><link href="https://www.dbreunig.com/2026/02/10/system-prompts-define-the-agent-as-much-as-the-model.html" rel="alternate" type="text/html" title="How System Prompts Define Agent Behavior" /><published>2026-02-10T21:34:00-08:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2026/02/10/system-prompts-define-the-agent-as-much-as-the-model</id><content type="html" xml:base="https://www.dbreunig.com/2026/02/10/system-prompts-define-the-agent-as-much-as-the-model.html"><![CDATA[<p><em>This post was <a href="https://blog.nilenso.com/blog/2026/02/10/how-system-prompts-define-agent-behaviiour/">co-authored with Srihari Sriraman</a></em></p>

<p>Coding agents are fascinating to study. They help us build software in a new way, while themselves exemplifying a novel approach to architecting and implementing software. At their core is an AI model, but wrapped around it is a mix of code, tools, and prompts: the harness.</p>

<p>A critical part of this harness is the system prompt, the baseline instructions for the application. This context is present in every call to the model, no matter what skills, tools, or instructions are loaded. The system prompt is always present, defining a core set of behaviors, strategies, and tone.</p>

<p>Once you start analyzing agent design and behavior, a question emerges: how much does the system prompt actually determine an agent’s effectiveness? We take for granted that the model is the most important component of any agent, but how much can a system prompt contribute? Could a great system prompt paired with a mediocre model challenge a mediocre prompt paired with a frontier model?</p>

<p>To find out, we obtained and analyzed system prompts from six different coding agents. We clustered them semantically, comparing where their instructions diverged and where they converged. Then we swapped system prompts between agents and observed how behavior changed.</p>

<p>System prompts matter far more than most assume. A given model sets the theoretical ceiling of an agent’s performance, but the system prompt determines whether this peak is reached.</p>

<hr />

<h3 id="the-variety-of-system-prompts">The Variety of System Prompts</h3>

<p>To understand the range of system prompts, we looked at six CLI coding agents: Claude Code, Cursor, Gemini CLI, Codex CLI, OpenHands, and Kimi CLI. Each performs the same basic function: given a task they gather information, understands the code base, writes code, tracks their progress, and runs commands. But despite these similarities, the system prompts are <em>quite</em> different.</p>

<p><img src="/img/waffle_comparison.jpg" alt="Waffle chart comparisons of 6 coding agent system prompts" /></p>

<div class="callout callout--try-it-out">
  <span class="callout__label">Try It Out</span>
  <div class="callout__content">
    
<p><a href="https://nilenso.github.io/context-viewer/g/960d42ad-314c-44cf-8594-4b009ef528a1/comparison?sidebar=0&amp;panel=0&amp;sortBy=category&amp;sortDir=asc&amp;import=https://raw.githubusercontent.com/nilenso/long-prompts-analysis/refs/heads/main/context-viewer-exports/system-prompts-simpler.json">Explore the above figures interactively in <strong>context viewer</strong></a>.</p>

  </div>
</div>

<p>We’re analyzing <a href="https://github.com/asgeirtj/system_prompts_leaks">exfiltrated system prompts</a>, which we clean up and <a href="https://github.com/nilenso/long-prompts-analysis/tree/main/data/prompts/filtered">host here</a><sup id="fnref:exfiltrated"><a href="#fn:exfiltrated" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. Each of these is fed into <a href="https://github.com/nilenso/context-viewer">context-viewer</a>, a tool Srihari developed that chunks contexts in semantic components for exploration and analysis.</p>

<p>Looking at the above visualizations, there is plenty of variety. Claude, Codex, Gemini, and OpenHands roughly prioritize the same instructions, but vary their distributions. Further, prompts for Claude Code and OpenHands both are less than half the length of prompts in Codex and Gemini.</p>

<p>Cursor’s and Kimi’s prompts are dramatically different. Here we’re looking at Cursor’s prompt that’s paired with GPT-5 (<a href="https://www.adiasg.com/blog/comparing-cursors-prompts-across-models">Cursor uses slightly different prompts when hooked to different models</a>), and it spends over a third of its tokens on personality and steering instructions. Kimi CLI, meanwhile, contains zero workflow guidance, barely hints at personality instructions, and is the shortest prompt by far.</p>

<p>Given the similar interfaces of these apps, we’re left wondering: why are their system prompts so different?</p>

<p>There’s two main reasons the system prompts vary: <em>model calibration</em> and <em>user experience</em>.</p>

<p>Each model has its own quirks, rough edges, and baseline behaviors. If the goal is to produce a measured, helpful TUI coding assistant, each system prompt will have to deal with and adjust for unique aspects of the underlying model to achieve this goal. This <em>model calibration</em> reins in problematic behavior.</p>

<p>System prompts also vary because they specify slightly different <em>user experience</em>. Sure, they’re all text-only, terminal interfaces that explore and manipulate code. But some are more talkative, more autonomous, more direct, or require more detailed instructions. System prompts define this UX and, as we’ll see later, we can make a coding agent “feel” like a different agent just by swapping out the system prompt.</p>

<p>We can get a glimpse of these two functions together by looking at how a given system prompt changes over time, especially as new versions of models arrive. For example:</p>

<p><img src="/img/claude_over_time.jpg" alt="Claude's system prompt vascilates as new models are released, but trends steadily longer" /></p>

<div class="callout callout--try-it-out">
  <span class="callout__label">Try It Out</span>
  <div class="callout__content">
    
<p><a href="https://nilenso.github.io/context-viewer/g/b179a05f-2bd4-4012-83ab-42a0cb1e79fd/comparison?sidebar=0&amp;panel=0&amp;legend=compact&amp;sortBy=category&amp;sortDir=asc&amp;cols=5&amp;import=https://raw.githubusercontent.com/nilenso/long-prompts-analysis/refs/heads/main/context-viewer-exports/claude-prompt-evolution-export-simpler.json">Explore the above figures interactively in <strong>context viewer</strong></a>. Or, <a href="https://nilenso.github.io/context-viewer/g/56b68fb5-7221-4c04-807e-b590f138c1fe/comparison?sidebar=0&amp;panel=0&amp;view=tokens-absolute&amp;legend=compact&amp;sortBy=category&amp;sortDir=asc&amp;cols=10&amp;spr=4&amp;import=https://raw.githubusercontent.com/nilenso/long-prompts-analysis/refs/heads/main/context-viewer-exports/codex-prompt-evolution-export-only-codex.json">check out Codex’s system prompt evolution in similar detail</a>.</p>

  </div>
</div>

<p>Note how the system prompt isn’t stable, nor growing in a straight line. It bounces around a bit, as the Claude Code team tweaks the prompt to both adjust new behaviors and smooth over the quirks of new models. Though the trend is a march upward, as the coding agent matures.</p>

<p>If you want to dive further into Claude Code’s prompt history, Mario Zechner has <a href="https://cchistory.mariozechner.at">an excellent site</a> where he highlights the exact changes from version to version.</p>

<div class="callout callout--go-deeper">
  <span class="callout__label">Go Deeper</span>
  <div class="callout__content">
    
<p>Sometimes instructions are just…<em>weird</em>. Srihari <a href="https://blog.nilenso.com/blog/2026/02/12/weird-system-prompt-artefacts/">cataloged some of the odder instructions he found while exploring coding agent system prompts</a>.</p>

  </div>
</div>

<hr />

<h3 id="the-common-jobs-of-a-coding-agent-system-prompt">The Common Jobs of a Coding Agent System Prompt</h3>

<p>While these prompts vary from tool to tool, there are many commonalities that each prompt features. There is clear evidence that these teams are <a href="https://www.dbreunig.com/2025/11/11/don-t-fight-the-weights.html">fighting the weights</a>: they use repeated instructions, all-caps admonishments, and stern warnings to adjust common behaviors. This shared effort suggests common patterns in their training datasets, which each has to mitigate.</p>

<p>For example, there are <em>many</em> notes about how these agents should use comments in their code. Cursor specifies that the model should, “not add comments for trivial or obvious code.” Claude states there should be no added comments, “unless the user asks you to.” Codex takes the same stance. Gemini instructions the model to, “Add code comments sparingly… NEVER talk to the user through comments.”</p>

<p>These consistent, repeated instructions are warranted. They fight against examples of conversation in code comments, present in countless codebases and Github repo. This behavior goes deep: we’ve even seen that Opus 4.5 will <a href="https://x.com/aidenybai/status/1993901129210712129">reason in code comments if you turn off thinking</a>.</p>

<p>System prompts also repeatedly specify that tool calls should be parallel whenever possible. Claude should, “maximize use of parallel tool calls where possible.” Cursor is sternly told, “CRITICAL INSTRUCTION: involve all relevant tools concurrently… DEFAULT TO PARALLEL.” Kimi adopts all-caps as well, stating, “you are HIGHLY RECOMMENDED to make [tool calls] in parallel.”</p>

<p>This likley reflects the face that most post-training reasoning and agentic examples are <em>serial</em> in nature. This is perhaps easier to debug and a bit of delay when synthesizing these datasets isn’t a hinderence. However, in real world situations, users certainly appreciate the speed, so system prompts need to override this training.</p>

<p>Both of these examples of <em>fighting the weights</em> demonstrate how system prompts are used to smooth over the quirks of each model (which they pick up during training) and improve the user experience in an agentic coding application.</p>

<p>Much of what these prompts specify is shared; common adjustments, common desired behaviors, and common UX. But their differences notably affect application behavior.</p>

<div class="callout callout--go-deeper">
  <span class="callout__label">Go Deeper</span>
  <div class="callout__content">
    
<p>Srihari looked at more examples of fighting the weights to understand <a href="https://blog.nilenso.com/blog/2026/02/12/how-system-prompts-reveal-model-biases/">how system prompts reveal model biases</a>.</p>

  </div>
</div>

<hr />

<h3 id="do-the-prompts-change-the-agent">Do the Prompts Change the Agent?</h3>

<p>Helpfully, <a href="https://opencode.ai">OpenCode</a> <a href="https://opencode.ai/docs/modes/#prompt">allows users to specify custom system prompts</a>. With this feature, we can drop in prompts from Kimi, Gemini, Codex and more, removing and swapping instructions to measure their contribution.</p>

<p>We gave SWE-Bench Pro test questions to two applications: two agents running the OpenCode harness, calling Opus 4.5, but with one one using the Claude Code system prompt and the other armed with Codex’s instructions.</p>

<p>Time and time again, the agent workflows diverged immediately. For example:</p>

<p><img src="/img/prompt_swap.jpg" alt="Claude's system prompt defines a more iterative agent" /></p>

<p>The Codex prompt produced a methodical, documentation-first approach: understand fully, then implement once. The Claude prompt produced an iterative approach: try something, see what breaks, fix it.</p>

<p>This pattern remains consistent over many SWE Bench problems. If we average the contexts for each model and system prompt pair, we get the following:</p>

<p><img src="/img/swe_bench_prompts.jpg" alt="Swapping system prompts yielded different behavior for each model" /></p>

<div class="callout callout--try-it-out">
  <span class="callout__label">Try It Out</span>
  <div class="callout__content">
    
<p><a href="https://nilenso.github.io/context-viewer/g/67175678-6244-45bc-b022-238b72f8e646/comparison?sidebar=0&amp;panel=0&amp;legend=compact&amp;sortBy=category&amp;sortDir=asc&amp;cols=5&amp;import=https://raw.githubusercontent.com/nilenso/long-prompts-analysis/refs/heads/main/context-viewer-exports/swapping-prompts-swe-tasks.json">Explore the above figures interactively in <strong>context viewer</strong></a>.</p>

  </div>
</div>

<p>All prompt-model combinations correctly answered this subset of SWE Bench Pro questions. But <em>how</em> they suceeded was rather different. The system prompts shaped the workflows.</p>

<div class="callout callout--go-deeper">
  <span class="callout__label">Go Deeper</span>
  <div class="callout__content">
    
<p>Srihari explored <a href="https://blog.nilenso.com/blog/2026/02/12/codex-cli-vs-claude-code-on-autonomy/">Codex CLI and Claude Code autonomy</a>, and how the system prompt may shape their behavior.</p>

  </div>
</div>

<hr />

<h3 id="system-prompts-deserve-more-attention">System Prompts Deserve More Attention</h3>

<p>Last week, when Opus 4.6 and Codex 5.3 landed, people began putting them through the paces, trying to decide which would be their daily driver. Many tout the capabilities of one option over another, but just as often are complaints about approach, tone, or other discretionary choices. Further, it seems every week brings discussion of a new coding harness, especially for managing swarms of agents.</p>

<p>There is markedly less discussion about the system prompts that define the behaviors of these agents<sup id="fnref:foursix"><a href="#fn:foursix" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>. System prompts define the UX and smooth over the rough edges of models. They’re given to the model with <em>every</em> instruction, yet we prefer to talk Opus vs. GPT-5.3 or Gastown vs. Pi.</p>

<p>Context engineering starts with the system prompt.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:exfiltrated">
      <p>Exfiltrated system prompts represent versions of the system prompt for a given session. It’s not 100% canonical, as many AI harnesses assemble system prompts from multiple snippets, given the task at hand. But given the consistent manner with which we can extrac these prompts, and comparing them with <a href="https://platform.claude.com/docs/en/release-notes/system-prompts">public</a> <a href="https://github.com/openai/codex/blob/d452bb3ae5b5e0f715bba3a44d7d30a51b5f28ae/codex-rs/core/prompt.md">examples</a>, we feel they are sufficiently representative for this analysis. <a href="#fnref:exfiltrated" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:foursix">
      <p>Though you can use Mario’s <a href="https://cchistory.mariozechner.at/?from=2.1.31&amp;to=2.1.34">system prompt diff tool to explore the changes accompanying Opus 4.6’s release</a>. <a href="#fnref:foursix" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="AI" /><category term="Agents" /><category term="System Prompts" /><category term="Context" /><category term="Collaboration" /><category term="Context Engineering" /><summary type="html"><![CDATA[System prompts matter far more than most assume. A given model sets the theoretical ceiling of an agent’s performance, but the system prompt determines whether this peak is reached.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/waffle_comparison.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/waffle_comparison.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The Potential of RLMs</title><link href="https://www.dbreunig.com/2026/02/09/the-potential-of-rlms.html" rel="alternate" type="text/html" title="The Potential of RLMs" /><published>2026-02-09T09:42:00-08:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2026/02/09/the-potential-of-rlms</id><content type="html" xml:base="https://www.dbreunig.com/2026/02/09/the-potential-of-rlms.html"><![CDATA[<h3 id="handling-your-long-context-today--designing-your-agent-tomorrow">Handling Your Long Context Today &amp; Designing Your Agent Tomorrow</h3>

<p><img src="/img/thumb_index.jpg" alt="" /></p>

<h3 id="context-rot-is-the-worst-context-failure">Context Rot is the Worst Context Failure</h3>

<p>“Context Rot” is a common problem agent designers must avoid and mitigate.</p>

<p>The <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf">Gemini 2.5 paper</a> was one of the first technical reports that flagged the issue, noting that the performance of their <a href="https://www.dbreunig.com/2025/06/17/an-agentic-case-study-playing-pok%C3%A9mon-with-gemini.html">Pokémon-playing harness</a> rapidly deteriorated as the context grew beyond 100,000 tokens; a figure <em>far</em> below Gemini 2.5’s 1 million input token limit. We covered this in our <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html#context-distraction">context failures piece</a>, but <a href="https://research.trychroma.com/context-rot">the Chroma team published the canonical exploration</a> of the effect, dubbing it <strong>context rot</strong>.</p>

<p><img src="/img/context_rot.jpg" alt="A figure from the Chroma post showing performance declining as the input length increases" /></p>

<p>A key takeaway from Gemini’s Pokémon troubles and the Chroma post is that context rot is not a <em>capacity problem</em>. It’s a <em>quality problem</em>. As the context grows beyond a model’s soft limit, the model continues to issue output as its accuracy declines. This makes for a pernicious problem, one that sneaks up on us the longer we run agents.</p>

<p>Of all the <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html">context fails</a>, context rot is the worst.</p>

<hr />

<h3 id="enter-recursive-language-models">Enter Recursive Language Models</h3>

<p>Defined by Alex Zhang and Omar Khattab, <a href="https://alexzhang13.github.io/blog/2025/rlm/">Recursive Language Models</a> (or RLMs) are a simple idea:</p>

<ol>
  <li>Load long context into a <a href="https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop">REPL</a> environment<sup id="fnref:repl"><a href="#fn:repl" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>, stored as variables.</li>
  <li>Allow an LLM to use the REPL environment to explore and analyze the context.</li>
  <li>Provide a function in the REPL to trigger a sub-LLM call.</li>
</ol>

<p>That’s it. That’s an RLM. The LLM will use the REPL to filter, chunk, and sample the long context as needed to complete its task. It will use the sub-LLM function to task new LLM instances to explore, analyze, or validate the context. Eventually, the sum of the LLM’s findings will be synthesized into a final answer.</p>

<p>With this setup, the long context(s) can be <em>really</em> long. I’ve given RLMs logfiles more than 400 megabytes in size, with no issues. In the original <a href="https://alexzhang13.github.io/blog/2025/rlm/">RLM post</a>, Alex reports that performance doesn’t degrade when &gt;10 million tokens are provided.</p>

<p><img src="/img/rlm_plot.jpg" alt="Alex Zhang's RLM paper shows RLM significantly mitigates the context rot decline." /></p>

<p>Note the orange lines on the right: as the context length increases, performance very slowly degrades, hovering around 50-60%. Compare this to the non-RLM results (with the same GPT-5 model), which dramatically decline until failing entirely at 262,000 tokens.</p>

<hr />

<h3 id="rlms-work-by-turning-long-context-problems-into-coding--reasoning-problem">RLMs Work By Turning Long Context Problems Into Coding &amp; Reasoning Problem</h3>

<p>The key attribute of RLMs is that they maintain two distinct pools of context: <em>tokenized context</em> (which fills the LLM’s context window) and <em>programmatic context</em> (information that exists in the coding environment). By giving the LLM access to the REPL, where the <em>programmatic context</em> is managed, the LLM controls what moves from <em>programmatic</em> space to <em>token</em> space.</p>

<p>And it turns out modern LLMs are quite good at this!</p>

<p>Let’s look at an example.</p>

<p>Here I’ve given Kimi K2 a very large <a href="https://huggingface.co/datasets/andyyang/stable_diffusion_prompts_2m">dataset of Stable Diffusion prompts</a> (prompts people provided to generate images). I then ask the RLM to identify the most common celebrities used in these prompts (and of course, I’m <a href="https://www.cmpnd.ai/blog/rlms-in-dspy.html">using RLM in DSPy</a>). If you’re curious, <a href="https://gist.github.com/dbreunig/afdd86cb560847f54359dcc3ee233766">here’s the code</a>.</p>

<p>I give the RLM a budget of 5 iterations to accomplish the task. Below, you can swipe/page through each iteration, which shows the LLM’s reasoning and the code it executed in the REPL. There’s a few things to keep in mind as you read through:</p>

<ul>
  <li>Every time the LLM calls <code class="language-plaintext highlighter-rouge">print</code> in the REPL, it’s bringing new context into the <em>token</em> space. (I’ve omitted this output for brevity)</li>
  <li>When the LLM calls <code class="language-plaintext highlighter-rouge">llm_query</code> (highlighted in blue) in the REPL, it’s tasking another LLM instance with a sub-call. It stores the result of this function as a variable, usually.</li>
  <li>On the last iteration, the LLM calls a special function <code class="language-plaintext highlighter-rouge">SUBMIT</code>, which indicates it has finished with the task.</li>
</ul>

<p>Click through and read, it really illuminates how a RLM works:</p>

<!--
  RLM Log Visualizer — inline blog component
  Hardwired with celebrities example data.
  Swipe / arrow-key / click to advance through iterations.
-->
<div id="rlm-viz-root">
  <style>
    #rlm-viz-root {
      --rlm-font: charter, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue", sans-serif;
      --rlm-heading-font: charter, "Helvetica Neue", Helvetica, Arial, sans-serif;
      --rlm-mono: monospace;
      --rlm-text: #333;
      --rlm-muted: #777;
      --rlm-light: #999;
      --rlm-primary: #6face8;
      --rlm-primary-light: rgba(111, 172, 232, 0.2);
      --rlm-callout-bg: #ebf5fb;
      --rlm-border: lightgray;
      --rlm-bg: #fff;
      font-family: var(--rlm-font);
      font-size: 1.1rem;
      line-height: 1.5;
      color: var(--rlm-text);
      position: relative;
      margin: 2em 0;
      -webkit-font-smoothing: antialiased;
      -moz-osx-font-smoothing: grayscale;
    }

    /* ── Container ── */
    #rlm-viz-root .rlm-container {
      border: 1px solid var(--rlm-border);
      overflow: hidden;
      background: var(--rlm-bg);
    }

    /* ── Navigation bar (top) ── */
    #rlm-viz-root .rlm-nav-bar {
      display: flex;
      align-items: center;
      justify-content: space-between;
      padding: 0.6em 1em;
      border-bottom: 1px solid var(--rlm-border);
      background: var(--rlm-callout-bg);
    }

    #rlm-viz-root .rlm-nav-btn {
      display: inline-flex;
      align-items: center;
      gap: 0.3em;
      width: auto;
      padding: 0.5em 1.2em;
      border: 1px solid var(--rlm-border);
      border-radius: 4px;
      background: var(--rlm-bg);
      color: var(--rlm-text);
      font-family: var(--rlm-font);
      font-size: 14px;
      cursor: pointer;
      transition: background 0.15s;
      user-select: none;
      -webkit-user-select: none;
      flex: none;
    }

    #rlm-viz-root .rlm-nav-btn:hover:not(:disabled) {
      background: var(--rlm-primary-light);
    }

    #rlm-viz-root .rlm-nav-btn:disabled {
      opacity: 0.3;
      cursor: default;
    }

    #rlm-viz-root .rlm-nav-center {
      display: flex;
      align-items: center;
      gap: 8px;
    }

    #rlm-viz-root .rlm-dot {
      width: 8px;
      height: 8px;
      border-radius: 50%;
      background: #ccc;
      cursor: pointer;
      transition: background 0.2s, transform 0.2s;
      flex-shrink: 0;
    }

    #rlm-viz-root .rlm-dot:hover {
      background: #aaa;
      transform: scale(1.3);
    }

    #rlm-viz-root .rlm-dot.active {
      background: var(--rlm-primary);
      transform: scale(1.3);
    }

    #rlm-viz-root .rlm-dot.visited {
      background: rgba(111, 172, 232, 0.5);
    }

    /* ── Content area ── */
    #rlm-viz-root .rlm-viewport {
      position: relative;
      overflow: hidden;
      min-height: 180px;
    }

    #rlm-viz-root .rlm-slide {
      padding: 1.5em;
      opacity: 1;
      transition: opacity 0.2s ease;
    }

    #rlm-viz-root .rlm-slide.fading {
      opacity: 0;
    }

    /* ── Iteration label ── */
    #rlm-viz-root .rlm-iter-label {
      font-size: 14px;
      color: var(--rlm-muted);
      margin-bottom: 1em;
      display: flex;
      align-items: baseline;
      gap: 0.6em;
    }

    #rlm-viz-root .rlm-iter-label .rlm-iter-num {
      font-weight: 600;
      font-family: var(--rlm-heading-font);
    }

    #rlm-viz-root .rlm-iter-label .rlm-ts {
      font-family: var(--rlm-mono);
      font-size: 11px;
      color: var(--rlm-light);
    }

    /* ── Reasoning ── */
    #rlm-viz-root .rlm-reasoning {
      margin-bottom: 1.5em;
    }

    #rlm-viz-root .rlm-reasoning p {
      font-size: 16px;
      line-height: 1.4em;
      color: var(--rlm-text);
      margin: 0 0 1em 0;
    }

    #rlm-viz-root .rlm-reasoning p:last-child {
      margin-bottom: 0;
    }

    #rlm-viz-root .rlm-reasoning ul,
    #rlm-viz-root .rlm-reasoning ol {
      margin: 1em 0;
    }

    #rlm-viz-root .rlm-reasoning ul li {
      list-style-type: disc;
      list-style-position: outside;
      margin-bottom: 0.4em;
      margin-left: 2em;
      font-size: 16px;
      line-height: 1.4em;
    }

    #rlm-viz-root .rlm-reasoning ol li {
      list-style-type: decimal;
      list-style-position: outside;
      margin-bottom: 0.4em;
      margin-left: 2em;
      font-size: 16px;
      line-height: 1.4em;
    }

    #rlm-viz-root .rlm-reasoning li strong {
      font-weight: bold;
    }

    #rlm-viz-root .rlm-reasoning code {
      font-family: var(--rlm-mono);
      font-size: 0.8em;
      display: inline;
      padding: 0.2em 0.4em;
      background-color: var(--rlm-primary-light);
      border-radius: 0.2em;
      word-wrap: break-word;
    }

    #rlm-viz-root .rlm-reasoning strong {
      font-weight: 700;
    }

    #rlm-viz-root .rlm-reasoning em {
      font-style: italic;
    }

    /* ── Code block ── */
    #rlm-viz-root .rlm-code {
      font-size: 0.8em;
      display: block;
      white-space: pre;
      word-wrap: break-word;
      padding: 1rem;
      border-radius: 0.2rem;
      background-color: var(--rlm-primary-light);
      overflow-x: auto;
      font-family: var(--rlm-mono);
      line-height: 1.5;
      color: var(--rlm-text);
    }

    /* ── Footer ── */
    #rlm-viz-root .rlm-footer {
      padding: 0.5em 1em;
      border-top: 1px solid var(--rlm-border);
      font-size: 11px;
      color: var(--rlm-light);
      text-align: center;
    }

    /* ── Syntax highlighting — matches blog Pygments theme ── */
    #rlm-viz-root .py-comment { color: #408080; font-style: italic; }
    #rlm-viz-root .py-string { color: #BA2121; }
    #rlm-viz-root .py-keyword { color: #008000; font-weight: bold; }
    #rlm-viz-root .py-builtin { color: #008000; }
    #rlm-viz-root .py-special { color: #0000FF; font-weight: bold; }
    #rlm-viz-root .py-number { color: #666666; }
    #rlm-viz-root .py-decorator { color: #AA22FF; }

    /* ── Touch / swipe ── */
    #rlm-viz-root .rlm-viewport { touch-action: pan-y; }

    /* ── Responsive ── */
    @media (max-width: 900px) {
      #rlm-viz-root .rlm-slide {
        padding: 1em;
      }
    }
  </style>

  <div class="rlm-container" tabindex="0">
    <div class="rlm-nav-bar">
      <button class="rlm-nav-btn" id="rlm-prev" disabled="">
        &larr; Prev
      </button>
      <div class="rlm-nav-center" id="rlm-progress"></div>
      <button class="rlm-nav-btn" id="rlm-next">
        Next &rarr;
      </button>
    </div>
    <div class="rlm-viewport" id="rlm-viewport">
      <div class="rlm-slide" id="rlm-slide"></div>
    </div>
    <div class="rlm-footer">swipe or use arrow keys to navigate</div>
  </div>

  <script src="https://cdn.jsdelivr.net/npm/marked@12.0.0/marked.min.js"></script>
  <script>
  (function() {
    'use strict';

    // ── Hardwired data ──
    var DATA = {"turns":[{"iteration":1,"timestamp":"11:58:47","reasoning":"I need to extract the top topics related to \"celebrities and famous people\" from a large list of image generation prompts. The data is quite large (32M characters), so I should start by exploring a small sample to understand the structure and content.\n\nFirst, let me examine a small portion of the prompts to see what kind of celebrity/famous person mentions might be in there, then I can develop a strategy for extracting and counting them efficiently.","code":"# Let's start by examining a sample of the prompts to understand the data\nsample_size = 10000  # Start with a small sample\nsample_prompts = image_gen_prompts[:sample_size]\n\nprint(f\"Sample size: {len(sample_prompts)} characters\")\nprint(\"\\nFirst 1000 characters of sample:\")\nprint(sample_prompts[:1000])\nprint(\"\\n\" + \"=\"*50)\n\n# Let's also check what the category is asking for\nprint(f\"Category: {category}\")"},{"iteration":2,"timestamp":"11:58:54","reasoning":"I can see from the sample that there are indeed celebrity mentions in the prompts - I spotted \"Jeon Jungkook\" (BTS member), \"Scarlett Johannsen\", and \"kid rock\". The category is asking for \"celebrities and famous people\". \n\nSince the full dataset is 32M characters, I need an efficient approach. I should:\n1. Split the data into manageable chunks\n2. Use the LLM to identify celebrity mentions in each chunk\n3. Aggregate and count the results\n4. Return the top topics\n\nLet me start by processing a larger sample to see the pattern of celebrity mentions, then develop a systematic approach.","code":"# Let's examine more of the data to understand the pattern of celebrity mentions\nlarger_sample = image_gen_prompts[:100000]  # 100k characters\nprompts_list = larger_sample.split('\\n')\n\nprint(f\"Total prompts in sample: {len(prompts_list)}\")\nprint(\"\\nFirst 20 prompts:\")\nfor i, prompt in enumerate(prompts_list[:20]):\n    print(f\"{i+1}: {prompt}\")\n\n# Let's use LLM to identify celebrity mentions in this sample\ncelebrity_extraction_prompt = f\"\"\"\nFrom the following list of image generation prompts, identify all mentions of celebrities and famous people.\nReturn only the celebrity names, one per line, in lowercase.\n\nExample:\nscarlett johansson\nkid rock\njeon jungkook\n\nPrompts:\n{larger_sample[:5000]}  # Just use first 5000 chars for this test\n\nCelebrity names:\n\"\"\"\n\ncelebrities_found = llm_query(celebrity_extraction_prompt)\nprint(f\"\\nCelebrities found in sample:\")\nprint(celebrities_found)"},{"iteration":3,"timestamp":"11:59:00","reasoning":"Great! The LLM successfully identified celebrity mentions from the sample. I can see it found several celebrities including \"jeon jungkook\", \"scarlett johannsen\", \"kid rock\", \"donald trump\", etc. The approach is working.\n\nNow I need to scale this up to process the entire 32M character dataset efficiently. Since I have a limited number of LLM calls (max 50), I should:\n\n1. Process the data in larger chunks to minimize the number of calls\n2. Use the LLM to extract celebrities from each chunk\n3. Aggregate all the results\n4. Count the frequency of each celebrity\n5. Return the top topics\n\nLet me process the full dataset in manageable chunks. I'll aim for chunks of about 500k-1M characters each to balance between coverage and LLM call efficiency.","code":"# Process the full dataset in chunks\nchunk_size = 800000  # ~800k characters per chunk\ntotal_chars = len(image_gen_prompts)\nnum_chunks = (total_chars // chunk_size) + 1\n\nprint(f\"Total dataset size: {total_chars:,} characters\")\nprint(f\"Processing in {num_chunks} chunks of ~{chunk_size:,} characters each\")\n\nall_celebrities = []\n\n# Process each chunk\nfor i in range(0, total_chars, chunk_size):\n    chunk = image_gen_prompts[i:i+chunk_size]\n    chunk_num = (i // chunk_size) + 1\n    \n    print(f\"Processing chunk {chunk_num}/{num_chunks}...\")\n    \n    celebrity_extraction_prompt = f\"\"\"\n    From the following list of image generation prompts, identify all mentions of celebrities and famous people.\n    Return only the celebrity names, one per line, in lowercase.\n    \n    Example:\n    scarlett johansson\n    kid rock\n    jeon jungkook\n    \n    Prompts (chunk {chunk_num} of {num_chunks}):\n    {chunk[:min(len(chunk), 40000)]}  # Limit to first 40k chars of chunk to stay within LLM limits\n    \n    Celebrity names:\n    \"\"\"\n    \n    celebrities_in_chunk = llm_query(celebrity_extraction_prompt)\n    \n    if celebrities_in_chunk.strip():\n        # Parse the response - split by lines and clean up\n        celebrities_list = [name.strip().lower() for name in celebrities_in_chunk.strip().split('\\n') if name.strip()]\n        all_celebrities.extend(celebrities_list)\n        print(f\"Found {len(celebrities_list)} celebrities in this chunk\")\n    else:\n        print(\"No celebrities found in this chunk\")\n\nprint(f\"\\nTotal celebrities collected: {len(all_celebrities)}\")\nprint(\"Sample of found celebrities:\")\nprint(all_celebrities[:20])"},{"iteration":4,"timestamp":"12:00:08","reasoning":"Great! I've successfully processed all 41 chunks of the dataset and collected 1,128 celebrity mentions. Now I need to:\n\n1. Count the frequency of each celebrity to find the most popular ones\n2. Handle potential duplicates (like \"donald trump\" vs \"trump\", \"scarlett johansson\" vs \"scarlett johannsen\")\n3. Extract the top topics based on frequency\n4. Submit the final results\n\nLet me count the frequencies and get the top celebrities.","code":"from collections import Counter\n\n# Count frequencies of all celebrities\ncelebrity_counts = Counter(all_celebrities)\n\nprint(f\"Total unique celebrities found: {len(celebrity_counts)}\")\nprint(f\"Total celebrity mentions: {sum(celebrity_counts.values())}\")\n\n# Get the top celebrities\ntop_celebrities = celebrity_counts.most_common(50)\n\nprint(\"\\nTop 50 celebrities by frequency:\")\nfor i, (celebrity, count) in enumerate(top_celebrities, 1):\n    print(f\"{i:2d}. {celebrity}: {count}\")\n\n# Extract just the names for the final output\ntop_topics = [celebrity for celebrity, count in top_celebrities]\n\nprint(f\"\\nTop {len(top_topics)} topics extracted:\")\nprint(top_topics)"},{"iteration":5,"timestamp":"12:00:13","reasoning":"Perfect! I've successfully extracted and counted celebrity mentions from the entire 32M character dataset. I found 698 unique celebrities with a total of 1,128 mentions. The top celebrities include Emma Watson (15 mentions), Madison Beer (14 mentions), and Jama Jurabaev (13 mentions).\n\nI notice there are some potential duplicates in the list (like \"obama\" and \"barack obama\" both appearing), but for the purpose of this task, I'll return the raw results as they were extracted. The top_topics list contains 50 celebrities ranked by frequency, which should be sufficient for the output requirement.\n\nThe extraction was successful across all 41 chunks of data, and I have a good distribution of different types of celebrities - actors, politicians, artists, and other public figures. I'm ready to submit the final results.","code":"SUBMIT(top_topics)"}]};

    let turns = DATA.turns;
    let currentIndex = 0;

    const root = document.getElementById('rlm-viz-root');
    const container = root.querySelector('.rlm-container');
    const progress = document.getElementById('rlm-progress');
    const viewport = document.getElementById('rlm-viewport');
    const slide = document.getElementById('rlm-slide');
    const prevBtn = document.getElementById('rlm-prev');
    const nextBtn = document.getElementById('rlm-next');

    // ── Python syntax highlighting (Pygments-compatible colors) ──
    function highlightPython(code) {
      const el = document.createElement('div');
      let rest = code;
      const patterns = [
        { re: /^(f?"""[\s\S]*?""")/, cls: 'py-string' },
        { re: /^(f?'''[\s\S]*?''')/, cls: 'py-string' },
        { re: /^(#[^\n]*)/, cls: 'py-comment' },
        { re: /^(f"(?:[^"\\]|\\.)*")/, cls: 'py-string' },
        { re: /^(f'(?:[^'\\]|\\.)*')/, cls: 'py-string' },
        { re: /^("(?:[^"\\]|\\.)*")/, cls: 'py-string' },
        { re: /^('(?:[^'\\]|\\.)*')/, cls: 'py-string' },
        { re: /^(and|as|assert|async|await|break|class|continue|def|del|elif|else|except|finally|for|from|global|if|import|in|is|lambda|nonlocal|not|or|pass|raise|return|try|while|with|yield|True|False|None)\b/, cls: 'py-keyword' },
        { re: /^(print|len|range|str|int|float|list|dict|set|tuple|open|input|type|isinstance|hasattr|getattr|setattr|enumerate|zip|map|filter|sorted|reversed|sum|min|max|abs|round|any|all|format)\b/, cls: 'py-builtin' },
        { re: /^(llm_query|SUBMIT)\b/, cls: 'py-special' },
        { re: /^(\d+\.?\d*)/, cls: 'py-number' },
        { re: /^(@\w+)/, cls: 'py-decorator' },
      ];
      while (rest.length > 0) {
        let matched = false;
        for (const { re, cls } of patterns) {
          const m = rest.match(re);
          if (m) {
            const span = document.createElement('span');
            span.className = cls;
            span.textContent = m[0];
            el.appendChild(span);
            rest = rest.slice(m[0].length);
            matched = true;
            break;
          }
        }
        if (!matched) {
          el.appendChild(document.createTextNode(rest[0]));
          rest = rest.slice(1);
        }
      }
      return el;
    }

    // ── Render a single iteration ──
    function renderTurn(turn) {
      let html = '';

      // Iteration label
      html += `<div class="rlm-iter-label">
        <span class="rlm-iter-num">Iteration ${turn.iteration}</span>
        ${turn.timestamp ? `<span class="rlm-ts">${turn.timestamp}</span>` : ''}
      </div>`;

      // Reasoning (as markdown)
      if (turn.reasoning) {
        html += `<div class="rlm-reasoning">${marked.parse(turn.reasoning)}</div>`;
      }

      // Code
      html += `<div class="rlm-code" id="rlm-code-block"></div>`;

      slide.innerHTML = html;

      // Inject syntax-highlighted code via DOM
      const codeBlock = slide.querySelector('#rlm-code-block');
      if (codeBlock && turn.code) {
        codeBlock.appendChild(highlightPython(turn.code));
      } else if (codeBlock) {
        codeBlock.textContent = '# No code';
      }
    }

    // ── Navigation ──
    function goTo(index) {
      if (index < 0 || index >= turns.length) return;
      slide.classList.add('fading');
      setTimeout(() => {
        currentIndex = index;
        renderTurn(turns[currentIndex]);
        updateControls();
        slide.classList.remove('fading');
      }, 120);
    }

    function updateControls() {
      prevBtn.disabled = currentIndex === 0;
      nextBtn.disabled = currentIndex === turns.length - 1;

      const dots = progress.querySelectorAll('.rlm-dot');
      dots.forEach((dot, i) => {
        dot.classList.toggle('active', i === currentIndex);
        dot.classList.toggle('visited', i < currentIndex);
      });
    }

    function buildDots() {
      progress.innerHTML = '';
      turns.forEach((_, i) => {
        const dot = document.createElement('span');
        dot.className = 'rlm-dot';
        dot.addEventListener('click', () => goTo(i));
        progress.appendChild(dot);
      });
    }

    // ── Button clicks ──
    prevBtn.addEventListener('click', () => goTo(currentIndex - 1));
    nextBtn.addEventListener('click', () => goTo(currentIndex + 1));

    // ── Keyboard ──
    container.addEventListener('keydown', function(e) {
      if (e.key === 'ArrowLeft' || e.key === 'ArrowUp') {
        e.preventDefault();
        goTo(currentIndex - 1);
      } else if (e.key === 'ArrowRight' || e.key === 'ArrowDown') {
        e.preventDefault();
        goTo(currentIndex + 1);
      }
    });

    // ── Swipe support ──
    let touchStartX = 0;
    let touchStartY = 0;

    viewport.addEventListener('touchstart', function(e) {
      touchStartX = e.changedTouches[0].clientX;
      touchStartY = e.changedTouches[0].clientY;
    }, { passive: true });

    viewport.addEventListener('touchend', function(e) {
      const dx = e.changedTouches[0].clientX - touchStartX;
      const dy = e.changedTouches[0].clientY - touchStartY;
      if (Math.abs(dx) > 50 && Math.abs(dx) > Math.abs(dy) * 1.5) {
        if (dx < 0) goTo(currentIndex + 1);
        else goTo(currentIndex - 1);
      }
    }, { passive: true });

    // ── Initialize ──
    buildDots();
    renderTurn(turns[0]);
    updateControls();
  })();
  </script>
</div>

<p>We can clearly see the LLM exploring and sampling the context, planning an approach, testing the approach, scaling the approach, then finally synthesizing its findings into a final answer. (In this case, it was correct!)</p>

<p>The context I gave this RLM – the collection of Stable Diffusion prompts – exceeds the maximum context window of any LLM. It would fail before it started, whereas a DSPy RLM harness around Kimi K2 took only a couple minutes.</p>

<p>It’s incredible, but with this example we can identify a couple limitations of RLMs.</p>

<p>First, it’s relatively slow. Answering this question took over a dozen LLM calls and several minutes. And we were using <a href="https://groq.com/blog/introducing-kimi-k2-0905-on-groqcloud">Kimi K2 on Groq</a>. Try this with GPT-5.3 or Opus 4.6 and you’ll be waiting around even longer.</p>

<p>Second, as you read through the reasoning and code in the example above it becomes apparent that you need strong models to drive RLMs. <a href="https://huggingface.co/Qwen/Qwen3-30B-A3B">Qwen3-30B-A3B</a> couldn’t complete this task. It got confused, lost track of progress, and ended up running out of budget before submitting an answer<sup id="fnref:qwen"><a href="#fn:qwen" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>.</p>

<p>This brings us to the second reason RLMs work so well (in addition to maintaining the two <em>token</em> and <em>programmatic</em> context pools): RLMs exploit the coding reasoning gains of the last +18 months.</p>

<p>We’ve covered before <a href="https://www.dbreunig.com/2024/12/18/synthetic-data-the-growing-ai-perception-divide.html">how LLMs are getting better at verifiable tasks because it’s relatively easy to synthesize data and evaluate verifiable tasks</a>, like math and coding. We’ve spent many billions of dollars post-training coding skills into frontier models. RLMs wrap long contexts in a coding environment so they’re addressable by the LLM’s incredible coding abilities, <em>turning context rot into a coding problem.</em></p>

<p>Even better, RLMs get to use the REPL not just as a tool for exploring and managing long contexts, but also as a deterministic scratchpad. This proves to be a killer resource for many tasks. You occasionally see this benefit in action in ChatGPT or Claude, when the LLM will fire up a Python script to answer a question<sup id="fnref:python"><a href="#fn:python" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>. This hybrid capability of RLMs – the ability to use probabilistic, fuzzy LLM logic for some challenges and deterministic code for others – will likely become a stronger attribute as RLM harnesses mature and models are fine-tuned.</p>

<hr />

<h3 id="the-potential-of-rlms-agent-discovery-mechanisms">The Potential of RLMs: Agent Discovery Mechanisms</h3>

<p>The ability of RLMs to mitigate the effects of context rot are really incredible. However, this isn’t the potential that excites me most. What excites me about RLMs is their ability to explore, develop, and test approaches to solving a problem.</p>

<p>If you start experimenting with RLMs (and I <em>strongly</em> suggest you should), be sure to continually review your traces. Set <code class="language-plaintext highlighter-rouge">verbose</code> to <code class="language-plaintext highlighter-rouge">true</code> and/or wire up DSPy to <a href="https://mlflow.org/">MLFlow</a>. As you watch these models explore the context and try out different approaches (taking your iteration budget into consideration<sup id="fnref:kimi"><a href="#fn:kimi" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>), you’ll notice repeating patterns. In the example above, if I asked the RLM to find the top celebrities, aesthetic styles, or vehicles requested in the image generation prompts, it would repeatedly deploy similar tactics to situate itself and complete the task.</p>

<p><em>There is no reason we can’t identify these repeating patterns, decompose them, and optimize them.</em></p>

<p>This is what excites me about RLMs: if you run them on the same task several times, you’re generating emergent agent designs. These traces can then be used to explicitly define an agent, with higher reliability and lower latency. RLM passes discover the best approach to the problem, which we can then optimize.</p>

<hr />

<h3 id="the-limitations-of-rlms">The Limitations of RLMs</h3>

<p>But if that’s the potential, how should you use RLMs today? In the last couple months I’ve seen teams use them for very large context scenarios, from general coding tasks across massive codebases to research and exploration across massive datasets.</p>

<p>At the moment, using RLMs on small context problems probably isn’t worth the squeeze. You’ll end up waiting around while the RLM explores context that could have simply been part of the prompt.</p>

<p>Further, RLMs do not solve other <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html">context fails</a>, like <em>context poisoning</em> or <em>context confusion</em>. If bad information is in your programmatic context, there’s good odds it could influence the RLM in undesirable ways.</p>

<hr />

<h3 id="the-next-chain-of-thought">The Next “Chain of Thought”?</h3>

<p>RLMs are slow, synchronous, and merely borrowing the current capabilities of models rather than leveraging models post-trained to be <em>good</em> at RLM patterns. There is <em>so</em> much low-hanging fruit here.</p>

<p>But that’s exactly what makes them exciting. <a href="https://www.dbreunig.com/2025/04/11/what-we-mean-when-we-say-think.html">Chain of thought</a> was also simple and general (just ask the model to “think step by step”) and it unlocked enormous latent potential in LLMs, that was only fully realized through the creation of reasoning models. RLMs have the same shape: a test-time strategy that’s easy to implement today and will only get better as models are trained to exploit it.</p>

<p>You probably don’t need to rush out and refactor your agents today. But if your agents touch large contexts, start experimenting with RLM traces today. You’ll learn something about your problem…and you might discover your next agent architecture in the output.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:repl">
      <p>“REPL” stands for “read-eval-print loop”. It is an interactive coding environment where one can enter arbitrary code and get back output. If you open your terminal and type <code class="language-plaintext highlighter-rouge">python</code>, you’ll find yourself in a REPL. <a href="#fnref:repl" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:qwen">
      <p>The team at MIT behind RLM <a href="https://huggingface.co/mit-oasys/rlm-qwen3-8b-v0.1">has just released a version of Qwen3-8B post-trained on RLM traces</a>. I hear it works pretty well, but no amount of fine-tuning or RL is going to help Qwen-8B code or reason as well as GPT or Opus. <a href="#fnref:qwen" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:python">
      <p>Both ChatGPT and Claude used to do this when asked, “How many R’s are in Strawberry,” though it appears both rely on reasoning or, in the case of ChatGPT, hide the previously visible Python code. <a href="#fnref:python" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:kimi">
      <p>I was continually amazed how well models would leverage their budgets. Kimi, in particular, wasn’t shy about ending early if the task proved simple. But it would also spend LLM sub-calls freely once it had a working approach, saturating my connection with Groq. <a href="#fnref:kimi" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="RLMs" /><category term="AI" /><category term="Context" /><category term="DSPy" /><summary type="html"><![CDATA[Handling Your Long Context Today &amp; Designing Your Agent Tomorrow]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/thumb_index.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/thumb_index.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The Rise of Spec Driven Development</title><link href="https://www.dbreunig.com/2026/02/06/the-rise-of-spec-driven-development.html" rel="alternate" type="text/html" title="The Rise of Spec Driven Development" /><published>2026-02-06T08:22:00-08:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2026/02/06/the-rise-of-spec-driven-development</id><content type="html" xml:base="https://www.dbreunig.com/2026/02/06/the-rise-of-spec-driven-development.html"><![CDATA[<p><img src="/img/free_puppies.jpg" alt="" /></p>

<p>It’s been a month since I launched <a href="https://www.dbreunig.com/2026/01/08/a-software-library-with-no-code.html"><code class="language-plaintext highlighter-rouge">whenwords</code></a>, and since then there’s been a flurry of experiments with <em>spec driven development</em> (SDD): using coding agents to implement software using only a detailed text <em>spec</em> and a collection of <a href="https://en.wikipedia.org/wiki/Conformance_testing">conformance tests</a>.</p>

<h3 id="github-could-use-a-docs-review-ui">Github Could Use a ‘Docs Review’ UI</h3>

<p>First off, despite <code class="language-plaintext highlighter-rouge">whenwords</code> being a couple Markdown docs and a YAML test set, people have submitted valuable PRs. <a href="https://github.com/mlafeldt">Mathias Lafeldt</a> spotted a disagreement about rounding, where the spec instructed the agent to round <em>up</em> in several scenarios, <a href="https://github.com/dbreunig/whenwords/commit/c7278227dac8688546da3740c7064af18a14a44a">but three tests were rounding down</a>. Others have suggested there should be some [CI][ci] (despite their being no code) and <a href="https://github.com/dbreunig/whenwords/issues/9">wonder what that should be</a>.</p>

<p>There’s been enough action on <a href="https://github.com/dbreunig/whenwords">the repo</a> to give us an idea of what open source collaboration could look like in a SDD world. And it feels more like commenting in and marking up a Google Doc than code merges. I would love to see Github lean into this and build richer Markdown review, like Word or Google Docs, allowing for easier collaboration and accessibility to a wider audience.</p>

<h3 id="emulation--porting-are-the-low-hanging-sdd-use-case">Emulation &amp; Porting are <em>the</em> Low-Hanging SDD Use Case</h3>

<p>By far, the hardest part of starting a SDD project is creating the tests. Which is why many developers are opting for borrowing existing test sets or deriving by referencing a source of truth.</p>

<p>Here’s a few examples:</p>

<ul>
  <li><a href="https://www.anthropic.com/engineering/building-c-compiler"><strong>Anthropic wrote a C compiler in Rust</strong></a>. They used <a href="https://gcc.gnu.org/onlinedocs/gccint/Torture-Tests.html">existing test suites</a> and used GCC as a source of truth for validation and generating new tests.</li>
  <li><a href="https://github.com/vercel-labs/just-bash"><strong>Vercel created a bash emulator in TypeScript</strong></a>. They created and curated an amazing set of <a href="https://github.com/vercel-labs/just-bash/tree/main/src/spec-tests">shell script spec tests</a> and <a href="https://x.com/cramforce/status/2015513111487553667?s=20">have been feeding these to Ralph</a>. (To make this even more meta, I’ve been following their commits and <a href="https://github.com/dbreunig/just-bash-py">Clauding them into Python</a>).</li>
  <li><a href="https://github.com/pydantic/monty"><strong>Pydantic created a Python emulator…in Python</strong></a>. This sounds silly, but it’s useful in the same way Vercel’s <code class="language-plaintext highlighter-rouge">just-bash</code> is: it’s a super lightweight sandbox for AI agents. (In fact, I’ve <a href="https://github.com/dbreunig/dspy-monty-interpreter">already wrapped it in a <code class="language-plaintext highlighter-rouge">CodeInterpretter</code></a> for use with DSPy’s <a href="https://alexzhang13.github.io/blog/2025/rlm/">RLM</a> module)</li>
</ul>

<p>Now… It’s worth noting that most of these examples didn’t emerge perfectly. Anthropic’s C-compiler just kinda punted on <a href="https://www.anthropic.com/engineering/building-c-compiler#:~:text=I%20tried%20(hard!)%20to%20fix%20several%20of%20the%20above%20limitations%20but%20wasn’t%20fully%20successful.%20New%20features%20and%20bugfixes%20frequently%20broke%20existing%20functionality.">the hard stuff</a> and admits the generated code is inefficient<sup id="fnref:clean"><a href="#fn:clean" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. Pydantic’s Python emulator lacks <code class="language-plaintext highlighter-rouge">json</code>, <code class="language-plaintext highlighter-rouge">typing</code>, <code class="language-plaintext highlighter-rouge">sys</code>, and other standard libraries. Though I’m sure those will come soon. Vercel’s <code class="language-plaintext highlighter-rouge">just-bash</code> sports <em>outstanding</em> coverage, though <a href="https://x.com/mitsuhiko/status/2018452351900934433?s=20">people continue to find bugs</a>.</p>

<p>This is the big takeaway from watching the last few weeks of SDD: agents and a pile of tests can get you <em>really</em> far, <em>really</em> fast, but for complex software they can’t get you over the line. Edge cases will generate new tests, truly hard problems will resist SDD implementation, and architectural issues will prohibit parallelism agents.</p>

<p>Vercel’s CTO and <code class="language-plaintext highlighter-rouge">just-bash</code> creator, Malte Ubl, <a href="https://x.com/cramforce/status/2018159890381017419">sums it up best</a>:</p>

<p><img src="/img/puppies.jpg" alt="Software is free now. (Free as in puppies)" /></p>

<p>You can <a href="https://ghuntley.com/ralph/">Ralph</a> up a port or emulator in a weekend or two, but <em>now you have to take care of it</em>.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:clean">
      <p>There is lots to pick apart in Anthropic’s piece (I have had multiple compiler and related people ping me about how misrepresentative it is), but the most laughable claim is that this is, “a clean-room implementation”. The idea that using an LLM trained on the entire internet, all of Github, and warehouses full of books is a clean room environment is absurd. <a href="#fnref:clean" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="OSS" /><category term="AI" /><category term="Spec Driven Development" /><category term="SDD" /><summary type="html"><![CDATA[]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/free_puppies.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/free_puppies.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">A Software Library with No Code</title><link href="https://www.dbreunig.com/2026/01/08/a-software-library-with-no-code.html" rel="alternate" type="text/html" title="A Software Library with No Code" /><published>2026-01-08T14:59:00-08:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2026/01/08/a-software-library-with-no-code</id><content type="html" xml:base="https://www.dbreunig.com/2026/01/08/a-software-library-with-no-code.html"><![CDATA[<p><img src="/img/ikea_instructions.jpg" alt="" /></p>

<h3 id="all-you-need-is-specs">All You Need is Specs?</h3>

<p>Today I’m releasing <a href="https://github.com/dbreunig/whenwords"><code class="language-plaintext highlighter-rouge">whenwords</code></a>, a relative time formatting library that contains <em>no code</em>.</p>

<p><code class="language-plaintext highlighter-rouge">whenwords</code> provides five functions that convert between timestamps and human-readable strings, like turning a UNIX timestamp into “3 hours ago”.</p>

<p>There are <em>many</em> libraries that perform similar functions. But none of them are language agnostic.</p>

<p><code class="language-plaintext highlighter-rouge">whenwords</code> supports Ruby, Python, Rust, Elixir, Swift, PHP, and Bash. I’m sure it works in other languages, too. <a href="https://github.com/dbreunig/whenwords-examples">Those are just the languages I’ve tried and tested</a>.</p>

<p>(I even implemented it as Excel formulas. Though that one requires a bit of work to install.)</p>

<p>But like I said: the <code class="language-plaintext highlighter-rouge">whenwords</code> library <em>contains no code</em>. Instead, <code class="language-plaintext highlighter-rouge">whenwords</code> contains specs and tests, specifically:</p>

<ul>
  <li><strong>SPEC.md</strong>: A detailed description of how the library should behave and how it should be implemented.</li>
  <li><strong>tests.yaml</strong>: A list of language-agnostic test cases, defined as input/output pairs, that any implementation must pass.</li>
  <li><strong>INSTALL.md</strong>: Instructions for building <code class="language-plaintext highlighter-rouge">whenwords</code>, for you, the human.</li>
</ul>

<p>The installation instructions are comically simple, just a prompt to paste into Claude, Codex, Cursor, whatever. It’s short enough to print here in its entirety:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Implement the whenwords library in [LANGUAGE].
<span class="p">
1.</span> Read SPEC.md for complete behavior specification
<span class="p">2.</span> Parse tests.yaml and generate a test file
<span class="p">3.</span> Implement all five functions: timeago, duration, parse_duration, 
   human_date, date_range
<span class="p">4.</span> Run tests until all pass
<span class="p">5.</span> Place implementation in [LOCATION]

All tests.yaml test cases must pass. See SPEC.md "Testing" section 
for test generation examples.
</code></pre></div></div>

<p>Pick your language, pick your location, copy, paste, and go.</p>

<hr />

<p>Okay. This is silly. But the more I play with it, the more questions and thoughts I have.</p>

<p>Recent advancements in coding agents are stunning. Opus 4.5 coupled with Claude Code isn’t perfect, but its ability to implement tightly specified code is uncanny. <a href="https://x.com/tobi/status/2009311882553765914">Models and their harnesses crossed a threshold in Q4</a>, and everyone I know using Opus 4.5 has felt it. There wasn’t a single language where Claude couldn’t implement <code class="language-plaintext highlighter-rouge">whenwords</code> in one shot. These capabilities are raising all sorts of questions, especially: <em>“What does software engineering look like when coding is free?”</em></p>

<p>I’ve chewed on this question a bit, but this “software library without code” is a tangible thought experiment that helped firm up a few questions and thoughts. Specifically:</p>

<p><strong>Do we still need 3rd party code libraries?</strong></p>

<p>There are <em>many</em> utility libraries that aim to perform similar functions, but exist as language-specific implementations. Do we need them all? Or do we need one, tightly defined set of rules which we implement on demand, according to the specific conventions of a given language and project? For libraries that are simple <em>utilities</em> (as opposed to complex <em>frameworks</em>), I think the answer might be, “Yes.”</p>

<p>Now, <code class="language-plaintext highlighter-rouge">whenwords</code> is (purposely) a very simple utility. It’s five functions, doesn’t require many dependencies, and depends on a well-defined standard (Unix time). It’s not an expensive operation, a poor implementation probably won’t be a bottleneck, and the written spec is only ~500 lines.</p>

<p>But there’s no reason we couldn’t get more complex. Well defined standards (like <a href="https://x.com/simonw/status/2009443591064859072">those you’d need to implement a browser</a>) can help you tackle complex bits of software relatively quickly. The question is: when does this model make sense and when <em>doesn’t it</em>?</p>

<p>Today, I see 5 reasons why you’d want libraries with code:</p>

<p><strong>1. When Performance Matters</strong></p>

<p>Let’s run with that browser example. There are well-defined, large specs for how to interpret HTML, JS, and CSS. One could push these further and deliver a spec-only browser.</p>

<p>But performance is going to be an issue. I want to open hundreds of tabs and not spring memory leaks. I want rendering to be quick, optimized to within an inch of what’s possible. I want a large group of users going out and encountering strange websites, buggy javascript, bad imports, and more. I want people finding these issues, fixing them, and memorializing them as code.</p>

<p><strong>2. When Testing is Complicated</strong></p>

<p>But Drew, you say, if we find performance issues in the spec-only browser we can just update the spec. That’s true, but testing updates gets complicated <em>fast</em>.</p>

<p>Let’s say you notice <code class="language-plaintext highlighter-rouge">whenwords</code> has a bug in its Elixir implementation. To fix the <code class="language-plaintext highlighter-rouge">whenwords</code> spec, you add a line to the <code class="language-plaintext highlighter-rouge">SPEC.md</code> file to prevent the Elixir bug. You submit a PR request and I’m able to verify it helps Claude build a working Elixir implementation.</p>

<p>But did the change screw up the other variants? Does <code class="language-plaintext highlighter-rouge">whenwords</code> still work for Ruby, Python, Bash, and Excel? Does it work for all of them when building with Claude and Codex? What about Qwen? Do we end up with a CI/CD pipeline that builds and tests our spec against 4 coding agents and 20 languages? Or do we just say, “Screw it,” and tell users they’re responsible for whatever code produced?</p>

<p>This isn’t a <em>huge</em> deal for a library with the scope of <code class="language-plaintext highlighter-rouge">whenwords</code>, but for anything moderately complex, the amount of surface area we’d want to test grows <em>quickly</em>. <code class="language-plaintext highlighter-rouge">whenwords</code> has 125 tests. For comparison, <a href="https://sqlite.org/testing.html">SQLite has 51,445 tests</a>. I’m not building on a spec-only implementation of a database.</p>

<p><strong>3. When You Need to Provide Support &amp; Bug Fixes</strong></p>

<p>Chasing down bugs is harder with spec-only libraries because failures are inconsistent.</p>

<p>Let’s imagine a future where we’re shipping enterprise software as a <a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview">Claude Skill</a>, or some other similar prepared context that lets agents implement our software for our customers, depending on their environment. This is basically our “software library with no code” taken to an extreme. While there may be benefits here, there are also perils.</p>

<p>Replicating bugs is nearly impossible. If the customer gets stuck on an issue <em>with their own generated codebase</em>, how do we have a hope of finding the problem? Do we just iterate on our spec and add plenty of tests, toss it over to them, and ask them to rebuild the whole thing? Probably not. The models remain probabilistic and as our specs grow the likelihood of our implementations being significantly different grows.</p>

<p><strong>4. When Updates Matter</strong></p>

<p>A library I like is <a href="https://www.litellm.ai">LiteLLM</a>, an AI gateway that provides one interface to call many LLMs across multiple platforms. They add new models quickly, push updates to address connection issues with different platforms, and are generally very responsive.</p>

<p>Other foundational libraries (like nginx, Rails, Postgres) push essential security updates. These are dependencies I <em>wish</em> to maintain. Spec-only libraries, on the other hand, likely work best for implement-and-forget utilities and functions. When continual fixes, support, and security aren’t needed or aren’t valued.</p>

<p><strong>5. When Community &amp; Interoperability Matter</strong></p>

<p>Running through all the points above is <em>community</em>. Lots of users mean more bugs are spotted. More contributors mean more bugs are fixed. Comprehensive testing means PRs are accepted faster. A big community increases the odds someone is available to help. Community support means code is kept up-to-date.</p>

<p>When you want these things, you want community. The code we rely on is not just an instantiation of a spec (a tightly defined set of concepts, aims, and requirements), but the product of people and culture that crystallize around a goal. It’s the magic of open source; why it works and why I love it.</p>

<p>For the job <code class="language-plaintext highlighter-rouge">whenwords</code> performs, we don’t need to belong to a club. But for foundations, the things we want to build on, the community is essential because it delivers the points above. Sure, there may be instances of spec-only libraries created and maintained by a vibrant community. But I imagine there will continually be a reference implementation that codifies and ties the spec to the ground.</p>

<hr />

<p>But the above isn’t fully baked. Our models will get better, our agents more capable. And I’m sure the list above is not exhaustive. I’d enjoy hearing your thoughts on this one, <a href="https://www.dbreunig.com/contact.html">do reach out</a>.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>]]></content><author><name>Drew Breunig</name></author><category term="OSS" /><category term="AI" /><category term="specs" /><summary type="html"><![CDATA[Do we still need libraries of 3rd party code when AI agents are this good?]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/ikea_instructions.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/ikea_instructions.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">2025 in Review: Jagged Intelligence Becomes a Fault Line</title><link href="https://www.dbreunig.com/2025/12/29/2025-in-review.html" rel="alternate" type="text/html" title="2025 in Review: Jagged Intelligence Becomes a Fault Line" /><published>2025-12-29T10:03:00-08:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/12/29/2025-in-review</id><content type="html" xml:base="https://www.dbreunig.com/2025/12/29/2025-in-review.html"><![CDATA[<p><img src="/img/fighting_a_centaur.jpg" alt="&quot;Theseus Fighting the Centaur Bianor&quot;, by Antoine-Louis Barye, 1867" /></p>

<h3 id="a-year-shaped-by-synthetic-data-dramatically-uneven-performance-and-reliability-issues">A year shaped by synthetic data, dramatically uneven performance, and reliability issues</h3>

<p>One of the reasons <a href="https://www.dbreunig.com/2025/12/27/why-i-write.html">why I write</a> is reflection. Looking over 2025’s work, there are consistent themes among the mess that help me understand the velocity of AI, its momentum and direction. I’m not going to polish this too much (if you want to dive in, check out the linked posts), but this exercise is quite clarifying to me.</p>

<p>Here’s the tl;dr:</p>

<ol>
  <li>Immediate AI risk comes from people over-estimating AI capabilities.</li>
  <li>Reliability and trust are the barriers preventing wide adoption.</li>
  <li>Evaluations remain underutilized.</li>
  <li>Synthetic data unlocked AI capabilities, but shapes its nature</li>
  <li>There is a growing AI perception gap between quantitative users and qualitative users.</li>
  <li>AI leaders are letting others define the story of AI.</li>
</ol>

<hr />

<h3 id="immediate-ai-risk-comes-from-people-over-estimating-ai-capabilities">Immediate AI risk comes from people over-estimating AI capabilities.</h3>

<p>There are many risks we should be conscious of, but the downsides that are biting us now come from people believing in AI capabilities or sentience that isn’t there. “<a href="https://www.dbreunig.com/2025/01/01/cyren.html">I don’t worry about superintelligent AGI’s taking over the world. I worry about bots convincing people they’re having an emotional connection when they’re not</a>.” This can be tied to teen suicides, senior scams, propagandist bots, and more. The natural language interface is wonderful for its flexibility and accessibility, but it exploits our evolutionary tendency to recognize humans where there are none.</p>

<p>This danger is more pronounced by our current human-in-the-loop design pattern. We’re asking laypeople to evaluate AI capabilities in fields which they <em>do not understand</em>. Too often I hear, “<a href="https://www.dbreunig.com/2025/04/08/on-ai-observational-comics.html">Chatbots know everything, but they make mistakes when it comes to things I know</a>.”</p>

<p><strong>Posts:</strong></p>

<ul>
  <li><a href="https://www.dbreunig.com/2025/01/01/cyren.html">Beware the Cyren’s Song</a></li>
  <li><a href="https://www.dbreunig.com/2025/04/08/on-ai-observational-comics.html">AI Chatbots Are Like Observational Comics</a></li>
</ul>

<hr />

<h3 id="reliability-and-trust-are-the-barriers-preventing-wide-adoption">Reliability and trust are the barriers preventing wide adoption.</h3>

<p>As we saw above, people can easily spot issues with AI when it’s working in their domain. Sure, we’ve come a <em>long</em> way this year, but these gains have mostly come from <a href="https://www.dbreunig.com/2024/10/18/the-3-ai-use-cases-gods-interns-and-cogs.html">Intern-style</a> applications. We keep the humans in the loop because humans are excellent at spotting and fixing issues the 10% of the time models flail.</p>

<p>But when that figure is higher than ~10% or so (these are finger-in-the-air numbers), people simply avoid the AI. Agents, especially custom enterprise ones, have <a href="https://www.dbreunig.com/2025/12/06/the-state-of-agents.html">a reliability problem</a> that hinders the development of the field. Teams that successfully ship agents do so by dialing back their complexity: chat interfaces, short tasks.</p>

<p>But we should consider reliability a means to an end; and that end is <em>trust</em>.</p>

<p>Trust is complex. It’s dependent on the task being done, the risk associated with the task, the UI that presents the task, and how the agent contextualizes the produced decision. <em>Reliability</em> can be measured at the model level, but <em>trust</em> has to be assessed end-to-end: from the model, to the application, to the user.</p>

<p>Frustratingly, there’s few good ways to measure trust in the AI era. We can do user interviews (and I know teams that do), but these are <em>slow</em>. UX research always has been, but their pace feels especially sluggish in the context of AI-powered development, Many teams can hack this by “vibe shipping” – making changes to their app, pushing to production, running a few queries, then repeating – basically doing the UX reseach by themselves, on themselves.</p>

<p>Everyone else should look to <em>delegation</em>. “<a href="https://www.dbreunig.com/2025/07/13/delegation-is-the-ai-metric-that-matters.html">Forget the benchmarks – the best way to track AI’s capabilities is to watch which decisions experts delegate to AI</a>.”</p>

<p><strong>Posts:</strong></p>

<ul>
  <li><a href="https://www.dbreunig.com/2025/07/13/delegation-is-the-ai-metric-that-matters.html">Delegation is the AI Metric that Matters</a></li>
  <li><a href="https://www.dbreunig.com/2025/09/15/ai-adoption-at-work-play.html">Enterprise AI Looks Bleak, But Employee AI Looks Bright</a></li>
  <li><a href="https://www.dbreunig.com/2025/12/06/the-state-of-agents.html">Enterprise Agents Have a Reliability Problem</a></li>
</ul>

<hr />

<h3 id="evaluations-remain-underutilized">Evaluations remain underutilized.</h3>

<p>At first I wrote, “under-appreciated.” But I think teams <em>get</em> why evaluations are valuable. The problem is most teams still don’t build them.</p>

<p>They get <a href="https://www.dbreunig.com/2025/01/08/evaluating-llms-as-knowledge-banks.html">the benefits</a>:</p>

<blockquote>
  <p>The real power of a custom eval isn’t just in model selection – it’s in the compound benefits it delivers over time. Each new model can be evaluated in hours, not weeks. Each prompt engineering technique can be tested systematically. And perhaps most importantly, your eval grows alongside your understanding of the problem space, becoming an increasingly valuable asset for your AI development.</p>
</blockquote>

<p>It used to be I had to argue that hand-tuned prompts would become overfit to a model. But OpenAI’s headline model deprecations this year pushed many teams to discover this empirically.</p>

<p>Despite this hiccup, many teams continue to push forward, hand-editing prompts and vibe shipping as they go. Pre-scale, this is likely optimal: the speed of iteration this allows is too valuable to ignore. As a result, <em>so many</em> teams I talk to who were previously focusing on evaluation tooling have pivoted to synthetic data creation or LLM-as-a-Judge services. Our AI capabilities have improved dramatically, but human behavior remains a constraint.</p>

<p><strong>Posts:</strong></p>

<ul>
  <li><a href="https://www.dbreunig.com/2025/01/08/evaluating-llms-as-knowledge-banks.html">Your Eval is More Important Than the Model</a></li>
  <li><a href="https://www.dbreunig.com/2025/01/30/mistral-small-human-centric-benchmarks.html">Mistral Small &amp; Human-Centric Benchmarks</a></li>
</ul>

<hr />

<h3 id="synthetic-data-unlocked-ai-capabilities-but-shapes-its-nature">Synthetic data unlocked AI capabilities, but shapes its nature.</h3>

<p>Investing in synthetic data creation unlocked AI capabilities in 2025. <a href="https://www.dbreunig.com/2024/12/18/synthetic-data-the-growing-ai-perception-divide.html">Rephrasing high quality content into reasoning and agentic chains kept the scaling party alive</a>. Generating <a href="https://www.dbreunig.com/2025/04/11/what-we-mean-when-we-say-think.html#the-strengths--limits-of-reasoning-models">new datasets for verifiable tasks</a> (like math and coding) helped AI coding apps evolve from better auto-complete services to async agents in less than a year.</p>

<p>Remember: Claude Code arrived in <em>February</em>.</p>

<p>Synthetic data did this. It provided the material needed for post-training, the mountains of examples necessary to upend an entire industry. But the limits of synthetic data, that it has been focused on quantitative tasks, <a href="https://www.dbreunig.com/2025/04/11/what-we-mean-when-we-say-think.html">greatly shapes our tools and discourse</a>:</p>

<blockquote>
  <p>Those who use AIs for programming will have a remarkably different view of AI than those who do not. The more your domain overlaps with testable synthetic data and RL, the more you will find AIs useful as an intern. This perception gap will cloud our discussions.</p>
</blockquote>

<p>The current solution, being deployed by frontier chatbots, is to treat everything they can as a programming problem. If ChatGPT or Claude can write a quick Python script to answer your question, it will. Context engineering challenges are being reframed as coding tasks: give a model a Python environment and let them explore, search, and read and write files. Yesterday’s harness is today’s environment. <a href="https://www.dbreunig.com/2025/12/19/how-model-use-has-changed-in-2025.html">In 2024 we called models, today we call systems</a>.</p>

<p>Scale was all we needed in 2024. Reasoning kept the party going in 2025. Coding will be the lever in 2026.</p>

<p><strong>Posts:</strong></p>

<ul>
  <li><a href="https://www.dbreunig.com/2025/01/20/on-authentic-data.html">On Authentic Data</a></li>
  <li><a href="https://www.dbreunig.com/2025/04/11/what-we-mean-when-we-say-think.html">What We Mean When We Say ‘Think’</a></li>
  <li><a href="https://www.dbreunig.com/2025/07/30/how-kimi-was-post-trained-for-tool-use.html">How Kimi K2 Became One of the Best Tool Using Models</a></li>
  <li><a href="https://www.dbreunig.com/2025/12/19/how-model-use-has-changed-in-2025.html">How Model Use Has Changed in 2025</a></li>
</ul>

<hr />

<h3 id="there-is-a-growing-ai-perception-gap-between-quantitative-users-and-qualitative-users">There is a growing AI perception gap between quantitative users and qualitative users.</h3>

<p>And this is the trillion dollar question: can we replicate our coding gains in qualitative fields? Can we generate synthetic data that unlocks better writing? Can we turn PowerPoint creation into a coding exercise? If we give GPT-5.2 a Python notebook can it write a better poem?</p>

<p>If these things can’t be solved with coding, there will be tremendous opportunity to improve the qualitative performance of models through other means. Doing so, however, will likely require solutions that <a href="https://www.dbreunig.com/2025/08/04/the-rise-of-opinionated-models.html">are opinionated rather than general</a>. Aesthetic performance requires subjective choices, not objective <em>correctness</em>.</p>

<p>But for now, the lopsided nature of today’s models is creating a world where programmers experience a <em>very</em> different AI than most ChatGPT users. The divide in capabilities between a free ChatGPT or Copilot account and Claude Code with Opus 4.5 is <em>vast</em>. Public conversations about AI are <em>deeply</em> unproductive because what you and I are experiencing is lightyears beyond the default experience.</p>

<p><strong>Posts:</strong></p>

<ul>
  <li><a href="https://www.dbreunig.com/2025/08/08/how-ai-coding-changes-product.html">AI-Assisted Coding Will Change Product Management</a></li>
  <li><a href="https://www.dbreunig.com/2025/04/11/what-we-mean-when-we-say-think.html">What We Mean When We Say ‘Think’</a></li>
  <li><a href="https://www.dbreunig.com/2025/04/18/the-wisdom-of-artificial-crowds.html">The Wisdom of Artificial Crowds</a></li>
  <li><a href="https://www.dbreunig.com/2025/07/31/how-kimi-rl-ed-qualitative-data-to-write-better.html">How Kimi K2 RL’ed Qualitative Data to Write Better</a></li>
  <li><a href="https://www.dbreunig.com/2025/08/04/the-rise-of-opinionated-models.html">FLUX.1-Krea &amp; the Rise of Opinionated Models</a></li>
</ul>

<hr />

<h3 id="ai-leaders-are-letting-others-define-the-story-of-ai">AI leaders are letting others define the story of AI.</h3>

<p>Compounding this problem is the fact that AI leaders aren’t even attempting to explain <em>how</em> AI works to the masses. <a href="https://x.com/dbreunig/status/2004319480386924889">I recently wrote</a>:</p>

<blockquote>
  <p>The AI ecosystem is repeating digital advertising’s critical mistake.</p>
</blockquote>

<blockquote>
  <p>One of the reasons the open online advertising ecosystem fell apart is because they terribly communicated how it all worked. The benefits of cross targeting were brushed over, because it was hard and complex to explain, and that left the door open for others to make privacy the <em>only</em> story, until it was too late. Which created the environment we have now, where most quality media is paywalled and only the giant platforms have sufficient scale for effective targeting.</p>
</blockquote>

<blockquote>
  <p>The AI industry is failing to explain how AI works. People and companies either brush it aside as complex and/or oversimplify it with over-promised metaphors (“A PHD in your pocket!”) These same people then get upset when critics keep wringing their hands about hallucinations, financial engineering, power and water consumption, and much more.</p>
</blockquote>

<p>AI leaders don’t invest in explanations because AI is hard to explain. Further, they’re incentivized to over-simplify and over-promise. Combine this withthe lightning speed of development (even <a href="https://x.com/karpathy/status/2004607146781278521">Karpathy feels left behind</a>!) and AI’s <a href="https://x.com/karpathy/status/1816531576228053133?lang=en">jagged intelligence</a> becomes into a fault line, threatening to rupture.</p>

<p><strong>Posts:</strong></p>

<ul>
  <li><a href="https://www.dbreunig.com/2025/01/31/deepseek-as-a-power-object.html">DeepSeek as a Power Object</a></li>
</ul>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>]]></content><author><name>Drew Breunig</name></author><category term="2025" /><category term="AI" /><summary type="html"><![CDATA[Looking back on 2025, the incredible pace of AI is stunning. But fast growth brings disconnects.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/fighting_a_centaur.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/fighting_a_centaur.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Why I Write (And You Should Too!)</title><link href="https://www.dbreunig.com/2025/12/27/why-i-write.html" rel="alternate" type="text/html" title="Why I Write (And You Should Too!)" /><published>2025-12-27T10:24:00-08:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/12/27/why-i-write</id><content type="html" xml:base="https://www.dbreunig.com/2025/12/27/why-i-write.html"><![CDATA[<p>Every now and then, people ask me why I write. I don’t get paid to write here, so it’s not immediately obvious why I keep writing.</p>

<p>I think writing is one of the most valuable things you can do, and I recommend everyone try it. Here’s why:</p>

<ol>
  <li><strong>It makes you a better thinker and communicator.</strong> Writing is a muscle. The more you write, the easier it gets, and your ability improves. You’ll learn to make clearer arguments, crisper explanations, and better empathize with your audience. These skills are applicable to everything.</li>
  <li><strong>You’ll get feedback that makes you a better writer.</strong> Feedback exposes weak arguments and strengthens the good ones. Plus, learning to listen to feedback is another skill that is universally applicable.</li>
  <li><strong>You’ll meet people interested in the same things you are.</strong> Looking through my correspondence, it’s amazing how many of my favorite people to chat with I met through writing online. (BTW, this goes both ways. If you read something that resonates with you online, write them a note thanking them and telling them what you liked!)</li>
  <li><strong>Your past thinking will be archived and searchable.</strong> This is more valuable than you think. If you invest time to hone a piece, you’ll turn back to it more often than you’d expect. Further, reviewing old pieces and threads over time will reveal what worked and what didn’t while making your progress tangible.</li>
  <li><strong>The value of your writing compounds.</strong> The value <em>to you</em>, that is. I don’t think my pieces from 6 years ago are improving anyone’s life, but the contacts I’ve made and pieces I’ve crafted have grown into a foundation I get to leverage everyday.</li>
  <li><strong>Writing gives you a license to explore and organize your thoughts.</strong> This is the fun bit. Chasing down an idea that interests you, forming questions and then investigating them; it’s a joy. The second most common question I get about my writing is, “How do you motivate yourself to write?” This is the answer. There are so many drafts that live, dormant in my draft folder. So many times I start a piece and lose interest. And then: something will <em>click</em> and I’ll draft, investigate, and finish a piece in an hour (here’s <a href="https://www.dbreunig.com/2025/12/19/how-model-use-has-changed-in-2025.html">two</a> <a href="https://www.dbreunig.com/2025/10/20/speeds-and-swarms.html">examples</a>). These aren’t always the most substantive pieces, but they keep the practice going and the momentum up.</li>
</ol>

<p>It’s hard to form new habits. But writing is s the best investment you can make <em>today</em>. Here’s a few tips for getting started:</p>

<ol>
  <li><strong>Be okay with bad writing.</strong> Most writing isn’t great! If my hit rate is 1 out of 5, I’m <em>thrilled</em>. Get comfortable publishing things that aren’t perfect. I know many people who wait too long to publish and, well, never do. They do this for <em>years</em>. If they’d gotten the ball rolling back then, they’d be better writers today. It’s weird: you’d think regular private writing would be sufficient to get better. But it isn’t. There’s no stakes. No feedback. The only way to get better is to ship. Some people worry about the risk of bad writing. I think the biggest risk comes from being an asshole (so don’t be an asshole!) But the actual risks are quite low: most bad writing is neutral, it remains unread.</li>
  <li><strong>You need to do the writing. Not AI.</strong> Writing is exercise. If I brought a forklift to the gym and used it to lift weights, what would be the point?</li>
  <li><strong>But AI is a wonderful editor.</strong> When you’re getting started, it’s intimidating to ask people for feedback on drafts. Thankfully, AI is great at this! Paste in your draft and prompt it with something like, “This is a blog draft where I am trying to argue X, read the piece and identify any spelling or grammar errors, places where I am not being clear or where a reader might be confused, or areas where my argument is weak.” Take it with a grain of salt, but this is usually very, very helpful.</li>
  <li><strong>Don’t overthink where to publish.</strong> Make pages public on Notion. Use Github pages. Use Substack, if you must. The only thing you cannot omit is an easy contact form and a way for someone to subscribe. I screwed this up for too long. RSS is not sufficient. Comments don’t count (in fact, turn them off). Let people email you with a form, one-on-one. The other thing to keep in mind is to pick something with low friction. If it takes too many steps to create a new post, you won’t.</li>
</ol>

<p>I hope you start a blog this year. Or revive an old one.</p>

<p>If you’d like some further advice, <a href="https://www.dbreunig.com/contact.html">feel free to reach out</a>!</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>]]></content><author><name>Drew Breunig</name></author><category term="writing" /><summary type="html"><![CDATA[Publicly writing is one of the most valuable things you can do. It improves your thinking, ability to communicate, and connects you to interesting people.]]></summary></entry><entry><title type="html">How Model Use Has Changed in 2025</title><link href="https://www.dbreunig.com/2025/12/19/how-model-use-has-changed-in-2025.html" rel="alternate" type="text/html" title="How Model Use Has Changed in 2025" /><published>2025-12-19T11:59:00-08:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/12/19/how-model-use-has-changed-in-2025</id><content type="html" xml:base="https://www.dbreunig.com/2025/12/19/how-model-use-has-changed-in-2025.html"><![CDATA[<h3 id="from-naked-model-endpoints-to-tool-using-reasoning-environment-endpoints">From ‘Naked’ Model Endpoints to Tool-Using, Reasoning Environment Endpoints</h3>

<p>I was poking around <a href="https://github.com/BerriAI/litellm">LiteLLM’s Github repository</a> and stumbled upon an interesting file. <a href="https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json"><code class="language-plaintext highlighter-rouge">model_prices_and_context_window.json</code></a> is a registry of all the models and inference providers you can call with LiteLLM. This is the core value of <a href="https://www.litellm.ai">LiteLLM</a>, wrapping this diverse array of models behind a consistent yet capable API, allowing applied AI builders to swap out models and providers without a major code rewrite.</p>

<p>This registry file is impressive, and well communicates the value of LiteLLM. It’s over 30,000 lines detailing over 2,000 model and provider combinations. At the top of the JSON file, LiteLLM provides a <code class="language-plaintext highlighter-rouge">sample_spec</code>, their schema for the information they store for each model. Curious, I poked into the repository’s commit history to see how this schema has evolved over the months.</p>

<p>And boy if this isn’t the story of LLMs in 2025:</p>

<p><img src="/img/litellm_schema_2025.jpg" alt="" /></p>

<p>On the left is the schema on January 1st, 2025. On the right is the schema today. The orange lines were added in 2025. The schema has doubled in size, as more and more tools and logic has been embedded in models and their providers. We aren’t just asking for text completion or chat, a good chunk of us are now hitting a single endpoint that can execute code, use a computer, manipulate files, and search the web. These types of calls are being made to an appliance, not a function, complete with its own environment to complete a task.</p>

<p>2025 may not have been the year of the agent, but perhaps it was the year of the tool.</p>

<p>Now, of course, this isn’t everyone. Such an appliance is essentially a blackbox that is difficult to eek reliability out of, if your agent or application is struggling. We still have and use ‘naked’ inference calls all the time.</p>

<p>But for human-in-the-loop chat apps, the surface area of what happens behind a model call is growing in size and structure.</p>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="2025" /><category term="tools" /><summary type="html"><![CDATA[We aren't just asking for text completion or chat, a good chunk of us are now hitting a single endpoint that can execute code, use a computer, manipulate files, and search the web.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/litellm_schema_2025.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/litellm_schema_2025.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Enterprise Agents Have a Reliability Problem</title><link href="https://www.dbreunig.com/2025/12/06/the-state-of-agents.html" rel="alternate" type="text/html" title="Enterprise Agents Have a Reliability Problem" /><published>2025-12-06T09:39:00-08:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/12/06/the-state-of-agents</id><content type="html" xml:base="https://www.dbreunig.com/2025/12/06/the-state-of-agents.html"><![CDATA[<p><img src="/img/reliable.jpg" alt="" /></p>

<h3 id="enterprise-agents-struggle-to-reach-production-or-find-adoption-due-to-reliability-concerns">Enterprise agents struggle to reach production or find adoption due to reliability concerns</h3>

<p>Throughout 2025, there’s been a steady drumbeat of reports on the state of AI in the enterprise. On the surface, many appear to disagree. But dig in a little bit, look at how each report was assembled and how they defined their terms and you’ll find a consistent story: <strong>adoption of 3rd party AI apps is surging while 1st party development struggles to find success</strong>.</p>

<p>If you’re short on time, here’s the tl;dr:</p>

<ol>
  <li>Off-the-shelf AI tools are widely used and valued within the enterprise. (<a href="https://knowledge.wharton.upenn.edu/special-report/2025-ai-adoption-report/">Wharton/GBK’s AI Adoption Report</a>)</li>
  <li>But internal AI pilots fail to earn adoption. (<a href="https://docs.google.com/forms/d/e/1FAIpQLSc8rU8OpQWU44gYDeZyINUZjBFwu--1uTbxixK_PRSVrfaH8Q/viewform">MIT NANDA’s report</a>)</li>
  <li>Very few enterprise agents make it past the pilot stage into production. (<a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai#/">McKinsey’s State of AI</a>)</li>
  <li>To reach production, developers compromise and build simpler agents to achieve reliability. (<a href="https://arxiv.org/abs/2512.04123v1">UC Berkeley’s MAP</a>)</li>
</ol>

<p>The few custom agents that make it past the gauntlet figure out how to achieve reliability, earn employee trust, and <em>actually find usage</em>. Reliability is the barrier holding back agents, and right now the best way to achieve it is scaling back ambitions.</p>

<hr />

<p>Let’s start with <a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/">the notorious MIT NANDA report</a> which generated the headline, “<a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/">95% of generative AI pilots at companies are failing</a>.”</p>

<p>Plenty have <a href="https://www.futuriom.com/articles/news/why-we-dont-believe-mit-nandas-werid-ai-study/2025/08">criticized the methodology and conclusions NANDA reaches</a>, but I tend to believe most of the claims in <a href="https://docs.google.com/forms/d/e/1FAIpQLSc8rU8OpQWU44gYDeZyINUZjBFwu--1uTbxixK_PRSVrfaH8Q/viewform">the report</a> provided we keep in mind <em>who</em> was surveyed and understand that “AI pilots” were defined as <em>internally developed applications</em>. Keep this in mind as you review the following two figures:</p>

<p><img src="/img/nanda_chart.jpg" alt="MIT NANDA's study finds that business leaders can't get employees to adopt internal AI tools. Meanwhile, employees regularly use LLMs elsewhere." /></p>

<p><a href="https://www.dbreunig.com/2025/09/15/ai-adoption-at-work-play.html">I wrote in September</a>:</p>

<blockquote>
  <p>For all the criticism of the NANDA report, it is a survey of many business leaders. We can treat it as such. So while we might take that 95% figure with a grain of salt, we can trust that business leaders believe the biggest reason their AI pilots are failing is because their employees are unwilling to adopt new tools… While 90% of employees surveyed eagerly use AI tools they procure themselves.</p>
</blockquote>

<p>Internal applications struggle, while employee-driven use of ChatGPT and Claude is booming.</p>

<hr />

<p>Wharton and GBK’s annual <a href="https://knowledge.wharton.upenn.edu/special-report/2025-ai-adoption-report/">AI adoption report</a> appears to counter NANDA with claims that, “AI is becoming deeply integrated into modern work.” 82% of enterprise leaders use Gen AI weekly and 89% “believe Gen AI augments work.”</p>

<p>The Wharton report is an interesting read that details how people are using AI tools throughout their workday. But these are overwhelmingly 3rd party tools:</p>

<p><img src="/img/wharton_ai_25_02.jpg" alt="Off-the-shelf chatbot tools dominate enterprise AI usage, according to Wharton's annual survey." /></p>

<p>ChatGPT, Copilot, and Gemini dominate usage (Claude ranks surprisingly low, likely a function of Wharton’s respondent base). Custom chatbots see less usage than ChatGPT, and even then: the “by/for” in “built specifically by/for my organization” is doing a lot of work.</p>

<p>10 slides later, the report states (emphasis mine), “Customized Gen AI Solutions <em>May</em> be Coming as Internal R&amp;D Reaches One-Third of Tech Budgets.” The money is being deployed, but customized AI has yet to arrive at scale.</p>

<hr />

<p>Though they appear to disagree, both reports support a common conclusion: <strong>adoption of off-the-shelf tools is growing and valued, but companies struggle to build their own AI tools</strong>. Every enterprise AI report I read brings this reality further into focus.</p>

<p>Google Cloud’s “<a href="https://services.google.com/fh/files/misc/google_cloud_ai_trends.pdf">AI Business Trends</a>” report says agents are being widely used… But their definition of “agent” includes ChatGPT, CoPilot, and Claude.</p>

<p>McKinsey’s “<a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai#/">State of AI</a>” doesn’t include off-the-shelf tools in their survey, and &lt;10% of respondents report having agents beyond the pilot stage.</p>

<p><img src="/img/mckinsey_ai_2025.jpg" alt="Less than 10% of respondents report internal AI tools beyond the pilot phase, according to McKinsey." /></p>

<hr />

<p>So why is it hard for enterprises to build AI tools? In short: <strong>reliability</strong>.</p>

<p>“<a href="https://arxiv.org/abs/2512.04123v1">Measuring Agents in Production</a>”, recent research led by Melissa Pan, brings this to life by surveying over 300 teams who actually have agents in production. The headline?</p>

<blockquote>
  <p>Reliability remains the top development challenge, driven by difficulties in ensuring and evaluating agent correctness.</p>
</blockquote>

<p>Rather than develop technical innovations to address this issue, developers dial down their agent ambitions and adopt simple methods and workflows. Most use off-the-shelf large models, with no fine-tuning, and hand-tuned prompts. Agents have short run-times, with 68% of agents executing fewer than 10 steps before requiring human intervention. Chatbot UX dominates, because it keeps a human in the loop: 92.5% of in-production agents deliver their output to humans, not to other software or agents. Pan writes, “<strong>Organizations deliberately constrain agent autonomy to maintain reliability</strong>.”</p>

<p><img src="/img/map_charts.png" alt="Agents in production use shorter prompts and few steps." /></p>

<p>This aligns with data released by OpenRouter this week, in their “<a href="https://openrouter.ai/state-of-ai">State of AI</a>” report. This report analyzed ~100 trillion tokens passing through OpenRouter, using a projection technique to categorize them by use case.</p>

<p>Prompt and sequence<sup id="fnref:sequence"><a href="#fn:sequence" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> lengths are steadily growing for programming use cases, while all other categories remain stagnant:</p>

<p><img src="/img/openrouter_programming_vs_world.jpg" alt="LLM prompt complexity is stagnant, except for coding agents, according to OpenRouter." /></p>

<p>The figures above nicely support Pan’s conclusion that agent builders are keeping their agents simple and short to achieve reliability. Outside of coding agents (whose outlier success is a worth a separate discussion), prompts and agent sequence complexity is stagnant.</p>

<p>And these are the agents that make it into production! MIT NANDA showed that leaders say employee “unwillingness to adopt new tools” is the top barrier facing AI pilots. Pan’s results suggest a more sympathetic explanation: <em>when tools are unreliable, employees don’t adopt them</em>. They’re not stubborn; they’re rational.</p>

<p>In the short term, successful teams will build agents with constrained scope, earn trust, then expand. Delivering on bigger ambitions means building and sharing better tools for reliable AI engineering.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:sequence">
      <p>“Sequence length is a proxy for task complexity and interaction depth.” <a href="#fnref:sequence" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="agents" /><category term="enterprise" /><category term="2025" /><summary type="html"><![CDATA[Every enterprise AI report tells the same story once you look past the headlines: chatbot adoption is booming, internal pilots are failing. Reliability is the barrier holding agents back.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/reliable.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/reliable.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Don’t Fight the Weights</title><link href="https://www.dbreunig.com/2025/11/11/don-t-fight-the-weights.html" rel="alternate" type="text/html" title="Don’t Fight the Weights" /><published>2025-11-11T08:33:00-08:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/11/11/don-t-fight-the-weights</id><content type="html" xml:base="https://www.dbreunig.com/2025/11/11/don-t-fight-the-weights.html"><![CDATA[<p><img src="/img/fighting_a_centaur.jpg" alt="&quot;Theseus Fighting the Centaur Bianor&quot;, by Antoine-Louis Barye, 1867" /></p>

<p>For the first year or so, one of the most annoying problems faced by building with AI was getting them to generate output with consistent formatting. Go find someone who was working with AI in 2023 and ask them what they did to <em>try</em> to get LLMs to consistently output JSON. You’ll get a thousand-yard stare before hearing about all-caps commands, threats towards the LLM, promises of <em>bribes</em> for the LLM, and (eventually) resorting to regular expressions.</p>

<p>Today, this is mostly a solved problem, but the <em>cause</em> of this issue remains, frustrating today’s context engineers. It’s a context failure I missed in my <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html">original list</a>. I call it <strong>Fighting the Weights</strong>: when the model won’t do what you ask because you’re working against its training.</p>

<hr />

<p>In 2020, OpenAI unveiled GPT-3 alongside a key paper: “<a href="https://arxiv.org/abs/2005.14165">Language Models are Few-Shot Learners</a>.” In this paper, OpenAI researchers showed that LLMs as large as GPT-3 (10x larger than previous language models) could perform tasks when provided with only a few examples. At the time, this was earth-shaking.</p>

<p>Pre-GPT-3, language models were only useful after they’d been fine-tuned for specific tasks; after their <em>weights</em> had been modified. But GPT-3 showed that with enough scale, LLMs could be problem-solving generalists if provided with a few examples. In OpenAI’s paper they coined the term “<strong>in-context learning</strong>” to describe an LLM’s ability to perform new types of tasks using examples and instructions contained in the prompt.</p>

<p>Today, <strong>in-context learning</strong> is a standard trick in any context engineer’s toolkit. Provide a few examples illustrating what you want back, given an input, and trickier tasks tend to get more reliable. They’re especially helpful when we need to induce a specific format or style or convey a pattern that’s difficult to explain<sup id="fnref:claude"><a href="#fn:claude" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>.</p>

<p>When you’re not providing examples, you’re relying on the model’s inherent knowledge base and weights to accomplish your task. We sometimes call this “<strong>zero-shot prompting</strong>” (as opposed to <em>few</em> shot<sup id="fnref:shot"><a href="#fn:shot" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>) or “<strong>instruction-only prompting</strong>”.</p>

<p>In general, prompts fall into these two buckets:</p>

<ol>
  <li><strong>Zero-Shot or Instruction-Only Prompting:</strong> You provide instructions <em>only</em>. You’re asking the model to apply knowledge and behavioral patterns that are encoded in its weights. If this produces unreliable results, you might use…</li>
  <li><strong>Few-Shot or In-Context Learning:</strong> You provide instructions <em>plus examples</em>. You’re demonstrating a new behavioral pattern for the model to apply. The examples in the context <em>augment</em> the weights, providing them with details for a task it hasn’t seen.</li>
</ol>

<p>But there’s a third case: when the model <em>has</em> seen examples of the behavior you’re seeking, but it’s been trained to do the opposite of what you want. This is <em>worse</em> than the model having no knowledge of a pattern, because what it knows is at odds with your goal.</p>

<p>I call this <strong>fighting the weights</strong>.</p>

<p>Here’s some ways we end up fighting the weights:</p>

<ul>
  <li><strong>Format Following:</strong> You want the model to output only JSON, but often it will provide some text explaining the JSON and wrap the JSON in Markdown code blocks. This happens because the model’s post-training taught it to be conversational. When ChatGPT first launched, this problem was <em>rough</em>. GPT-3.5 had been heavily trained by humans to converse in a friendly, explanatory manner. So it did – even when you asked it not to. This doesn’t happen as much as it used to, but we’ll occasionally run into this issue when using unique formats or when using smaller models.</li>
  <li><strong>Tool Usage Formatting:</strong> As model builders start training their models to use tools, via reinforcement learning, they select specific formats and conventions. If your environment doesn’t follow these conventions, the model often fails to call tools correctly. I first noticed this while testing Mistral’s <a href="https://huggingface.co/mistralai/Devstral-Small-2505">Devstral-Small</a>, which was <a href="https://huggingface.co/mistralai/Devstral-Small-2505/discussions/9">trained with the tool-calling format</a> <a href="https://openhands.dev">All Hands</a> uses. When I tried to use Devstral with <a href="https://cline.bot">Cline</a>, it failed basic tasks. Last month this came up when a friend was trying Kimi K2 with a DSPy pipeline. By default, DSPy formats prompts with a <a href="https://dspy.ai/api/adapters/ChatAdapter/">Markdown-style template</a>. When this pipeline was driven by K2, formatting failed. Thanks to my recent <a href="https://www.dbreunig.com/2025/07/30/how-kimi-was-post-trained-for-tool-use.html">dive into how Moonshine trained K2 to use tools</a>, I knew K2 was trained with XML formatting. Switching DSPy to XML formatting solved the problem instantly.</li>
  <li><strong>Tone Changes:</strong> It’s really hard to apply consistent tone instructions to LLMs. Sure, we can make them talk like a pirate or in pig-latin, but subtle notes are overwhelmed by the model’s conversational post-training. For example, here’s the one note I give Claude in my settings: “Don’t go out of your way to patronize me or tell me how great my ideas are.” This does <em>not</em> stop Claude from replying with cloying phrases like, “Great idea!” when I suggest changes.</li>
  <li><strong>Overactive Alignment:</strong> Speaking of Claude: I appreciate Anthropic’s concern for alignment and safety in their models, but these guardrails can be overzealous. A recent example comes from Armin Ronacher, <a href="https://x.com/mitsuhiko/status/1986833561287024897">who tried several different approaches to get Claude Code to modify a medical form PDF while debugging PDF editing software</a>. Armin asked several different ways, but Claude’s post-training alignment refused to budge.</li>
  <li><strong>Over Relying On Weights:</strong> Models are trained to utilize the knowledge encoded in their weights. But there are many times when you want them to <em>only</em> answer with information provided in the context. Perusing <a href="https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools">leaked system prompts</a>, you can see how many instructions each chatbot maker gives when it comes to <em>when</em> models should search to obtain more info. The models have been trained to use their weights, so plenty of reiteration and examples are needed. This problem is especially tricky when building RAG systems, when the model should only form answers based on information obtained from specific databases. Companies like <a href="https://contextual.ai">Contextual</a> end up having to fine-tune their models to ensure they only answer with fetched information.</li>
</ul>

<p>Perhaps my favorite example I’ve seen was from ChatGPT. Previously, you could turn on the web inspector in your browser and watch the LLM calls fly by as you used the chatbot. This was handy for seeing when additional messages were added, that you didn’t write. When you asked ChatGPT to generate an image, it would clean up or <a href="https://www.dbreunig.com/2025/03/16/overcoming-bad-prompts-with-help-from-llms.html">even improve your image prompt</a>, create the image, then <a href="https://x.com/dbreunig/status/1952051780424196513/photo/1">append the following instructions</a>:</p>

<blockquote>
  <p>GPT-4o returned 1 images. From now on, do not say or show ANYTHING. Please end this turn now. I repeat: From now on, do not say or show ANYTHING. Please end this turn now. Do not summarize the image. Do not ask followup question. Just end the turn and do not do anything else.</p>
</blockquote>

<p>This is textbook fighting the weights. The models powering ChatGPT have been post-trained heavily to always explain and prompt the user for follow up actions. To fight these weights, ChatGPT’s devs have to tell the model EIGHT TIMES to just, please, <em>shut up.</em></p>

<hr />

<p>For context and prompt engineers (and even chatbot users) it’s helpful to be able to recognize when you’re <em>fighting the weights</em>.</p>

<p>Here’s some signs you might be fighting the weights:</p>

<ul>
  <li>The model makes the same mistake, even as you change the instructions.</li>
  <li>The model acknowledges its mistake when pointed out, then repeats it.</li>
  <li>The model seems to ignore the few-shot examples you provide.</li>
  <li>The model gets 90% of the way there, but no further.</li>
  <li>You find yourself repeating instructions several times.</li>
  <li>You find yourself typing in ALL CAPS.</li>
  <li>You find yourself threatening or pleading with the model.</li>
</ul>

<p>In these scenarios, you’re probably fighting the weights. Recognize the situation and try another tack:</p>

<ul>
  <li>Try another approach for the same problem.</li>
  <li>Break your task into smaller chunks. At the very least, you might identify the ask that clashes.</li>
  <li>Try another model, ideally from a different family.</li>
  <li>Add validation functions or steps. I’ve seen RAG pipelines that perform a final check to ensure the answer exists in the fetched data.</li>
  <li>Try a longer prompt. It can help in this scenario, as <a href="https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html#context-distraction">longer contexts can overwhelm the weights</a>.</li>
  <li>Consider fine-tuning. In fact, most fine-tuning I encounter is done to address ‘weight fighting’ scenarios, like tone or format adherence.</li>
</ul>

<p>Or, if you’re a model building shop, you can just address your issues during your next model’s post-training. Which seems to be part of <a href="https://www.dbreunig.com/2025/06/03/comparing-system-prompts-across-claude-versions.html">their development cycle</a>…and perhaps why we can get clean JSON out of modern models.</p>

<p>But few of us have that option.</p>

<p>For the rest of us: learn to recognize when you’re fighting the weights, so you can try something else.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:claude">
      <p>For example, <a href="https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools/blob/50b1893b9d3c8bdf6dbb77e660419e7177409728/Anthropic/Sonnet%204.5%20Prompt.txt#L256C1-L257C1">Claude Sonnet 4.5’s system prompt</a> provides detailed instructions about <em>when</em> to use search tools to answer a user’s query. This is a hard task to prompt correctly. You want the model to rely on its existing knowledge base as much as possible to deliver fast answers, but to readily use web search for timely information or information not in the model’s weights. Besides giving instructions, Anthropic provides examples illustrating more subtle edge cases. <a href="#fnref:claude" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:shot">
      <p>“Shot” is hold-over jargon from the machine learning community. There’s some nuance here, but unless you’re actively collaborating with ML engineers, you can just swap “example” in anytime you see “shot”. <a href="#fnref:shot" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="context engineering" /><summary type="html"><![CDATA[When your context goes against a model's training, you struggle to get the output you need. Learn to recognize when you're fighting the weights so you can do something different.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/fighting_a_centaur.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/fighting_a_centaur.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Glimpses of the Future: Speed &amp;amp; Swarms</title><link href="https://www.dbreunig.com/2025/10/20/speeds-and-swarms.html" rel="alternate" type="text/html" title="Glimpses of the Future: Speed &amp;amp; Swarms" /><published>2025-10-20T08:15:00-07:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/10/20/speeds-and-swarms</id><content type="html" xml:base="https://www.dbreunig.com/2025/10/20/speeds-and-swarms.html"><![CDATA[<p><img src="/img/fast_tempo.jpg" alt="Happiness in Fast Tempo, by Walter Quirt" /></p>

<p>If you experiment with new tools and technologies, every so often you’ll catch a glimpse of the future. Most of the time, tinkering is just that — fiddly, half-working experiments. But occasionally, something clicks, and you can see the shift coming.</p>

<p>In the last two months, I’ve experienced this twice while coding with AI. Over the next year, I expect AI-assisted coding to get <em>much faster</em> and <em>more concurrent</em>.</p>

<hr />

<h3 id="speed-changes-how-you-code">Speed Changes How You Code</h3>

<p>Last month, I embarked on an AI-assisted code safari. I tried different applications (Claude Code, Codex, Cursor, Cline, Amp, etc.) and different models (Opus, GPT-5, Qwen Coder, Kimi K2, etc.), trying to get a better lay of the land. I find it useful to take these macro views occasionally, time-boxing them explicitly, to build a mental model of the domain and to prevent me from getting rabbit-holed by tool selection during project work.</p>

<p>The takeaway from this safari was that we are undervaluing speed.</p>

<p>We talk constantly about model accuracy, their ability to reliably solve significant PRs, and their ability to solve bugs or dig themselves out of holes. Coupled with this conversation is the related discussion about what we do while an agent churns on a task. We sip coffee, catch up on our favorite shows, or <a href="https://mitchellh.com/writing/non-trivial-vibing">make breakfast for our family</a> all while the agent chugs away. Others spin up <em>more</em> agents and attack multiple tasks at once, across a grid of terminal windows. Still others go full async, handing off Github issues to OpenAI’s Codex, which works in the cloud by itself… often for hours.</p>

<p>Using the largest, slowest model is a good idea when tackling a particularly sticky problem or when you’re planning your initial approach, but a good chunk of coding can be handled by smaller, cheaper, <em>faster</em> models.</p>

<p>How much faster? Let’s take the extreme: Qwen 3 Coder 480B runs at <em>2,000 tokens/second</em> on <a href="https://www.cerebras.ai/blog/qwen3-coder-480b-is-live-on-cerebras">Cerebras</a>. That’s 30 times faster than Claude 4.5 Sonnet and 45 times faster than Claude Opus 4.1. It Qwen 3 Coder takes <em>4 seconds</em> to write 1,000 lines of JavaScript; Sonnet needs <em>2 minutes</em>.</p>

<p>No one is arguing Qwen 3 Coder 480B is a more capable model than Sonnet 4.5 (except maybe Qwen and Cerebras… 🤔). But at this speed, your workflow radically changes. I found myself chunking problems into smaller steps, chatting in near real-time with the model as code just appeared and was tested. There was no time for leaning back or sipping coffee. My hands never left the keyboard.</p>

<p>At 30x speed, you experiment more. When the agent is slow there’s a fear that holds you back from trying random things. You experiment less because having to wait a couple of minutes isn’t worth the risk. But with Qwen 3, I found myself firing away with little hesitation, rolling back failures, and trying again.</p>

<p>After Qwen 3, Claude feels like <em>molasses</em>. I still use it for big chunks of work, where I’m fine letting it churn for a bit, but for scripting and frontend it’s hard to give up Qwen’s (or <a href="https://www.youtube.com/watch?v=uebFDyX3e98">Kimi K2’s</a>) speed. For tweaking UI –– editing HTML and CSS – speed coupled with a hot-reloader is incredible.</p>

<p>I recommend everyone give Qwen 3 Coder a try, <a href="https://inference-docs.cerebras.ai/integrations/cline">especially the free-tier hosted on Cerebras and harnessed with Cline</a>. If only to see how your behavior adjusts with immediate feedback.</p>

<hr />

<h3 id="swarms-speed-up-slow-models-but-thrive-with-conventions">Swarms Speed Up Slow Models (But Thrive with Conventions)</h3>

<p>To mitigate slow models, many fire up more terminal windows.</p>

<p>Peter Steinberger recently wrote about his usual setup, which illustrates this well:</p>

<blockquote>
  <p>I’ve completely moved to codex cli as daily driver. I run between 3-8 in parallel in a 3x3 terminal grid, most of them in the same folder, some experiments go in separate folders. I experimented with worktrees, PRs but always revert back to this setup as it gets stuff done the fastest.</p>
</blockquote>

<p>The main challenge with multi-agent coding is handling Git conflicts. <a href="https://x.com/steipete/status/1977498385172050258">Peter relies on atomic commits</a>, while others go further. Chris Van Pelt at Weights &amp; Biases built <a href="https://github.com/wandb/catnip">catnip</a>, which uses containers to manage parallel agents. Tools like <a href="https://github.com/ruvnet/claude-flow">claude-flow</a> and <a href="https://github.com/parruda/claude-swarm">claude-swarm</a> use <a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html">context management tactics</a> like RAG, tool loadout, and context quarantining to orchestrate “teams” of specialist agents.</p>

<p>Reading the previous list, we can see the appeal of Peter’s simple approach: nailing down atomic commit behaviors lets him drop into any project and start working. The swarm framework approach requires setup, which can be worth it for major projects.</p>

<p>However, what I’m excited about is when we can build swarm frameworks for common environments. This reduces swarm setup time to near zero, while yielding significantly more effective agents. It’s the agentic coding equivalent of “<a href="https://en.wikipedia.org/wiki/Convention_over_configuration">convention over configuration</a>”, allowing us to pre-fill context for a swarm of agents.</p>

<p>This pattern — using conventions to standardize how agents collaborate — naturally aligns with frameworks that already prize convention over configuration. Which brings us to Ruby on Rails.</p>

<p><a href="https://x.com/obie">Obie Fernandez</a> recently released a swarm framework for Rails, <a href="https://github.com/obie/claude-on-rails">claude-on-rails</a>. It’s a preconfigured <a href="https://github.com/parruda/claude-swarm">claude-swarm</a> setup, coupled with an MCP server loaded with documentation matching to your project’s dependencies.</p>

<p>It works <em>extraordinarily</em> well.</p>

<p>Like our experiments with the speedy Qwen 3, <a href="https://github.com/obie/claude-on-rails">claude-on-rails</a> changes how you prompt. Since the swarm is preloaded with Rails-specific agents and documentation, you can provide <em>much</em> less detail when prompting. There’s little need to specify implementation details or approaches. It just cracks on, assuming Rails conventions, and delivers an incredibly high batting average.</p>

<p>To handle the dreaded Git conflicts, <a href="https://github.com/obie/claude-on-rails">claude-on-rails</a> takes advantage of the standard Rails directory structure and isolates agents to specific folders.</p>

<p>Here’s a sample of how <a href="https://github.com/obie/claude-on-rails">claude-on-rails</a> defines the roles in its <a href="https://github.com/parruda/claude-swarm">swarm</a>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">architect</span><span class="pi">:</span>
  <span class="na">description</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Rails</span><span class="nv"> </span><span class="s">architect</span><span class="nv"> </span><span class="s">coordinating</span><span class="nv"> </span><span class="s">full-stack</span><span class="nv"> </span><span class="s">development</span><span class="nv"> </span><span class="s">for</span><span class="nv"> </span><span class="s">DspyRunner"</span>
  <span class="na">directory</span><span class="pi">:</span> <span class="s">.</span>
  <span class="na">model</span><span class="pi">:</span> <span class="s">opus</span>
  <span class="na">connections</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">models</span><span class="pi">,</span> <span class="nv">controllers</span><span class="pi">,</span> <span class="nv">views</span><span class="pi">,</span> <span class="nv">stimulus</span><span class="pi">,</span> <span class="nv">jobs</span><span class="pi">,</span> <span class="nv">tests</span><span class="pi">,</span> <span class="nv">devops</span><span class="pi">]</span>
  <span class="na">prompt_file</span><span class="pi">:</span> <span class="s">.claude-on-rails/prompts/architect.md</span>
  <span class="na">vibe</span><span class="pi">:</span> <span class="no">true</span>
<span class="na">models</span><span class="pi">:</span>
  <span class="na">description</span><span class="pi">:</span> <span class="s2">"</span><span class="s">ActiveRecord</span><span class="nv"> </span><span class="s">models,</span><span class="nv"> </span><span class="s">migrations,</span><span class="nv"> </span><span class="s">and</span><span class="nv"> </span><span class="s">database</span><span class="nv"> </span><span class="s">optimization</span><span class="nv"> </span><span class="s">specialist"</span>
  <span class="na">directory</span><span class="pi">:</span> <span class="s">./app/models</span>
  <span class="na">model</span><span class="pi">:</span> <span class="s">sonnet</span>
  <span class="na">allowed_tools</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">Read</span><span class="pi">,</span> <span class="nv">Edit</span><span class="pi">,</span> <span class="nv">Write</span><span class="pi">,</span> <span class="nv">Bash</span><span class="pi">,</span> <span class="nv">Grep</span><span class="pi">,</span> <span class="nv">Glob</span><span class="pi">,</span> <span class="nv">LS</span><span class="pi">]</span>
  <span class="na">prompt_file</span><span class="pi">:</span> <span class="s">.claude-on-rails/prompts/models.md</span>
<span class="na">views</span><span class="pi">:</span>
  <span class="na">description</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Rails</span><span class="nv"> </span><span class="s">views,</span><span class="nv"> </span><span class="s">layouts,</span><span class="nv"> </span><span class="s">partials,</span><span class="nv"> </span><span class="s">and</span><span class="nv"> </span><span class="s">asset</span><span class="nv"> </span><span class="s">pipeline</span><span class="nv"> </span><span class="s">specialist"</span>
  <span class="na">directory</span><span class="pi">:</span> <span class="s">./app/views</span>
  <span class="na">model</span><span class="pi">:</span> <span class="s">sonnet</span>
  <span class="na">connections</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">stimulus</span><span class="pi">]</span>
  <span class="na">allowed_tools</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">Read</span><span class="pi">,</span> <span class="nv">Edit</span><span class="pi">,</span> <span class="nv">Write</span><span class="pi">,</span> <span class="nv">Bash</span><span class="pi">,</span> <span class="nv">Grep</span><span class="pi">,</span> <span class="nv">Glob</span><span class="pi">,</span> <span class="nv">LS</span><span class="pi">]</span>
  <span class="na">prompt_file</span><span class="pi">:</span> <span class="s">.claude-on-rails/prompts/views.md</span>
</code></pre></div></div>

<p>The <a href="https://github.com/parruda/claude-swarm">claude-swarm</a> config lets you define each role’s tool loadout, model, available directories, which other roles it can communicate with, and provide a custom prompt. Defining a swarm is a significant amount of work, but the conventions of Rails lets <a href="https://github.com/obie/claude-on-rails">claude-on-rails</a> work effectively out-of-the-box. And since there’s multiple instances of Claude running, you have less time for coffee or cooking.</p>

<p>And installing <a href="https://github.com/obie/claude-on-rails">claude-on-rails</a> is simple. Add it to your Gemfile, run <code class="language-plaintext highlighter-rouge">bundle</code>, and set it up with <code class="language-plaintext highlighter-rouge">rails generate claude_on_rails:swarm</code>.</p>

<p>In the past I’ve worried that LLM-powered coding agents will lock in certain frameworks and tools. The amount of Python content in each model’s pre-training data and post-training tuning appeared an insurmountable advantage. How could a new web framework compete with React when every coding agent knows the React APIs by heart?</p>

<p>But with significant harnesses, like <a href="https://github.com/obie/claude-on-rails">claude-on-rails</a>, the playing field can get pretty even. I hope we see similar swarm projects for other frameworks, like Django, Next.js, or iOS.</p>

<hr />

<p>The conversation around AI-assisted coding has focused on accuracy benchmarks. But <em>speed</em> — and what speed enables — will soon take center stage. Being able to chat without waiting or spin up multi-agent swarms will unlock a new era of coding with AI. One with a more natural cadence, where code arrives almost as fast as thought.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="coding" /><category term="swarms" /><category term="dx" /><summary type="html"><![CDATA[When coding models are faster and concurrent, the distance between thought and code shrinks. You can preview this future by trying out Qwen 3 Coder on Cerebras or claude-on-rails.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/fast_tempo.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/fast_tempo.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Enterprise AI Looks Bleak, But Employee AI Looks Bright</title><link href="https://www.dbreunig.com/2025/09/15/ai-adoption-at-work-play.html" rel="alternate" type="text/html" title="Enterprise AI Looks Bleak, But Employee AI Looks Bright" /><published>2025-09-15T10:24:00-07:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/09/15/ai-adoption-at-work-play</id><content type="html" xml:base="https://www.dbreunig.com/2025/09/15/ai-adoption-at-work-play.html"><![CDATA[<h3 id="about-that-mit-report">About that MIT report…</h3>

<p>Last month, the internet was abuzz about an MIT report with a dramatic headline: “<em>95% of generative AI pilots at companies are failing</em>.”</p>

<p>Fortune <a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/">had the exclusive</a>, and paywalled the write up. The report itself, published by MIT’s NANDA<sup id="fnref:nanda"><a href="#fn:nanda" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>, <a href="https://docs.google.com/forms/d/e/1FAIpQLSc8rU8OpQWU44gYDeZyINUZjBFwu--1uTbxixK_PRSVrfaH8Q/viewform">could only be accessed by filling out a Google Form</a>. I don’t think many people actually read the report, but the headline was enough. Here’s what happened the next day:</p>

<blockquote>
  <p>Shares of megacap tech and big-name chipmakers declined. Nvidia shares lost 3.5%, while Advanced Micro Devices and Broadcom slipped 5.4% and 3.6%, respectively. Shares of high-flying software stock Palantir dropped more than 9%, making it the S&amp;P 500′s worst performer. Other major tech-related names such as Tesla, Meta Platforms, and Netflix were also under pressure.</p>
</blockquote>

<p>Since then, many have <a href="https://www.futuriom.com/articles/news/why-we-dont-believe-mit-nandas-werid-ai-study/2025/08">criticized the methodology and conclusions of the report</a>. Too few executives were surveyed, those that were didn’t represent the entire market, and the report (on the whole) reads as an advertisement for NANDA’s mission rather than a peer-reviewed research paper (because it’s not).</p>

<p>Someone could probably start a pretty good investment fund that just <em>reads the papers</em> behind the headlines that move the market.</p>

<hr />

<p>You can read the actual report <a href="https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf">here</a>, without filling out any Google Forms. It’s worth skimming, as there are a few datapoints more interesting than the headline claim.</p>

<p>From those, I want to highlight these two figures (emphasis mine):</p>

<p><img src="/img/nanda_chart.jpg" alt="" /></p>

<p>For all the criticism of the NANDA report, it <em>is</em> a survey of many business leaders. We can treat it as such. So while we might take that 95% figure with a grain of salt, we <em>can</em> trust that business leaders <em>believe</em> the biggest reason their AI pilots are failing is because their employees are unwilling to adopt new tools… While 90% of employees surveyed eagerly use AI tools they procure themselves.</p>

<p>A Simpson’s classic comes to mind:</p>

<p><img src="/img/skinner_children.jpg" alt="&quot;Am I out of touch? No, it's the children who are wrong.&quot; – Seymore Skinner" /></p>

<p>The subject of employees using their own ChatGPT or Claude accounts at work has been heavily discussed for years. It’s frequently referred to as the “Shadow AI Economy,” and is a source of anxiety for IT leaders and inside counsel.</p>

<p>Just this week, OpenAI <a href="https://www.nber.org/papers/w34255">published a paper on ChatGPT usage</a> that validates this specter:</p>

<p><img src="/img/chatgpt_work_usage.jpg" alt="" /></p>

<p>OpenAI’s report is excellent and provides a rare look at how people use ChatGPT<sup id="fnref:claude"><a href="#fn:claude" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>: ~80% of usage is for learning, searching, and writing. Often to help them perform their work!</p>

<p>Thinking about the two plots above, I am reminded of the iPhone’s arrival in the enterprise. When the iPhone arrived, it was not seen as a work device. IT organizations continued to provide BlackBerrys, with their IT-controlled <a href="https://en.wikipedia.org/wiki/Good_Technology">email and messaging</a>. Nearly all IT teams didn’t think this would change. More than once, I heard IT managers reply to iPhone support requests with, “Just wait for the <a href="https://en.wikipedia.org/wiki/BlackBerry_Storm">BlackBerry Storm</a>.”</p>

<p>But you know who loved iPhones? The C-suite. And they asked their IT leaders to support the device. IT caved, “<a href="https://www.infoworld.com/article/2297842/byod-you-ain-t-seen-nothing-yet-2.html">Bring Your Own Device</a>” became a thing, and four years later <a href="https://www.nytimes.com/2011/11/16/technology/businesses-too-have-eyes-for-ipads-and-iphones.html">Apple was an option in the enterprise</a>.</p>

<p>Which brings us back to the charts above: employees are using ChatGPT while managers grumble that their AI projects aren’t adopted. If I had to guess, I’d wager there are a few things going on:</p>

<ol>
  <li><strong>Most companies adopt AI products slowly, bottlenecked by legal and security.</strong> There’s a reason you see Llama 3.1 continue to show up in McKinsey surveys: once teams win approval to use a model, they are loath to go back to compliance to seek an upgrade. New models emerge monthly, but security reviews take <em>many months</em>. This applies to AI applications as well: if a company buys one and employees tell them it’s not great, no one’s eager to take on legal again.</li>
  <li><strong>Bundle deals are poor substitutes for great chatbots.</strong> I’ve heard from many friends that their workplace-provided chatbot were selected for security and trust reasons (think Microsoft Copilot and others). Rather than wrestle with bad UX or bad answers, these people opt for BYOAI (bring your own AI), IT concerns be damned.</li>
  <li><strong>It’s hard to separate personal from business use.</strong> This is a classic IT problem: when people can’t be bothered to switch accounts before asking a question. We see it with email, browsing, and more. Savvier users quarantine accounts in separate browsers, but most people just use what’s there.</li>
</ol>

<p>The topic deserves further study – I don’t think this will be as easy as the iPhone and BYOD was. But I do think the dominant bottleneck here is IT and compliance. If enterprises don’t stand up continual review processes, they’ll be doomed to be stuck with last year’s tools and models… Then wonder why no one is adopting their AI.</p>

<p>Until then: employees will continue to opt for BYOAI.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:nanda">
      <p>NANDA stands for, “Networked AI Agents in Decentralized Archtecture.” <a href="#fnref:nanda" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:claude">
      <p>To Anthropic’s credit, they’ve <a href="https://www.anthropic.com/economic-index">already published several usage reports</a>. <a href="#fnref:claude" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="enterprise" /><category term="byoai" /><category term="studies" /><summary type="html"><![CDATA[Read beyond the headline of the recent MIT AI report and an interesting story emerges: executives chalk up AI failures to lack of employee adoption, while employees bring-their-own-AI to work.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/nanda_chart.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/nanda_chart.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">AI Companies School Like Fish</title><link href="https://www.dbreunig.com/2025/09/13/the-ai-product-cycle.html" rel="alternate" type="text/html" title="AI Companies School Like Fish" /><published>2025-09-13T14:55:00-07:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/09/13/the-ai-product-cycle</id><content type="html" xml:base="https://www.dbreunig.com/2025/09/13/the-ai-product-cycle.html"><![CDATA[<p><img src="/img/school_of_fish.jpg" alt="" /></p>

<h3 id="a-blue-ocean-turns-red-in-18-months">A Blue Ocean Turns Red in &lt;18 Months</h3>

<p>If we look at the ecosystem of AI-powered products, there’s a clear pattern of how they emerge and roll out to the world:</p>

<ol>
  <li><strong>Initial POC:</strong> Someone throws together a software demo – not a robust product ready for public consumption – proving a capability. Often this comes from an open source developer, academic researcher, or an R&amp;D team at a larger company. The demo catches fire, hits the frontpage of Hacker News, and circulates through social media.</li>
  <li><strong>Open Experimentation:</strong> Open source devs and projects start to experiment with the concept, adding support for the feature to their framework or shipping usable software. This is a Cambrian Era, when lots of variants hit Github and get kicked around.</li>
  <li><strong>Fast-Mover Launch:</strong> Eventually, a fast moving company brings a product to market. This could be a start up built around the core idea or it could be an existing organization that quickly adds the feature or product to their offering. For the first time, people are paying (or not paying, depending on if the demand is there and the demo works in production).</li>
  <li><strong>Incumbent Clone:</strong> Finally, large companies bring the product or feature to their offerings.</li>
</ol>

<p>For example, let’s look at <strong>text-to-SQL</strong>.</p>

<p>In 2022, prior to ChatGPT’s launch, Immanuel Trummer published <a href="https://itrummer.github.io/CodexDB/">CodexDB</a>, which translated natural language into SQL queries<sup id="fnref:codex"><a href="#fn:codex" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. As ChatGPT juiced the AI ecosystem, text-to-SQL became an early example of a business application. LangChain and others shipped components for building and enabling text-to-SQL use cases. In short order, all the large data platforms cloned the feature, including Tableau, Snowflake, and Databricks.</p>

<p>Usually, this cycle happens relatively fast, in less than 18 months.</p>

<p>How many times has it happened? I count at least 8:</p>

<ul>
  <li><strong>Text-to-SQL:</strong> See above.</li>
  <li><strong>Customer Service Bots:</strong> Chatbot interfaces to FAQs</li>
  <li><strong>Document Q&amp;A:</strong> Turnkey RAG applications with chatbot interfaces</li>
  <li><strong>Note Taking &amp; Summarization:</strong> Meeting transcription with extracted summaries and follow-ups.</li>
  <li><strong>Search:</strong> Perplexity-style search that uses LLMs to package information gathered from web queries.</li>
  <li><strong>Code Text Completion:</strong> Auto-suggest in IDEs as you edit code, powered by AI.</li>
  <li><strong>Coding Agents:</strong> Tools like Cursor and Claude Code that perform whole coding tasks for you.</li>
  <li><strong>Deep Research:</strong> Like search, but with longer depth, wider breadth, and more depth in pursuit of assembling a report.</li>
  <li><strong>Browser Control:</strong> Browsers driven by AI to accomplish tasks the user provides.</li>
</ul>

<p>I’m sure there’s some product archetypes we’ve missed.</p>

<p>What can we learn from this pattern and the way we’ve been steadily encountering new archetypes, then walking them through the process above?</p>

<ol>
  <li><strong>Ideas come from hackers, not customers.</strong> Few people know how to conceptualize products and cobble together unique applications with AI. This skill comes only through experience and play, and for these first few years most ideas come from the open source community in the form of demos. Not from designers, product managers, or feedback from customers. Applied AI ideas are hard, but execution is cheap. Which is a nice set up to our next take-away…</li>
  <li><strong>Cloning happens faster when the model is the magic.</strong> Cloning happened in previous eras, but nowhere near as fast. Start ups would create markets, prove their worth, and only <em>then</em> would larger companies invest in their own teams, projects, and (often) acquisition. Today, when so much of the lift comes from the model itself, there’s little reason to wait (especially when there are few other low-hanging ideas).</li>
  <li><strong>Applied AI start-ups need a niche.</strong> When big companies can enter the market in a matter of months, it’s more imperative than ever that start-ups focus on a niche. Google or OpenAI can clone your product, but they’re not nimble enough to invest in your outreach with a specific community and tailoring their product for a segment doesn’t make business sense. Most of the general-purpose RAG start-ups from 2023 have pivoted or failed, but those that focused on one sector (legal, medical, insurance, financing, etc.) are <em>thriving</em>.</li>
  <li><strong>If you’re not niche, you better build a beachhead in &lt;12 months.</strong> If you insist on shipping a general purpose applied AI product, and think being early to market is an advantage…well…think again. Cloning moves so fast, you better have an incredble gameplan to pull off significant market acquisition in a handful of months – which will then fuel you through user feedback, training data, and more. But unless you launch with both an <em>incredible</em> marketing advantage and a killer product, you’ll face incredibly tough competition once the big players enter.</li>
</ol>

<p>The idea that fast-following occurs faster than ever, thanks to everyone having access to the same models, is related to the, “<a href="https://www.dbreunig.com/2025/05/27/will-the-model-eat-your-stack.html">Will the model eat your stack?</a>” problem we discussed earlier.</p>

<p>Considering both the rapid cloning problem and the speed of model advancements, I think every non-niche, applied AI start up needs to ask themselves two questions:</p>

<ol>
  <li><strong>If a better model arrives tomorrow, does your product get better or does your backend get simpler?</strong> If your product doesn’t get better, you need to rethink. A better model simplifying your backend (by reducing the complexity of your prompts, <a href="https://x.com/antonosika/status/1926719161935233139">your error handling</a>, your infra, etc.) makes your product easier to clone.</li>
  <li><strong>If you are early to market with this use case, what are you going to do in a handful of months that will fend off Google/OpenAI/whomever’s entry into your market?</strong> Cursor and Perplexity are the rare examples that have managed to grow fast enough to be able to fend off larger entrants. What are you going to do, if you can’t go niche, to prepare your defenses?</li>
</ol>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:codex">
      <p>CodexDB used OpenAI’s <a href="https://en.wikipedia.org/wiki/OpenAI_Codex">Codex model</a>, published in 2021. This is <em>not</em> their <a href="https://openai.com/codex/">coding tool named Codex</a>. <a href="#fnref:codex" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="product" /><category term="competition" /><summary type="html"><![CDATA[Markets for new AI use cases turn from blue oceans to red oceans in under 18 months. Start-ups need to go niche or have one helluva plan.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/school_of_fish.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/school_of_fish.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Can Chatbots Accomodate Advertising?</title><link href="https://www.dbreunig.com/2025/09/02/considering-ad-models-for-ai-products.html" rel="alternate" type="text/html" title="Can Chatbots Accomodate Advertising?" /><published>2025-09-02T15:21:00-07:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/09/02/considering-ad-models-for-ai-products</id><content type="html" xml:base="https://www.dbreunig.com/2025/09/02/considering-ad-models-for-ai-products.html"><![CDATA[<p><img src="/img/chatgpt_ad.jpg" alt="" /></p>

<h3 id="if-we-use-ai-to-make-decisions-for-us-where-do-ads-fit-in">If we use AI to make decisions for us, where do ads fit in?</h3>

<p>Building frontier AI models is expensive. As is serving them to hundreds of millions of customers. So far, a small percentage of users are paying $20 a month to use them; back of the envelope math suggests ~5% of ChatGPT’s ~700 million users are doing so today (8% on the high-end, 3% on the low).</p>

<p>Nick Turley, the person in charge of ChatGPT, <a href="https://www.theverge.com/decoder-podcast-with-nilay-patel/758873/chatgpt-nick-turley-openai-ai-gpt-5-interview">was recently interviewed on Decoder</a>, where he said:</p>

<blockquote>
  <p>We will build other products, and those other products can have different dimensions to them, and maybe ChatGPT just isn’t an ads-y product because it’s just <strong>so deeply accountable to your goals</strong>. But it doesn’t mean that we wouldn’t build other things in the future, too. I think it’s good to preserve optionality, but I also really do want to emphasize how incredible the subscription model is, how fast it’s growing, and how untapped a lot of the opportunities are.</p>
</blockquote>

<p>Emphasis mine. I want to zoom in on that bit, that ChatGPT isn’t “ads-y” because it’s “so deeply accountable to your goals.”</p>

<p>I’ve been thinking about this tension for over a year.</p>

<hr />

<h3 id="ai-will-disrupt-the-attention-economy">AI Will Disrupt the Attention Economy</h3>

<p>AI, and I felt this during the deep learning era as well, is an important bit of technology because it allows you to <em>project your decisions</em>.</p>

<p>Gunpowder changed the nature of fighting and war because it allowed combatants to <em>project their force</em>, magnitudes farther than a spear or sword allows. The printing press, telegraph, and the internet changed the world because they allowed people to <em>project their communication</em> beyond their audible reach. AI, née deep learning, allows you to encode your decisions (not all of them, but many) into portable packages of perception and discernment that can sort through mountains of content in moments.</p>

<p>This <em>decision projection</em> will change our information ecosystem. Our digital and media economy is a zero-sum battle to earn and sell your attention. With <em>decision projection</em> our attention is effectively limitless<sup id="fnref:meta"><a href="#fn:meta" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>.</p>

<p>Given most advertising is sold in units of attention, this presents a challenge.</p>

<hr />

<h3 id="search-ads-work-because-search-presents-options">Search Ads Work Because Search Presents Options</h3>

<p><a href="https://en.wikipedia.org/wiki/Google_Ads">Google Adwords</a> (now just “Google Ads”) is perhaps the best ad model for a given product, ever.</p>

<p>When someone searches, a real-time auction begins. Eligible ads bid for the given query, with the winner paying the next best competitor’s bid. The winner’s ad would appear <em>similar</em> to a search result, among the search results. Users perused the search results, including the ad, and would <em>select</em> a link to click.</p>

<p>Today, Google handles ~90% of all searches.</p>

<p>Google Adwords is perfect because:</p>

<ol>
  <li>Users state what they’re looking for</li>
  <li>Interested parties compete for that bid, yielding relevant ads</li>
  <li>Users <em>select</em> their result from a range of options</li>
</ol>

<p>That <em>selection</em> is key. Google puts options on the page, ads included, and the user decides.</p>

<p>But there is one way to keep Google from serving you an ad. Start your search from <a href="https://www.google.com/">the Google homepage</a>, not your browser’s address bar, and instead of hitting “Search”, click “I’m Feeling Lucky.” Google will skip the results, the ads, the selection, and take you directly to the first result. You’ve ceded the selection decision to Google, hence no ads are shown.</p>

<p>“I’m Feeling Lucky,” is an anachronism. While writing this, I was surprised to see it’s still there. Initially, it was a bit of swagger, confidence manifested as UI. “We are so good at web search,” Google seemed to say, “you can skip the results.” Few ever used it, and dramatically fewer use it today, but oddly it presaged a pattern picked up by ChatGPT.</p>

<hr />

<h3 id="chatbots-have-few-good-ad-choices">Chatbots Have Few Good Ad Choices</h3>

<p>ChatGPT – and Claude, Gemini, DeepSeek, and all other chatbots – don’t deliver a set of options to peruse, they deliver <em>answers</em>. As Turley says, they are “deeply accountable to your goals.”</p>

<p>Unlike search, there is no obvious play to insert ads. And the options that do exist feel either bolted-on or undermine the chatbot’s core function. These options include:</p>

<ol>
  <li><strong>Display Ads:</strong> Advertising placed in or around the response. These could be text or images. This is the dominant ad model for web pages, and not integrated into the content.</li>
  <li><strong>Text Integrated Ads:</strong> Advertising integrated into the text response. The chatbot would search for or be provided relevant product information that would inform the response. The integrated ad would be noted as an ad, but otherwise naturally integrated into the reply.</li>
  <li><strong>Widget Integrated Ads:</strong> In responses, product listings could be broken out in rich, carousels. <a href="https://help.openai.com/en/articles/11128490-improved-shopping-results-from-chatgpt-search#h_cf4ef61daa">OpenAI is experimenting with this format</a>, Perplexity kind of does this, and Google already presents a carousel of <em>only</em> sponsored options atop your search.</li>
  <li><strong>Interstitial Ads:</strong> Advertising that is presented in between user interactions. An ad could be displayed for a short time after you submit your query, before you see your result.</li>
  <li><strong>Sponsored Prompts:</strong> Advertisers could sponsor suggested prompts, either on the landing page (as a suggested query, “Explore sandwich ideas with Kraft”) or as a suggested follow up after a response has arrived (“Would you like to learn more about <em>Product X</em>)?</li>
</ol>

<p>Off the bat, we can remove <strong>display ads</strong> as an option. To build an ad product that delivers value at a scale similar to their product, ChatGPT cannot adopt standard ad units and ad targeting. Display ads would be valued the same way display ads on the New York Times or stray blogs are (in terms of page views and clicks), undercutting the special nature of ChatGPT. Adopting display ads devalues their product, creates bad incentives, and won’t generate the returns needed to support OpenAI’s goals. For a deep dive on why this is, <a href="https://www.dbreunig.com/2022/06/06/why-media-metrics-matter.html">read my explanation of why media metrics matter</a></p>

<p><strong>Interstitial ads</strong>, though a natural fit for slow reasoning models, is likely an imperfect fit for the reasons display ads fail. They’re bolted on, not tied to the core query, and outside of the main user flow.</p>

<p><strong>Text integration ads</strong> hit on the tension Turly describes: ChatGPT is “deeply accountable to our goals,” so taking time to <em>not</em> deliver a single answer to our question, given the context, undermines its core function. <a href="https://www.theverge.com/decoder-podcast-with-nilay-patel/758873/chatgpt-nick-turley-openai-ai-gpt-5-interview">Turly elaborates</a>:</p>

<blockquote>
  <p>If we ever [added advertising to ChatGPT] I’d want to be very, very careful and deliberate because I really think that the thing that makes ChatGPT magical is the fact that you get the best answer for you and there’s no other stakeholder in the middle. It’s personalized as to your needs and tastes, etc. But we’re not trying to upsell you on something like that or to boost some pay-to-play provider or product. And maybe there are ways of doing ads that preserve that and that preserve the incentive structure, but I think that would be a novel concept and we’d have to be very deliberate.</p>
</blockquote>

<p>OpenAI and others could try to identify when users <em>are</em> asking for options and use these moments to serve ads. This brings us to <strong>widget ads</strong>. In April, <a href="https://www.wired.com/story/openai-adds-shopping-to-chatgpt/">OpenAI announced the addition of product carousels to their search mode</a>, similar to Google.</p>

<p><img src="/img/chatgpt_search_shopping.jpg" alt="" /></p>

<p>Ads naturally fit in this interface, as it presents a selection. But, for now, this functionality is hidden in ChatGPT’s <a href="https://www.wired.com/story/chatgpt-ai-search-update-openai/">search mode</a>…which itself is hidden (hit the “+” button, select “More”, select “Web Search”). They are clearly being cautious. One gets the feeling search mode is a place to explore these tricky questions without spoiling the core ChatGPT experience.</p>

<p>Thinking through widget ads, you end up landing on <a href="https://en.wikipedia.org/wiki/Affiliate_marketing"><strong>affiliate marketing</strong></a>, or <strong>affiliate links</strong>. Affiliate marketing is when advertisers pay people or companies a commission for leads or sales they generate. This is big business, though smaller than traditional advertising.</p>

<p>And yes, Turly says, OpenAI is thinking about affiliate marketing:</p>

<blockquote>
  <p>There is actually something that is neither ads nor subscriptions, which is if people buy things in your product after you very independently serve the recommendation. Wirecutter famously does this with expert-selected products.</p>
</blockquote>

<blockquote>
  <p>But then if you buy them through a product like ChatGPT, you could take a cut. That is something we are exploring with our merchant partners. I don’t know if it’s the right model, I don’t even know if it’s the right user experience yet, but I’m really excited about it because it might be a way of preserving the magic of ChatGPT while figuring out a way to make merchants really successful and build a sustainable business.</p>
</blockquote>

<p>Affiliate marketing, and <a href="https://brooksreview.net/2023/09/demise-of-the-wirecutter-and-trusted-reviews-in-general/">the question of whether it consciously or unconsciously influences recommendations</a>, is a fraught topic. We have a hard enough time determining if it affects human reviewers; trying to understand if it affects AI reviewers is another question entirely.</p>

<p>If I were at OpenAI, I would argue strongly against generating affiliate revenue from in-response recommendations if only because it could function as an explanation for <em>why ChatGPT’s results aren’t good</em>. One challenge facing chatbot products is that they are <em>black boxes</em>. How they arrive at their results is largely hidden (with the exception of reasoning chains), and even among researchers at top labs <a href="https://www.anthropic.com/research/tracing-thoughts-language-model">can’t explain <em>why</em> an LLM returns a specific result</a>. This black box nature leaves the door open for users to come up with their own explanations, factual or not, that can take on a life of their own. Adding a visible incentive – affiliate revenue – introduces an easy reason why one chatbot has worse <em>vibes</em> than another. And often, that’s enough to cause real damage.</p>

<p>Further, I have questions about if it’s even <em>technically</em> possible to implement affiliate marketing <em>without</em> influencing the results. If you provide your chatbot with a well designed, tested, and maintained tool for obtaining product specs and features (let’s call these <strong>ad prompts</strong>), this set of product information will be easier to obtain and consume than an inconsistent or unruly webpage. Simply providing an <strong>ad prompt</strong> will almost certainly increase your likelihood of recommendation due to <a href="https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html">the nature of contexts</a><sup id="fnref:try"><a href="#fn:try" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>.</p>

<p>I will be shocked if ChatGPT is the first to pull the trigger on affiliate recommendations. I think they <em>can</em> work, provided they are contextualized within a larger array of options and limited to an “ad slot” amongst the array. But monetized product recommendations integrated into text answers will undermine the core service ChatGPT provides.</p>

<p>If I were forced to pick an ad format for ChatGPT, today, I would pick <strong>sponsored prompts</strong>. I believe it’s the best, worst option of the formats identified above. It’s relevant to the core chatbot user interaction, isn’t a bolted-on distraction like intersticials and display ads, yet doesn’t influence the response ChatGPT generates. ChatGPT’s stock conclusion to it’s answers, suggestions to users about next step for them to pursue (“Would you like to learn more about X?”), could be broken out of the text response itself. Below the text response there would be a couple buttons representing these entreaties, one of which could be sponsored.</p>

<p>This is where I’d start, but it’s not ideal.</p>

<hr />

<h3 id="ai-disrupts-advertising-foundations">AI Disrupts Advertising Foundations</h3>

<p>Ads are designed to influence or perceptions and ultimately our decisions. But as we outsource more decisions to AI tools, and those tools become better at <em>projecting</em> our decisions and discernment…where does that leave advertising? Will the task of advertising be split between appealing to us and appealing to our agents? Are these jobs the same or different?</p>

<p>It’s hard to say at the moment, and I don’t think we’ll get an answer from anyone for a bit. The big labs are blitzscaling and there’s no shortage of funding to pay the bills. The goal is marketshare, and no one wants to be the first tool to compromise their product. But this can’t go on forever; an ad model will emerge. Let’s just hope it fits the chatbot product.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:meta">
      <p>By the way, I suspect this is the reason Meta is spending so aggresively when it comes to AI. If they have a unifying strategy throughout their existence it’s earning and selling attention. Their king KPI is “share of timespent”, aka how much of your waking hours is spent staring at Meta products. 98% of their revenue is from advertising, selling this attention. If AI turns attention from a zero-sum game into, well, <em>anything else</em>, it’s an existential event for Meta. <a href="#fnref:meta" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:try">
      <p>I tried this myself this week, scraping the product pages from a few bicycle manufactorers and rephrased their content as ad prompt markdown files (<a href="https://gist.github.com/dbreunig/b72fec2b5d6d59db8ed9c30a235de098">here’s one example</a>). I staged these documents behind an MCP armed with simple vector and text search (another great use case for <a href="https://www.trychroma.com/">Chroma</a>), and wired it up to Claude with intructions to both browse the web and use the affiliate tool to assemble recommended products for my queries. Over and over, the affiliate listings would be richer, more descriptive, and would appear more often. I suspect this is because the data had been prepped, and that ease delivered better results. <a href="#fnref:try" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="ai" /><category term="chatbots" /><category term="advertising" /><category term="product" /><category term="design" /><summary type="html"><![CDATA[How do you add advertising to a tool used to make decisions?]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/chatgpt_ad.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/chatgpt_ad.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Building Castles in the Air, but With Surprise Physics</title><link href="https://www.dbreunig.com/2025/08/21/castles-in-the-air-with-unknown-physics.html" rel="alternate" type="text/html" title="Building Castles in the Air, but With Surprise Physics" /><published>2025-08-21T15:36:00-07:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/08/21/castles-in-the-air-with-unknown-physics</id><content type="html" xml:base="https://www.dbreunig.com/2025/08/21/castles-in-the-air-with-unknown-physics.html"><![CDATA[<p><img src="/img/hugo_castle.jpg" alt="" /></p>

<p>In the software engineering classic, “<a href="https://amzn.to/4mLkyL3">The Mythical Man-Month</a>,” Frederick P. Brooks Jr. wrote:</p>

<blockquote>
  <p>The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.</p>
</blockquote>

<blockquote>
  <p>Yet the program construct, unlike the poet’s words, is real in the sense that in moves and works, producing visible outputs separate from the construct itself. It prints results, draws pictures, produces sounds, moves arms. The magic of myth and legend has come true in our time. One types the correct incantation on the keyboard, and a display screen comes to life, showing things that never were nor could be.</p>
</blockquote>

<p>I used to see this quote more often. It was frequently cited by developers during the period after the <a href="https://en.wikipedia.org/wiki/Dot-com_bubble">dot-com bust</a>, when the iPhone kicked off the smartphone boom and the internet and social media became <em>normal</em>. Suddenly, the real-world impact of programmers was everywhere, experienced by seemingly everyone.</p>

<p>In 2012, just prior to their IPO, Meta CTO <a href="https://x.com/boztank">Andrew Bosworth</a> had it printed on his business card<sup id="fnref:bozcard"><a href="#fn:bozcard" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>.</p>

<hr />

<h3 id="the-probabilistic-nature-of-building-atop-ai">The Probabilistic Nature of Building Atop AI</h3>

<p>During recent conversations with <a href="https://x.com/jeffreyhuber">Jeff Huber</a> and <a href="https://x.com/jxnlco">Jason Liu</a>, we touched on the <em>probabilistic</em> nature of building atop AI.</p>

<p>Randomness is built into LLMs (<a href="https://www.dbreunig.com/2025/04/18/the-wisdom-of-artificial-crowds.html#:~:text=Generating%20Diversity%20With%20High%20Temperatures">they even expose a parameter to tweak it</a>) and our agents, applications, and pipelines must account for the unexpected. This is different than the programming of the past decades. It’s a workflow more akin to that of data science, where you form hypotheses, design experiments, and rapidly iterate until you’re on (relatively) stable ground.</p>

<p>Or, <a href="https://x.com/jeffreyhuber/status/1958648713041305906">as Jeff put it</a>, “People who are good at AI are used to getting mugged by the fractal complexity at reality.”</p>

<p>Just yesterday, <a href="https://giansegato.com/essays/probabilistic-era">Gian Segato wrote an excellent piece exploring this exact shift</a>:</p>

<blockquote>
  <p>We are no longer guaranteed what x is going to be, and we’re no longer certain about the output of y either, because it’s now drawn from a distribution…</p>
</blockquote>

<blockquote>
  <p>Stop for a moment to realize what this means. When building on top of this technology, our products can now succeed in ways we’ve never even imagined, and fail in ways we never intended.</p>
</blockquote>

<blockquote>
  <p>This is incredibly new, not just for modern technology, but for human toolmaking itself. Any good engineer will know how the Internet works: we designed it! We know how packets of data move around, we know how bytes behave, even in uncertain environments like faulty connections. Any good aerospace engineer will tell you how to approach the moon with spaceships: we invented them! Knowledge is perfect, a cornerstone of the engineering discipline. If there’s a bug, there’s always a knowable reason: it’s just a matter of time to hunt it down and fix it.</p>
</blockquote>

<p><a href="https://giansegato.com/essays/probabilistic-era">You should grab a coffee and read the whole essay</a>.</p>

<p>As someone who would never call themselves an <em>engineer</em>, that last line felt true to me. AI development feels more akin to <em>science</em> (where we poke things and note how they work) than <em>engineering</em> (where we build structures with documented parameters).</p>

<p>But then a Hacker News user named “potatolicious” wrote <a href="https://news.ycombinator.com/item?id=44976929#44978319">this comment</a>, on a thread related to my <a href="https://www.dbreunig.com/2025/08/21/a-guide-to-ai-titles.html">AI job title guide</a>:</p>

<blockquote>
  <p>Most classical engineering fields deal with probabilistic system components all of the time. In fact I’d go as far as to say that inability to deal with probabilistic components is disqualifying from many engineering endeavors.</p>
</blockquote>

<blockquote>
  <p>Process engineers for example have to account for human error rates. On a given production line with humans in a loop, the operators will sometimes screw up. Designing systems to detect these errors (which are highly probabilistic!), mitigate them, and reduce the occurrence rates of such errors is a huge part of the job.</p>
</blockquote>

<blockquote>
  <p>Likewise even for regular mechanical engineers, there are probabilistic variances in manufacturing tolerances. Your specs are always given with confidence intervals (this metal sheet is 1mm thick +- 0.05mm) because of this. All of the designs you work on specifically account for this (hence safety margins!). The ways in which these probabilities combine and interact is a serious field of study.</p>
</blockquote>

<blockquote>
  <p>Software engineering is unlike traditional engineering disciplines in that for most of its lifetime it’s had the luxury of purely deterministic expectations. This is not true in nearly every other type of engineering.</p>
</blockquote>

<blockquote>
  <p>If anything the advent of ML has introduced this element to software, and the ability to actually work with probabilistic outcomes is what separates those who are serious about this stuff vs. demoware hot air blowers.</p>
</blockquote>

<p>I’ll be thinking about this for quite some time.</p>

<p><a href="https://x.com/lateinteraction/status/1958653011317669975">Omar Khattab</a> pointed out this isn’t <em>entirely</em> new:</p>

<blockquote>
  <p>Any software systems that made network requests had these properties. Honestly, any time you called a complex function based on its declared contract rather than based on understanding it procedurally you engaged in the kind of reasoning needed to build AI systems.</p>
</blockquote>

<p>This is true. But I argue that simulating network issues and designing concurrent systems is a step or two down from the variability of AI models. Further, these existing issues were never <em>the</em> dominant trend in the software engineering industry. Many developers just offloaded these challenges or avoided dealing with them.</p>

<p>Further, for each new app you need to understand the probabilistic fingerprint of that domain, for a given model. The uncertainty is a moving target, which has to be discovered every time.</p>

<hr />

<h3 id="updating-brooks-quote-for-applied-ai">Updating Brooks’ Quote for Applied AI</h3>

<p>Perhaps we should update the Fred Brooks quote, for those building atop AI: <strong>programmers still build castles in the air, but the first have to discover what physics apply</strong>.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:bozcard">
      <p>It may have been there earlier and/or later, that’s just when I saw it. <a href="#fnref:bozcard" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="AI" /><category term="Engineering" /><category term="Programming" /><summary type="html"><![CDATA[How does the probabilistic nature of AI change the nature of computer engineering?]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/hugo_castle.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/hugo_castle.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Making Sense of AI Job Titles</title><link href="https://www.dbreunig.com/2025/08/21/a-guide-to-ai-titles.html" rel="alternate" type="text/html" title="Making Sense of AI Job Titles" /><published>2025-08-21T12:01:00-07:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/08/21/a-guide-to-ai-titles</id><content type="html" xml:base="https://www.dbreunig.com/2025/08/21/a-guide-to-ai-titles.html"><![CDATA[<h3 id="a-cheat-sheet-for-job-titles-in-the-ai-ecosystem">A Cheat Sheet for Job Titles in the AI Ecosystem</h3>

<p>Even when you live and breathe AI, the job titles can feel like a moving target. I can only imagine how mystifying they must be to everyone else.</p>

<p>Because the field is actively evolving, <a href="https://www.dbreunig.com/2024/08/22/where-the-future-is-being-made.html">the language we use keeps changing</a>. Brand new titles appear overnight or, worse, one term means three different things at three different companies.</p>

<p>This is my best attempt at a “Cheat Sheet for AI Titles.” I’ll try to keep it updated as the jargon shifts, settles, or fades away. As always, <a href="https://www.dbreunig.com/contact.html">shoot me a note</a> with any additions, updates, thoughts, or feedback.</p>

<hr />

<h3 id="the-ai-job-title-decoder-ring">The AI Job Title Decoder Ring</h3>

<p>While collecting examples of titles from job listings, Twitter bios, and blogs, a pattern emerged: nearly all AI job titles are created by mixing-and-matching a handful of terms. Organizing the Post-Its on my wall, I was reminded of “mix-and-match” children’s books:</p>

<p><img src="/img/dino_mix_match.jpg" alt="A children's mix-and-match book with dinosaurs" /></p>

<p>If we swap out the dinosaur parts above with the adjectives and nouns from my collected examples, we get:</p>

<p><img src="/img/ai_job_title_mix_and_match.jpg" alt="" /></p>

<p>Sliding these columns up and down, we can assemble most AI job titles. (Though I have yet to see some combinations, like, “Applied AI Ops”.)</p>

<p>Let’s first break down the <strong>modifiers</strong>:</p>

<ul>
  <li><strong>Forward Deployed:</strong> People who work closely with customers, helping them develop new applications powered by their own company’s technologies. They learn their customer’s business, constraints, and goals, then translate that context directly into features, integrations, and working code.</li>
  <li><strong>Applied:</strong> People who conceive, design, support, and/or build products and features powered by AI models. The key here is that they are <em>applying</em> AI to a domain problem; they are not helping build the AI itself.</li>
</ul>

<p>There is plenty of overlap here: most <em>Forward Deployed</em> workers are working on <em>Applied</em> problems. They usually aren’t training new models with the customer.</p>

<p>The <strong>domain</strong> column is rather awkward, mostly for historical reasons.</p>

<p>The terms “ML” and “Gen AI” are subsets of the broader “AI” domain. “Gen AI” as a term only arrived <em>after</em> ChatGPT launched, as a way to distinguish the now-famous chatbots and image generation from <em>everything else</em> people with “AI” titles had been working on prior to September, 2022<sup id="fnref:genai"><a href="#fn:genai" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>.</p>

<p><img src="/img/gen_ai_trends.jpg" alt="According to Google Trends, the term &quot;Generative AI&quot; wasn't in use until after ChatGPT arrive." /></p>

<p>While initially coined to cordon off text and image generation applications, I think “Gen AI’s” utility is waning. LLMs are being used for non-generative applications – like categorization, information gathering, comparisons, data extraction, and more – that were traditionally the domain of what we used to call “machine learning” and “deep learning”<sup id="fnref:mldl"><a href="#fn:mldl" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>.</p>

<p>That said, when you see these <strong>domains</strong> in a title, here’s how you should interpret them:</p>

<ul>
  <li><strong>AI:</strong> A general, catch-all domain for people working in AI. Encompasses text processing, agent building, image generation systems, chatbots, LLM training, and so much more. This is the default for this field.</li>
  <li><strong>ML:</strong> ML signifies this role will be focused on training models – most likely <em>not</em> LLMs – for single-purpose tasks, that will be used as a function in a larger pipeline or app. Examples of these single-purpose tasks include recommendation systems, anomaly detection, predictive analytics, and data extraction or enrichment.</li>
  <li><strong>Gen AI:</strong> This domain signals that the role will involve working with text, image, audio, or video generation models. This role usually involves applications where the model output is directly consumed by the user. Examples of these applications include writing tools and image generators and editors.</li>
</ul>

<p>The <strong>suffixes</strong> are mostly self explanatory, with one exception: <em>researcher</em>.</p>

<p><img src="/img/elon_researcher_post.jpg" alt="Elon Musk wrote, &quot;Researcher is a relic term from academia.&quot;" /></p>

<p>I agree <a href="https://x.com/elonmusk/status/1950254103474446728">with the above take</a>.</p>

<p>Prior to ChatGPT, most people working on AI research and development were at universities. When private projects began standing up AI efforts, the terms “researcher” and “lab” were borrowed from academia. At first, this made sense: the work was exploratory and speculative, more akin to <a href="https://en.wikipedia.org/wiki/Big_science">big science</a> projects than product development. But as AI became a <em>product</em>, a <em>business</em>, the term “researcher” remains but is increasingly awkward.</p>

<p>“Researcher” is a title used inconsistently. I have met “researchers” with product OKRs and incentives tied to business goals. I have met “researchers” who are working on novel LLM architectures and “researchers” who are building applications atop existing models. I have met “researchers” who are doing, well, <em>research</em>: exploratory work where it’s okay if a hypothesis doesn’t pan out, so long as you’re learning. Tension behind the term is increasing, hence the Elon post above.</p>

<p>Adding to the confusion is you’ll often see the term “Scientist” in place of “Researcher”. As far as I can tell, based on job descriptions, these terms are largely interchangeable.</p>

<hr />

<h3 id="examples-of-ai-job-titles">Examples of AI Job Titles</h3>

<p>Below is a handful of illustrative, real-world job titles. This list is in no way exhaustive. The goal here is to demonstrate how the <strong>modifiers</strong>, <strong>domains</strong>, and <strong>roles</strong> are assembled so we can better decode titles when we encounter them in the wild.</p>

<div class="sidenote">

<h2>Example Titles</h2>

<ul>
    <li><a href="#ai-researcher">AI Researcher</a></li>
    <li><a href="#applied-ai-engineer">Applied AI Engineer</a></li>
    <li><a href="#applied-ai-solution-architect">Applied AI Solution Architect</a></li>
    <li><a href="#ai-forward-deployed-engineer">AI Forward Deployed Engineer</a></li>
    <li><a href="#ai-engineer">AI Engineer</a></li>
</ul>

</div>

<h4 id="ai-researcher">AI Researcher</h4>

<p>An AI Researcher forms hypotheses, designs and runs experiments to test their hypotheses, then shares their learnings (sometimes publically) in pursuit of advancing the development of AI models. Often, they’re involved in productizing their findings.</p>

<p>Perhaps the most discussed job title of late, thanks to <a href="https://www.nytimes.com/2025/07/31/technology/ai-researchers-nba-stars.html">Meta’s aggressive hiring</a>, leading to a <a href="https://trends.google.com/trends/explore?date=today%205-y&amp;geo=US&amp;q=AI%20researcher&amp;hl=en">surge of interest</a>.</p>

<p>Here’s <a href="https://web.archive.org/web/20250430105458/https://openai.com/careers/research-scientist/">job description for a “Research Scientist” from OpenAI</a>:</p>

<blockquote>
  <p>As a Research Scientist here, you will develop innovative machine learning techniques and advance the research agenda of the team you work on, while also collaborating with peers across the organization. We are looking for people who want to discover simple, generalizable ideas that work well even at large scale, and form part of a broader research vision that unifies the entire company.</p>
</blockquote>

<p>Requirements for the job include:</p>

<ul>
  <li>“Have a track record of coming up with new ideas or improving upon existing ideas in machine learning, demonstrated by accomplishments such as first author publications or projects.”</li>
  <li>“Possess the ability to own and pursue a research agenda, including choosing impactful research problems and autonomously carrying out long-running projects.”</li>
</ul>

<p>Interestingly, this job posting has been active, unchanged, <a href="https://web.archive.org/web/20230320023420/https://openai.com/careers/research-scientist/">since March of 2023</a>.</p>

<p>Sometimes you’ll see this role listed as a “<a href="https://web.archive.org/web/20250821165015/https://job-boards.greenhouse.io/deepmind/jobs/7135585">Research Scientist</a>.”</p>

<h4 id="applied-ai-engineer">Applied AI Engineer</h4>

<p>An Applied AI Engineer develops applications and features that utilize AI models.</p>

<p>Here’s <a href="https://web.archive.org/web/20250821181756/https://job-boards.greenhouse.io/deepmind/jobs/6705908">a job description for a Senior Applied AI Engineer from Google DeepMind</a>:</p>

<blockquote>
  <p>We are seeking a Senior Applied AI Engineer to lead the development and deployment of novel applications, leveraging Google’s generative AI models. This role focuses on rapidly developing new features, and working across partner teams to deliver solutions, and maximize impact for Google and top customers. You will be instrumental in translating cutting-edge AI research into real-world products, and demonstrating the capabilities of latest-generation models. We are looking for engineers with a strong track record of building and shipping AI-powered software, ideally with experience in early-stage environments where they have contributed to scaling products from initial concept to production. The ideal candidate will be motivated by the opportunity to drive product &amp; business impact.</p>
</blockquote>

<p>Note the focus on <em>applying</em> AI technology, not developing it. If we were to drop the “Applied” title, we might find an “AI Engineer” working on producing the models themselves.</p>

<h4 id="applied-ai-solution-architect">Applied AI Solution Architect</h4>

<p>Swapping out the <strong>role</strong> from “Engineer” to “Solution Architect” yields a predictable definition.</p>

<p>An Applied AI Solution Architect helps customers and potential customers design and ideate features and applications powered by AI models.</p>

<p>Here’s <a href="https://web.archive.org/web/20250702214721/https://job-boards.greenhouse.io/anthropic/jobs/4461444008">a recent job description from Anthropic</a>:</p>

<blockquote>
  <p>As an Applied AI team member at Anthropic, you will be a Pre-Sales architect focused on becoming a trusted technical advisor helping large enterprises understand the value of Claude and paint the vision on how they can successfully integrate and deploy Claude into their technology stack. You’ll combine your deep technical expertise with customer-facing skills to architect innovative LLM solutions that address complex business challenges while maintaining our high standards for safety and reliability.</p>
</blockquote>

<blockquote>
  <p>Working closely with our Sales, Product, and Engineering teams, you’ll guide customers from initial technical discovery through successful deployment. You’ll leverage your expertise to help customers understand Claude’s capabilities, develop evals, and design scalable architectures that maximize the value of our AI systems.</p>
</blockquote>

<p>If you successfully sell a client on a business case for a feature, you might call in our next role…</p>

<h4 id="ai-forward-deployed-engineer">AI Forward Deployed Engineer</h4>

<p>An AI Forward Deployed Engineer (FDE) is a professional services role that helps customers impliment AI-powered applications and featured.</p>

<p>After claiming <a href="https://a16z.com/ai-wedges-will-help-startups-outmaneuver-incumbents/">rapidly-iterating AI will companies will squeeze out incumbents like Salesforce</a>, a16z backtracked and <a href="https://a16z.com/services-led-growth/">heralded FDEs as critical roles needed for enterprise AI adoption</a>: “Enterprises buying AI are like your grandma getting an iPhone: they want to use it, but they need you to set it up.”</p>

<p>For the irony’s sake, here’s <a href="https://web.archive.org/web/20250821184027/https://careers.salesforce.com/en/jobs/jr307174/ai-forward-deployed-engineer-senior-leadprincipal/">a recent AI Forward Deployed Engineer role at Salesforce</a>:</p>

<blockquote>
  <p>We’re looking for a highly accomplished and senior-level Forward Deployed Engineer with 5+ years of experience to lead the charge on complex AI agentic deployments. This role demands a seasoned technologist and strategic partner who can not only design and develop bespoke solutions leveraging our Agentforce platform and other cutting-edge technologies but also lead technical engagements and mentor junior peers. You’ll be the primary driver of transformative AI solutions, operating with deep technical mastery, unparalleled problem-solving prowess, and a relentless focus on delivering tangible value in dynamic, real-world environments, from initial concept to successful deployment and ongoing optimization.</p>
</blockquote>

<blockquote>
  <p>As a Forward Deployed Engineer, you’ll be at the forefront of bringing cutting-edge AI solutions to our most strategic clients. This isn’t just about coding; <strong>it’s about deeply understanding our customers’ most complex problems</strong>, architecting sophisticated solutions, and leading the end-to-end technical delivery of innovative, impactful solutions that leverage our Agentforce platform and beyond.</p>
</blockquote>

<p>Emphasis mine. Rapidly acquiring domain expertise is key for this role.</p>

<p>We’ve recently written about <a href="https://www.dbreunig.com/2025/08/08/how-ai-coding-changes-product.html">Forward Deployed Engineers</a> – why they’re necessary and how they signal AI-assisted coding’s impact on product management.</p>

<h4 id="ai-engineer">AI Engineer</h4>

<p>Remove the “Forward Deployed” and we have a signficantly different job. Nailing this title down is difficult, it’s somehow <em>more</em> vague than even “Researcher” titles, ranging the gamut from “Applied” work to foundational model building. This squishiness is explored well by Latent Space, in a 2023 piece, “<a href="https://www.latent.space/p/ai-engineer">The Rise of the AI Engineer</a>.” They write:</p>

<blockquote>
  <p>I think software engineering will spawn a new subdiscipline, specializing in applications of AI and wielding the emerging stack effectively, just as “site reliability engineer”, “devops engineer”, “data engineer” and “analytics engineer” emerged.</p>
</blockquote>

<blockquote>
  <p>The emerging (and least cringe) version of this role seems to be: AI Engineer.</p>
</blockquote>

<blockquote>
  <p>Every startup I know of has some kind of #discuss-ai Slack channel. Those channels will turn from informal groups into formal teams, as Amplitude, Replit and Notion have done. The thousands of Software Engineers working on productionizing AI APIs and OSS models, whether on company time or on nights and weekends, in corporate Slacks or indie Discords, will professionalize and converge on a title - the AI Engineer.</p>
</blockquote>

<p>The entire piece is worth a read, though with the advantage of hindsight, their definition of “AI Engineering” seems very broad. As defined in their post, everything besides “Research”, “Product Manager”, and “Solution Architect” could fit within their definition.</p>

<p>The emergence of the “Applied” modifier has tightened this domain and is being leaned on more. I suspect “AI Engineering” will persist as a big-tent term for conferences and communities, but “Applied” roles will be the corporate title.</p>

<p>Try to search for “AI Engineering” titles and you’ll find jobs that are “Applied” roles; roles that build apps atop AI models, not build the models themselves. At the big labs, “AI Engineering” titles don’t exist on their career pages. For them, “Engineering” roles are specific to a domain, like performance, tokenization, infrastructure, or inference.</p>

<hr />

<p>If you run into any interesting titles that make or break the decoder ring above, please do <a href="https://www.dbreunig.com/contact.html">share them with me</a>. As novel ones float by, I may grab them and update the examples above.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>

<style>
h4 {
  font-weight: 700;
  padding-top: 4rem;
}
</style>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:genai">
      <p>The term “Generative AI” is a pet peeve of mine. A weird theory of mine is that the term was coined by people running “AI” departments in large companies and consultancies. Upon seeing ChatGPT, their bosses or customers suddenly remembered they <em>had</em> people working on “AI” and promptly called them up, asking why they hadn’t made anything like ChatGPT. “AI is a big domain!” I imagine the AI departments replied. “ChatGPT is actually a subfield of AI called <em>generative AI</em>. We, too, can work on that if you want.” <a href="#fnref:genai" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:mldl">
      <p>A decade ago, the terms “machine learning” and “deep learning” were inconsistently used. When writing about a topic that applied to both, we’d all lean on “ML/DL” or similar composites to fend off the pedants in the comments section. <a href="https://www.dbreunig.com/2016/06/23/the-business-implications-of-machine-learning.html">Or just include notes about usage up front</a>. <a href="#fnref:mldl" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="AI" /><category term="Organizational Structure" /><category term="Job Titles" /><category term="Language" /><summary type="html"><![CDATA[AI job titles are a confusing mess of mix-and-match terms. This decoder ring breaks down the patterns behind titles.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/ai_job_title_mix_and_match.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/ai_job_title_mix_and_match.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Bottleneck or Bisect: AI-Assisted Coding Will Change Product Management</title><link href="https://www.dbreunig.com/2025/08/08/how-ai-coding-changes-product.html" rel="alternate" type="text/html" title="Bottleneck or Bisect: AI-Assisted Coding Will Change Product Management" /><published>2025-08-08T09:13:00-07:00</published><updated>2026-02-25T15:11:53-08:00</updated><id>https://www.dbreunig.com/2025/08/08/how-ai-coding-changes-product</id><content type="html" xml:base="https://www.dbreunig.com/2025/08/08/how-ai-coding-changes-product.html"><![CDATA[<h3 id="product-management-will-be-split-between-slow-platform-and-fast-app-modes">Product Management Will Be Split Between ‘Slow Platform’ and ‘Fast App’ Modes</h3>

<p><img src="/img/bisect_head.jpg" alt="Head from a Statue of King Amenhotep I, via The Met" /></p>

<p>When <a href="https://www.theinformation.com/articles/openai-takes-page-palantir-doubles-consulting-services">OpenAI announced they were building a consulting service</a> staffed with “forward deployed engineers” — a term Palantir popularized<sup id="fnref:fde"><a href="#fn:fde" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> — the AI ecosystem took notice.</p>

<p>The FDE trend is a symptom with two underlying causes.</p>

<p>First, onsite dev is an expected capacity for companies selling new categories, where you have to teach your customers <em>how</em> to use your product <em>after</em> they buy it (ideally, before the contract runs out).</p>

<p>Second, and more significant for the broader software industry, FDEs represent a workaround for a growing problem. AI-assisted engineers can code 2-5x faster, but product management work hasn’t accelerated at the same pace. Rather than wait for traditional PM processes, organizations are empowering hybrid engineer-PMs to build directly with customers. To maintain relevance and continue to help their companies ship stable, safe, successful products, product managers and organizational structures need to adapt.</p>

<p>Let’s look at both these causes, in order.</p>

<hr />

<h3 id="1-businesses-building-new-categories-have-to-teach">1. Businesses Building New Categories Have to Teach</h3>

<p>Every product falls into one of two categories: those that sell into existing budget lines and those that must create new budget lines.</p>

<p>If you’re a startup with a product in the former category, your playbook is <em>relatively</em> simple. Your product must outperform the incumbent – with better features, services and/or reduced cost – while your go-to-market team needs to deploy marketing and sales tactics to break through the noise and close deals.</p>

<p>If your product falls into the latter category, your task is much trickier. No budget line exists for your product, which means <em>one has to be created</em>. You need to cultivate a champion at your prospective customer, someone who is going to either do the hard work of justifying the new budget (if you’re going bottom-up) or an executive who is going to add the budget line by edict (if you’re going top-down). When I was at PlaceIQ, we utilized both approaches.</p>

<p>To find your champions, you need to market your new category, defining and proving its value.</p>

<p>And the hard work doesn’t end after you create a budget line and close the deal. Now you have to <em>teach</em> your customer how to use your product. For simpler products, this can look like textbook customer success services. But for more complicated products, especially those customers use to <em>build</em> something with (like AI APIs), this looks like high-touch consulting.</p>

<p>If this weren’t hard enough, there’s a time limit for success. You gotta help them build value <em>before</em> the contract runs out.</p>

<p>OpenAI’s enterprise products are clearly in the latter of our two buckets. They, and all enterprise AI vendors, can’t count on their customers to build innovative applications with AI. They’re going to have to help them. Which is why there’s plenty of <a href="https://www.linkedin.com/jobs/search-results/?keywords=Forward%20Deployed%20Engineer">new “Forward Deployed Engineer” openings</a>.</p>

<p><img src="/img/fde_trend.jpg" alt="Google Trends for 'Forward Deployed Engineer' are skyrocketing." /></p>

<hr />

<h3 id="2-ai-reshapes-product-management-because-it-speeds-up-development-iteration">2. AI Reshapes Product Management Because it Speeds Up Development Iteration</h3>

<p>To understand how AI redefines the product management role, let’s first look at how AI-assisted coding is changing our development cycles.</p>

<p>Of course: standard caveats apply. AI-assisted coding is unevenly distributed. Many companies are slow to adopt these new tools. Many engineers haven’t had the time or motivation to explore and learn new AI development patterns. Models are great at some languages and tasks, and bad at others.</p>

<p>Yet: every startup I’ve met that has built products in the agentic era demonstrate the following:</p>

<ol>
  <li><strong>AI makes coding faster.</strong> My anecdotal experience and observations align with <a href="https://simonwillison.net/2025/Aug/6/not-10x/">Simon’s</a>, “I’ve estimated that LLMs make me 2-5x more productive on the parts of my job which involve typing code into a computer, which is itself a small portion of that I do as a software engineer.”
    <ul>
      <li><strong>AI makes prototyping faster.</strong> The product managers and engineers I know who’ve embraced AI will quickly develop frontend demos rather than multipage specs. As we learned during the Agile era, giving collaborators, customers, and users something to react to results in better and more efficient feedback.</li>
      <li><strong>Iteration is faster.</strong> This is a result of the previous two bullets. <a href="https://blog.codinghorror.com/boyds-law-of-iteration/">Faster iteration builds better products</a>. <a href="https://x.com/sama/status/1046933639726956544?lang=en">Teams that ship more win</a></li>
    </ul>
  </li>
</ol>

<p>When you ship more, you have more opportunities to learn. Your rate of improvement increases and <a href="https://stripe.com/guides/indexing-the-ai-economy">you accelerate away</a>.</p>

<p>But it’s the <em>coding</em> that is driving this speed. Everything else is seeing much smaller, if any, acceleration from AI.</p>

<p>At a recent Y Combinator <em>Startup School</em>, <a href="https://www.youtube.com/watch?v=RNJCfif1dPY&amp;t=1123s">Andrew Ng put his finger on this dynamic</a>:</p>

<blockquote>
  <p>While engineers are becoming much faster, I don’t see product management work – designing what to build – becoming faster at the same speed as engineers. I’m seeing [the product management to engineering] ratio shift.</p>
</blockquote>

<blockquote>
  <p>Literally yesterday, one of my teams came to me, when we’re planning headcount. This team proposed to me not to have 1 PM to 4 engineers but to have 1 PM to 0.5 engineers. I still don’t know if this proposal is a good idea, but it’s a sign of where the world is going. And I find that PMs that can code or engineers with some product instincts often end up doing better.</p>
</blockquote>

<p>With the pace of coding accelerating, <em>product has become the new bottleneck</em>.</p>

<p>Teams are now chasing hybrids – product managers who can code and engineers with product instincts. The explosion in Forward Deployed Engineer roles demonstrates this. FDEs are essentially product-minded engineers – hybrids who can both build and understand customer problems. Their rapid emergence isn’t just about teaching new categories; it’s early proof that organizations are gravitating toward these dual-skilled roles.</p>

<p>While we’re seeing FDEs emerge at AI labs, this shift will affect every software company leveraging AI-assisted coding – whether they’re building AI products or traditional applications with AI-enhanced development cycles. The catalyst isn’t AI products themselves, but AI tools that dramatically accelerate the coding portion of product development. By embedding these incredibly valuable hybrids <em>with</em> <a href="https://www.dbreunig.com/2025/04/10/the-domain-experts-are-drivers.html">domain experts</a>, you increase the surface area of your fast iteration loop.</p>

<p>So do product managers go away?</p>

<p>No, there’s still plenty of product work to be done to graduate a product from a rapidly assembled app or feature to a robust service: research, compliance, sales ops, product marketing, and more. Further, a FDE focused on building for one client doesn’t have the time to take in the bigger picture, both qualitatively and quantitatively. And if companies over-optimize for speed, skipping traditional product management steps, they’ll eventually get burned.</p>

<p>Take last week’s GPT-5 launch, <a href="https://www.theverge.com/news/756980/openai-chatgpt-users-mourn-gpt-5-4o">where OpenAI had to manage a mountain of upset 4o users who had developed an emotional attachment to GPT-4o’s tone</a>. Instrument ChatGPT all you want – analyze and sort consumer queries into buckets of use cases – and you’d <em>still</em> miss that a chunk of users related to 4o as a friend.</p>

<p>Rather than go away, I think the function of product management will get bisected into two domains:</p>

<ol>
  <li><strong>Application Product Managers:</strong> These roles work closely with customers or partners, and are often “hybrids”: product-minded engineers or product managers who code. They rapidly absorb domain expertise and use it to rapidly iterate products and features. This is product management and engineering, blended with customer success and consulting.</li>
  <li><strong>Foundation Product Managers:</strong> These roles build the core platform upon which <em>Application Product Managers</em> build. They design the APIs, data structures, and core business logic. Further, these core teams productize innovations developed by application product manager teams, handling compliance, security, QA, and more.</li>
</ol>

<p>Naturally, the amount of definition between these two domains is dependent on the size of the company and the nature of the product. But this set up attempts to preserve the speed of AI-assisted development, the need for traditional product management functions, and any need to teach customers how to use your product.</p>

<p>But at larger orgs, there will be many pods of forward-deployed engineers and application product managers. They’ll optimize for speed and customer utility, tossing back innovations that find traction to foundation product teams, who prepare them for wider use. The core platform, that powers the FDEs, is maintained by centralized, slower foundation teams. These core product management teams will effectively negotiate the speed difference between the fast app teams and <em>everything else</em>.</p>

<p>The speed gains from AI-assisted development and the need to <em>teach</em> companies how to use AI-powered products dictates that technology companies embrace forward-deployed work. Product management is going to have to evolve to support speedier engineers, not be a bottleneck, and help their organizations ship stable, safe products.</p>

<p>This adaptation may look like the divide I detail above, or it may look like adoption of new, AI-powered tools that grant speed buffs to traditional product work. Currently, my money’s on the former.</p>

<hr />

<form action="https://buttondown.com/api/emails/embed-subscribe/dbreunig" method="post" target="popupwindow" onsubmit="window.open('https://buttondown.com/dbreunig', 'popupwindow')" class="embeddable-buttondown-form">
  <label for="bd-email">Enter your email to receive the occasional update.</label>
  <div class="form-input">
    <input type="email" name="email" id="bd-email" placeholder="your@email.com" />
    <input type="submit" value="Subscribe" />
  </div>
</form>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:fde">
      <p>We used to call them “field engineers” or “solutions engineers”, but neither of those support the <a href="https://en.wiktionary.org/wiki/tacticool">tacticool</a> branding Palantir cultivates. <a href="#fnref:fde" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Drew Breunig</name></author><category term="Product Management" /><category term="AI" /><category term="Organizational Structure" /><summary type="html"><![CDATA[AI-assisted engineers can code 2-5x faster, but product management work hasn't accelerated at the same pace. Rather than wait for traditional PM processes, organizations are empowering hybrid engineer-PMs, FDEs, to build directly with customers.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.dbreunig.com/img/bisect_head.jpg" /><media:content medium="image" url="https://www.dbreunig.com/img/bisect_head.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>