The Rise of Spec Driven Development

It’s been a month since I launched whenwords, and since then there’s been a flurry of experiments with spec driven development (SDD): using coding agents to implement software from nothing but a detailed text spec and a collection of conformance tests.

GitHub Could Use a ‘Docs Review’ UI

First off, despite whenwords being a couple of Markdown docs and a YAML test set, people have submitted valuable PRs. Mathias Lafeldt spotted a disagreement about rounding, where the spec instructed the agent to round up in several scenarios, but three tests were rounding down. Others have suggested there should be some CI (despite there being no code) and wonder what that should be.
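To make the rounding disagreement concrete, here is a minimal sketch of one check a code-less repo could run: compare each test case against a rule transcribed from the spec and flag the ones that contradict it. Everything here is hypothetical: the case fields, the values, and the "always round up to whole minutes" rule are stand-ins for illustration, not whenwords’ actual spec or test data.

```python
import math

# Hypothetical test cases, shaped like entries in a YAML test set.
# Field names and values are illustrative, not taken from whenwords.
CASES = [
    {"name": "exact minute", "seconds": 60, "expected_minutes": 1},
    {"name": "just over a minute", "seconds": 61, "expected_minutes": 2},
    {"name": "ninety seconds", "seconds": 90, "expected_minutes": 1},  # rounds down
]


def spec_rule(seconds: int) -> int:
    """Assumed spec rule for this sketch: always round up to whole minutes."""
    return math.ceil(seconds / 60)


def find_disagreements(cases):
    """Return the names of cases whose expected value contradicts the rule."""
    return [
        case["name"]
        for case in cases
        if case["expected_minutes"] != spec_rule(case["seconds"])
    ]
```

Calling `find_disagreements(CASES)` returns the names of the offending cases, so a CI job could fail whenever that list is non-empty. Whether a check like this belongs in CI at all is still an open question.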

There’s been enough action on the repo to give us an idea of what open source collaboration could look like in an SDD world. And it feels more like commenting in and marking up a Google Doc than code merges. I would love to see GitHub lean into this and build richer Markdown review, like Word or Google Docs, making collaboration easier and more accessible to a wider audience.

Emulation & Porting are the Low-Hanging SDD Use Cases

By far, the hardest part of starting an SDD project is creating the tests, which is why many developers are opting to borrow existing test sets or derive new ones from a source of truth.

Here are a few examples:

Anthropic wrote a C compiler in Rust. They used existing test suites and used GCC as a source of truth for validation and generating new tests.
Vercel created a bash emulator in TypeScript. They created and curated an amazing set of shell script spec tests and have been feeding these to Ralph. (To make this even more meta, I’ve been following their commits and Clauding them into Python).
Pydantic created a Python emulator…in Python. This sounds silly, but it’s useful in the same way Vercel’s just-bash is: it’s a super lightweight sandbox for AI agents. (In fact, I’ve already wrapped it in a CodeInterpreter for use with DSPy’s RLM module.)
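The "source of truth" approach in these examples can be sketched in a few lines: run a trusted reference on generated inputs, record its outputs as expected results, then replay those cases against the new implementation. This is a toy illustration only. Python’s built-in `sorted` stands in for a reference like GCC, and `candidate_sort` stands in for agent-written code; none of the names below come from the projects above.

```python
import random


def reference_sort(xs):
    """Source of truth for this sketch: Python's built-in sorted()."""
    return sorted(xs)


def candidate_sort(xs):
    """Stand-in for agent-written code: a plain insertion sort."""
    out = []
    for x in xs:
        i = len(out)
        # Walk left until we find the slot where x belongs.
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def generate_cases(n, seed=0):
    """Build n test cases by recording what the reference returns."""
    rng = random.Random(seed)  # seeded so the generated cases are reproducible
    cases = []
    for _ in range(n):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 8))]
        cases.append({"input": xs, "expected": reference_sort(xs)})
    return cases


def run_cases(impl, cases):
    """Return every case where impl disagrees with the recorded output."""
    return [c for c in cases if impl(list(c["input"])) != c["expected"]]
```

Running `run_cases(candidate_sort, generate_cases(200))` returns the failing cases, which is an empty list when the candidate agrees with the reference. Each failure is a ready-made regression test to hand back to the agent.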

Now… It’s worth noting that most of these examples didn’t emerge perfectly. Anthropic’s C compiler just kinda punted on the hard stuff and admits the generated code is inefficient¹. Pydantic’s Python emulator lacks json, typing, sys, and other standard libraries, though I’m sure those will come soon. Vercel’s just-bash sports outstanding coverage, though people continue to find bugs.

This is the big takeaway from watching the last few weeks of SDD: agents and a pile of tests can get you really far, really fast, but for complex software they can’t get you over the line. Edge cases will generate new tests, truly hard problems will resist SDD implementation, and architectural issues will keep you from running agents in parallel.

Vercel’s CTO and just-bash creator, Malte Ubl, sums it up best:

Software is free now. (Free as in puppies)

You can Ralph up a port or emulator in a weekend or two, but now you have to take care of it.

¹ There is lots to pick apart in Anthropic’s piece (multiple compiler people, and people in related fields, have pinged me about how misrepresentative it is), but the most laughable claim is that this is “a clean-room implementation”. The idea that using an LLM trained on the entire internet, all of GitHub, and warehouses full of books is a clean room environment is absurd.