OpenAI winding down fine tuning is an interesting development and one to watch.

On one hand, model maximalists will argue that the largest models keep getting better at more things, so adjusting their weights becomes less necessary.

On the other hand, the big labs keep pushing their models toward a handful of use cases while training their harness designs into the model, rendering them less generalized. There’s an argument this is fine, because coding and reasoning abilities will solve most other problems.

But what we end up with are models built for their own harnesses. Mario Zechner was wrestling with GPT in the OSS Pi harness this week, trying to coax specific in-harness behaviors out of it, with Claude fighting him every step of the way.

If this continues, there’s a world where third-party harnesses become less valuable when used with frontier lab models, because the first-party harness behavior is already baked in. And there’s no longer a fine-tuning escape hatch to generalize that behavior away.

In this world, frontier models will resemble appliances, not general platforms¹. With the harness trained in and no way to adjust it, application building might get easier for some enterprises, but the trade-off is lock-in. For many, the improved reliability will be worth it.

  1. I’m reminded of John Siracusa’s “Naked Robotic Core” model for the iPhone: that it is ideally a common-denominator device that can support many shapes of applications and interfaces.