A Small Model Just for Structured Output
Osmosis-Structure-0.6B is a small model trained with reinforcement learning to do one thing well: extract structured data, typically JSON, from unformatted text. That’s it!
Convincing LLMs to consistently produce JSON or specifically tagged answers has been a headache since ChatGPT arrived, though it has gotten much easier lately. Tools like Ollama and others have their tricks, including appending instructions to prompts detailing a desired JSON schema. DSPy instructs the model to use consistent Markdown delimiters and lists the specific output fields. If your chosen model doesn’t respond well to this tactic (it happens, especially with tiny models), there’s an option to use another LLM to extract the data in a second pass.
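As a rough sketch of that first tactic, a single-pass request with the Ollama Python client might look like this, with the schema spelled out in the prompt (the model tag is just an example, not a recommendation):

```python
# A minimal single-pass sketch, assuming the Ollama Python client and its JSON mode;
# the model tag is an example, not a recommendation.
import ollama

prompt = """Summarize the notes below as JSON matching this schema:
{"attendees": [string], "decisions": [string], "next_meeting": string}

Notes: Sam and Priya met Tuesday, agreed to ship v2, and will meet again June 3.
"""

resp = ollama.chat(
    model="llama3.1",                  # any general-purpose model
    messages=[{"role": "user", "content": prompt}],
    format="json",                     # JSON mode; the schema itself lives in the prompt
)
print(resp["message"]["content"])      # often valid JSON, but not guaranteed
```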
Osmosis-Structure-0.6B is a model built specifically to be that second pass model.
Since it’s small and optimized for this task, it can take the entire burden of structuring output off the much larger, slower, more expensive model doing the real work. This is a pattern we’ve seen in compound AI pipelines (as in the DSPy two-step adapter we mentioned above), but now we can shrink the footprint and raise the accuracy of the model used to extract the information.
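Here’s a rough sketch of that two-pass pattern, assuming the Ollama Python client, Ollama’s structured-output support (a JSON schema passed via `format`), and a local tag for Osmosis-Structure-0.6B; the model tags are placeholders and may differ:

```python
# A rough sketch of the two-pass pattern: a big model answers in natural language,
# then a small extraction model turns that answer into schema-conforming JSON.
# Assumes the Ollama Python client and structured-output support; tags are placeholders.
import json
import ollama

question = "A train leaves at 3:40 pm and arrives at 6:05 pm. How long is the trip?"

# Pass 1: the big model answers however it likes, in plain natural language.
draft = ollama.chat(
    model="llama3.1:70b",  # stand-in for the large, slow, expensive model
    messages=[{"role": "user", "content": question}],
)["message"]["content"]

# Pass 2: the small extraction model maps that answer onto a JSON schema.
schema = {
    "type": "object",
    "properties": {"hours": {"type": "integer"}, "minutes": {"type": "integer"}},
    "required": ["hours", "minutes"],
}
structured = ollama.chat(
    model="osmosis/osmosis-structure-0.6b",  # hypothetical local tag for the Osmosis model
    messages=[{"role": "user", "content": f"Extract the final answer as JSON.\n\n{draft}"}],
    format=schema,
)["message"]["content"]

print(json.loads(structured))  # e.g. {"hours": 2, "minutes": 25}
```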
But the weirdest twist the Osmosis team discovered is that using Osmosis-Structure-0.6B to handle the structured parsing dramatically improved benchmark performance for large models!
It’s a dramatic improvement, one I’d be curious to dig into. Were Sonnet, Opus, and GPT-4.1 lagging because their formatting failed? There’s some support for this, as a recent draft paper suggests. The authors found that recent RL-powered benchmark improvements appear to be associated with formatting improvements, noting, “If the RL-training primarily teaches the model to better work with the evaluation format — it doesn’t deliver on new reasoning abilities as we hope.” If this is the case, o3’s lack of improvement could be chalked up to OpenAI employing a double pass with a smaller model behind the scenes (as the Osmosis team suggests), or to o3 having been trained by RL to effectively perform a double pass during its reasoning.
Or perhaps the Osmosis formatting results occur because, when you prompt models to deliver results in natural language (which, remember, is how the bulk of their training data appears), they can find a better path to the answer rather than a contrived one that relies on behaviors acquired during a later stage of training? If that’s the case, it could explain o3’s minimal difference: as a reasoning model, it’s been trained through RL to continually revisit and recontextualize its thinking.
The Osmosis team hypothesizes o3’s original score may be due to a double pass, “i.e. o3 generates an output, and then 4o-mini (or another small model) is used to validate/structure the output, similar to Osmosis-JSON-06B.” If true, that’s an interesting explanation as well.
You can pull the model from Ollama or grab the weights on Hugging Face.
I’ll definitely be testing this as the formatting model in a DSPy two-step adapter.
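Something like this, assuming DSPy’s TwoStepAdapter takes an extraction LM and that the Osmosis tag below is available on a local Ollama server (both are assumptions on my part):

```python
# A sketch of using a small model as the extraction step in DSPy's two-step adapter.
# Model names, the Ollama tag, and the local endpoint are assumptions; the
# TwoStepAdapter API may differ across DSPy versions.
import dspy

main_lm = dspy.LM("openai/gpt-4.1")  # the large model that does the actual reasoning
extraction_lm = dspy.LM(
    "ollama_chat/osmosis-structure-0.6b",  # hypothetical local tag for the Osmosis model
    api_base="http://localhost:11434",
)

# The main model answers in free-form text; the small model maps it onto the output fields.
dspy.configure(lm=main_lm, adapter=dspy.TwoStepAdapter(extraction_lm))

qa = dspy.Predict("question -> answer: str")
print(qa(question="A train leaves at 3:40 pm and arrives at 6:05 pm. How long is the trip?"))
```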