Mistral Small & Human-Centric Benchmarks
I really like small models. They’re fast, cheap, and local – the perfect foundation for fashioning a cog in a compound AI pipeline.
Today, Mistral released Mistral Small 3, “a latency-optimized 24B-parameter model released under the Apache 2.0 license.” At 20GB, Mistral Small 3 runs well on my Mac Studio with 64GB of RAM, though Mistral notes it fits “in a single RTX 4090 or a 32GB RAM MacBook once quantized.” So far it’s performed very well for me – knocking out code questions, extraction, rephrasing, and other general tasks. It feels on par with Llama 3.3 70B and Qwen 32B.
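If you want to poke at it the same way, here’s a minimal sketch of chatting with a local quantized build through an OpenAI-compatible endpoint. I’m assuming an Ollama-style server on localhost and a `mistral-small` model tag – swap in whatever your local setup actually exposes:

```python
# Minimal sketch: talk to a locally served Mistral Small 3 via an
# OpenAI-compatible API. The base_url and "mistral-small" tag are
# assumptions about your local setup, not anything Mistral specifies.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # e.g. Ollama's OpenAI-compatible endpoint
    api_key="local",                       # placeholder; local servers typically ignore it
)

response = client.chat.completions.create(
    model="mistral-small",  # assumed tag for a quantized Mistral Small 3 build
    messages=[
        {"role": "user", "content": "Rephrase this in plain English: ..."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```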
But what I really like is how they benchmarked the model. Here’s a screenshot of Mistral Small 3’s Hugging Face page:
I love this. Rather than highlighting benchmarks irrelevant to their target audience (remember: you really should have your own eval), they’re publishing a quantified table of “vibes.” Below the chart they note:
- We conducted side by side evaluations with an external third-party vendor, on a set of over 1k proprietary coding and generalist prompts.
- Evaluators were tasked with selecting their preferred model response from anonymized generations produced by Mistral Small 3 vs another model.
- We are aware that in some cases the benchmarks on human judgement starkly differ from publicly available benchmarks, but have taken extra caution in verifying a fair evaluation. We are confident that the above benchmarks are valid.
Translated: “Yeah, it’s imperfect, but here’s how people feel using this.” It’s a self-administered Chatbot Arena[^1] (though they hired a third party to execute it).
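The mechanics of a side-by-side preference eval like this are simple to sketch. Here’s a toy version – my own illustration, not Mistral’s harness: show a rater two anonymized generations in random order, record the pick, and report a win rate.

```python
# Toy sketch of an anonymized side-by-side preference eval.
# Function names and the CLI-style judging loop are illustrative only.
import random
from collections import Counter


def collect_preference(prompt: str, response_a: str, response_b: str) -> str:
    """Show two responses in random order and return 'a' or 'b'
    for whichever one the human rater preferred."""
    pair = [("a", response_a), ("b", response_b)]
    random.shuffle(pair)  # hide which model produced which response
    print(f"PROMPT:\n{prompt}\n")
    for i, (_, text) in enumerate(pair, start=1):
        print(f"Response {i}:\n{text}\n")
    choice = input("Which did you prefer? [1/2]: ").strip()
    return pair[0][0] if choice == "1" else pair[1][0]


def win_rate(votes: list[str]) -> float:
    """Fraction of head-to-head prompts where model A was preferred."""
    counts = Counter(votes)
    return counts["a"] / max(len(votes), 1)
```

Run that over a thousand or so prompts per competitor and you get exactly the kind of win-rate chart Mistral published.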
I don’t begrudge the big open, standard evals. They push model development further by putting out crystal-clear challenges for teams to develop against. But it feels like we’ve been over-fitting to the most popular evals lately. Rough tests like Mistral’s and Chatbot Arena at least attempt to bring some qualitative metrics to the table.
(Personally, I find DeepSeek-R1 to be a prime example of this. It nailed many metrics, leading to the dramatic headlines, but for most tasks I find myself turning to Claude or to a local Llama 3.3.)
Mistral’s vibes preference benchmark here is very welcome. It’s simple, and I hope more labs follow suit.
[^1]: I’ve noticed Chatbot Arena isn’t cited or discussed as often as it used to be. In the past, mystery models on its leaderboard generated waves of speculation, and new leaders created waves of headlines. I chalk its decline in popularity up to the speed of the field these days – it takes time to establish a good score on Chatbot Arena, which doesn’t fit the cadence of splashy, spiky launches.