---
title: "Mistral Small & Human-Centric Benchmarks"
date: 2025-01-30
author: Drew Breunig
tags: ["mistral", "llm", "ai", "evals"]
url: https://www.dbreunig.com/2025/01/30/mistral-small-human-centric-benchmarks.html
---

I really like [small models][smaller]. They're fast, cheap, and local – the perfect foundation for fashioning [a cog][cog] in a compound AI pipeline.

Today, Mistral released [Mistral Small 3][mistral], "a latency-optimized 24B-parameter model released under the Apache 2.0 license." At 20GB, Mistral Small 3 runs well on my Mac Studio with 64GB of RAM, though Mistral notes it fits "in a single RTX 4090 or a 32GB RAM MacBook once quantized." So far it's performed very well for me – knocking out code questions, extraction, rephrasing, and other general tasks. It feels on par with Llama 3.3 70B and Qwen 32B.
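
If you want to poke at it yourself, here's a minimal sketch of calling the checkpoint from the [Hugging Face page][hf] with the `transformers` library. The loading options and the prompt are my own illustrative choices, not anything Mistral prescribes:

```python
# Minimal sketch: load Mistral Small 3 from Hugging Face and run a quick
# extraction-style prompt. The loading options here are illustrative assumptions.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-Small-24B-Instruct-2501",
    device_map="auto",   # place weights on whatever accelerator is available
    torch_dtype="auto",  # keep the checkpoint's native precision
)

messages = [
    {"role": "user", "content": "Extract the dates from: 'Shipped Jan 30, returned Feb 4.'"}
]
reply = pipe(messages, max_new_tokens=128)[0]["generated_text"][-1]["content"]
print(reply)
```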

But what I really like is how they benchmarked the model. Here's a screenshot of [Mistral Small 3's Hugging Face page][hf]: 

![Human eval benchmarks for Mistral Small 3](/img/mistral_vibes.png)

I love this. Rather than highlighting benchmarks irrelevant to their target audience (remember: you really should have [your own eval][eval]), they're publishing a quantified table of "[vibes][vibes]." Below the chart they note:

> - We conducted side by side evaluations with an external third-party vendor, on a set of over 1k proprietary coding and generalist prompts.
> - Evaluators were tasked with selecting their preferred model response from anonymized generations produced by Mistral Small 3 vs another model.
> - We are aware that in some cases the benchmarks on human judgement starkly differ from publicly available benchmarks, but have taken extra caution in verifying a fair evaluation. We are confident that the above benchmarks are valid.

Translated: "Yeah, it's imperfect, but here's how people feel using this." It's a self-administered [Chatbot Arena][chatbot arena][^cba] (though they hired a 3rd party to execute it).
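
To make the tallying concrete: a preference eval like this boils down to counting which model's response evaluators picked per prompt and reporting a win rate, ideally with a confidence interval. Here's a rough sketch of that math – the function and the Wilson interval are my illustrative choices, not Mistral's published method:

```python
# Hypothetical sketch of scoring a side-by-side preference eval.
# This is NOT Mistral's pipeline; the names and the Wilson interval
# are illustrative choices.
from math import sqrt

def win_rate(preferences: list[str]) -> tuple[float, float, float]:
    """preferences holds 'a' or 'b' per prompt: the response the evaluator preferred."""
    n = len(preferences)
    p = sum(1 for choice in preferences if choice == "a") / n

    # 95% Wilson score interval, a common hedge for proportions.
    z = 1.96
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, center - margin, center + margin

# e.g. 1,000 prompts where model A was preferred 613 times
rate, low, high = win_rate(["a"] * 613 + ["b"] * 387)
print(f"win rate {rate:.1%} (95% CI {low:.1%}–{high:.1%})")
```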

[^cba]: I've noticed Chatbot Arena isn't cited or discussed as often as it used to be. In the past, mystery models on its leaderboard generated waves of speculation and new leaders generated waves of headlines. I chalk its decline in popularity up to the speed of the field these days – it takes time to establish a good score on Chatbot Arena, which doesn't fit the cadence of splashy, spiky launches.

I don't begrudge the big, open, standard evals. They push model development further by putting out crystal-clear challenges for teams to develop against. But it feels like we've been over-fitting to the most popular evals lately. Rough tests like Mistral's and Chatbot Arena at least attempt to bring some qualitative metrics to the table.

(Personally, I find DeepSeek-R1 to be a prime example of this. It nailed many benchmarks, leading to dramatic headlines, but for most tasks I find myself turning to Claude or, locally, Llama 3.3.)

Mistral's vibes preference benchmark here is very welcome. It's simple, and I hope more model releases follow its lead.

[smaller]: https://www.dbreunig.com/2023/12/21/a-big-year-for-small-models.html
[cog]: https://www.dbreunig.com/2024/10/18/the-3-ai-use-cases-gods-interns-and-cogs.html
[mistral]: https://mistral.ai/news/mistral-small-3/
[hf]: https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501#function-calling
[eval]: https://www.dbreunig.com/2025/01/08/evaluating-llms-as-knowledge-banks.html
[vibes]: https://simonwillison.net/2023/Dec/31/ai-in-2023/#vibes-based-development
[chatbot arena]: https://lmarena.ai
