Drew Breunig

The Wisdom of Artificial Crowds

2025-04-18T19:50:00-07:00

Building a Distributed AI Network of Diverse, Emergent Intelligences

A few weeks ago, two papers caused me to reassess my expectations for the future of AI.

Not because they showcased some breakthrough capability or jaw-dropping new technique, but because they pointed to a different path forward — one where progress doesn’t rely solely on ever-larger models but instead on networks of smaller, diverse, and specialized intelligences.

At their core, LLMs are average machines. They predict the next token based on the sum of all the data they’ve been trained on. Much of the time, this is exactly what we want: the consensus answer to a question, derived from an internet’s worth of content.

But sometimes, we don’t want the average. Recently, we discussed how reasoning models “search” as they reason. They’ll nominate and explore multiple, potential solutions to your problem, exploring each before settling on the hopeful ideal.

For searching, we don’t want the average. We want a scattershot of diverse, yet relevant potential paths.

But if diversity is what we’re after during the “search” stage, why should we rely on only one model?

Prior to the emergence of reasoning models, people started experimenting with LLM ensemble techniques: prompting many different models with the same prompt, and then using another model to assemble a final answer from the various outputs. Squint a bit and this looks a bit like the “searching” reasoning models perform, but at the time the main driver was the fact that different models were better at different things.

Initial ensemble techniques never really found traction, likely because the gains in LLM performance and efficiency were so great from 2023 to 2025. Why complicate your pipeline when another model will arrive next week that’s better and cheaper than any of the others in your stable?

But now that we want diversity while reasoning, we should give ensembles another look.

This is not because large reasoning models aren’t capable of coming up with diverse approaches (they’re so big there’s more than enough to draw from) but because smaller models can generate an array of approaches faster for a fraction of the cost.

The price range among LLMs these days is enormous. Taking a look at Simon Willison’s handy LLM pricing page, we can see a difference of four orders of magnitude between the most expensive and cheapest models. For a more practical comparison: OpenAI’s flagship o3 is 100x the cost of GPT-4.1 Nano¹.

Small models are also much faster. Sticking with the same OpenAI lineup comparison, GPT-4.1 Nano is over 25 times faster than o3 (~300 tokens per second compared to 11 tokens per second).

Suddenly, a pipeline where we ask a small model the same question 100 times before asking a giant model to synthesize everything looks pretty attractive! We just need something to make that task easy.

A Framework for a Network of Networks

A few weeks back, a team led by researchers from Foundry, Databricks, Stanford, and a smattering of other labs released Ember, a framework for composing tasks like the one above.

Here’s what it looks like:

# Create your pipeline
pipeline = non.build_graph(["100:E:gpt-4.1-nano:0.7", "1:J:o3:0.2"])

# Prompt the pipeline
result = pipeline(query="What causes the northern lights?")

In that first line, we built exactly the pipeline we described above. 100:E:gpt-4.1-nano:0.7 means we want an ensemble of 100 requests to gpt-4.1-nano. 1:J:o3:0.2 means we want the responses from the ensemble judged with one request to o3.

The brevity of Ember is handy, but there are plenty of less-compact notations available to spec out complex graphs, assigning different jobs to different models, in parallel and sequential structures.
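To make the compact notation concrete, here is a small stand-alone parser for strings shaped like the two above. This is an illustrative sketch, not Ember's implementation: `NodeSpec`, `parse_spec`, and the field names are hypothetical, only the two role codes used above are handled, and the fourth field is read as the sampling temperature covered in the next section.

```python
from dataclasses import dataclass

# Hypothetical names for illustration; this is not Ember's own parser.
ROLES = {"E": "ensemble", "J": "judge"}


@dataclass(frozen=True)
class NodeSpec:
    count: int          # how many requests to send
    role: str           # what the node does with them
    model: str          # which model to call
    temperature: float  # sampling temperature for those calls


def parse_spec(spec: str) -> NodeSpec:
    """Split 'count:role:model:temperature' into typed fields."""
    count, role, model, temperature = spec.split(":")
    if role not in ROLES:
        raise ValueError(f"unknown role code: {role!r}")
    return NodeSpec(int(count), ROLES[role], model, float(temperature))


for spec in ["100:E:gpt-4.1-nano:0.7", "1:J:o3:0.2"]:
    print(parse_spec(spec))
```

Read this way, the pipeline is just two typed nodes: a wide, cheap first layer and a single, expensive second one.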

Generating Diversity With High Temperatures

The eagle-eyed among you likely noticed the trailing numbers in our graph definition, the 0.7 and 0.2. These figures set the temperature of our models.

Temperature is a parameter that controls the randomness of an LLM. Setting a low temperature instructs a model to add little to no randomness when generating tokens. Whatever is statistically the most likely token will be returned.

By setting a high temperature, we’re instructing a model to add a little randomness when picking the next token. Most of the time, it’ll pick the first or second most likely token. But sometimes it’ll dip even lower. This sounds like a small tweak, but because a model references all previous tokens when figuring out the next thing to generate, this randomness compounds.
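Here is a minimal sketch of the usual mechanism: divide each candidate token's raw score by the temperature, then turn the scores into probabilities with a softmax. The token list and scores below are made up for illustration, and treating a temperature of 0 as "always pick the top token" is a common convention rather than something every inference engine is guaranteed to do.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be > 0."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng=random):
    """Greedy pick at temperature 0; otherwise sample from the distribution."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical scores for the word after "My name is"
tokens = ["Pete", "Percy", "Patrick", "John", "Alexander"]
logits = [4.0, 3.0, 2.5, 1.0, 0.0]

for t in (0.2, 0.7, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperatures nearly all of the probability sits on the top token. As the temperature rises, the distribution flattens and the long tail gets a real chance of being picked.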

If you’re running a local LLM, you can see this for yourself. (And if you’re not, here’s how you get started.) Let’s give the following prompt to Llama 3.2, a 3 billion parameter model:

Reply with one word only. What is the next word in this sequence: My name is

If we set our temperature parameter to 0.0, Llama 3.2 always responds with, “Pete.” Regenerate the prompt again and again – you’ll always get “Pete.”

But if we raise our temperature to 0.7 and generate 10 responses, we get the following:

Percy
Patrick
Pete’s
Pete
Pete
Percy
John
Pete’s
Peterson
Pete

Now we’re getting some randomness, though we’re bouncing around the most likely next tokens (with P and Pe being the clearly top-ranked). If we dial our temperature up to 2.0, we get “Penny”, “Alexander”, and “Robert” – nicely illustrating the increased randomness.

We can now see why our temperature values were set to 0.7 and 0.2 in our Ember call. For our ensemble of 100 requests to a tiny model, we want plenty of randomness – 0.7 – to ensure the responses cover a wide surface area of possibilities. But for our large judge model, we kept our temperature very low – 0.2 – to keep it focused on logically synthesizing a single, representative response.
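The ensemble-then-judge pattern itself is simple enough to sketch in plain Python. Everything below is an assumption made for illustration: `ensemble_then_judge`, `fake_small_model`, and `majority_judge` are hypothetical names, the two model functions are offline stand-ins for real API calls, and a real judge would be a large model synthesizing an answer rather than a vote counter.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def ensemble_then_judge(query, candidate_model, judge_model, n=100,
                        candidate_temperature=0.7, judge_temperature=0.2):
    """Ask candidate_model n times in parallel, then let judge_model synthesize."""
    def one_call(_):
        return candidate_model(query, temperature=candidate_temperature)

    with ThreadPoolExecutor(max_workers=16) as pool:
        candidates = list(pool.map(one_call, range(n)))
    return judge_model(query, candidates, temperature=judge_temperature)


# Stand-ins for real API calls, so the sketch runs offline.
def fake_small_model(query, temperature):
    return random.choice(["solar wind hitting the atmosphere", "moonlight on ice"])


def majority_judge(query, candidates, temperature):
    answer, votes = Counter(candidates).most_common(1)[0]
    return f"{answer} ({votes}/{len(candidates)} candidates agreed)"


print(ensemble_then_judge("What causes the northern lights?",
                          fake_small_model, majority_judge, n=10))
```

Because the candidate calls are independent, they can all run at once. The wall-clock cost of the first layer is roughly one small-model call, not one hundred.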

Generating Diversity With Fine Tuning

The day Ember launched, Databricks’ AI research team published a blog post titled “The Power of Fine-Tuning on Your Data: Quick Fixing Bugs with LLMs via Never Ending Learning (NEL).” Reading this piece immediately after Ember’s blog post, I saw a vision for a diverse, distributed AI future snap into focus.

But first: the blog post.

In the Databricks interface, there are three LLM-powered features: autocomplete, chat, and “Quick Fix.” Quick Fix is exactly what it sounds like: if a bit of code in a cell fails to run, Quick Fix will recommend a correction. The user can reject the change or accept it and run the code.

The Mosaic AI team ran an experiment where they hooked up the Quick Fix feature to GPT-4o and the much smaller Llama 3.1 8b. When backed by Llama 3.1, Quick Fix’s performance was terrible. However, when a code cell was corrected by the user, the team would use this data to train Llama 3.1.

Pretty soon, the Quick Fix trained Llama model was outperforming GPT-4o²:

And the tuned Llama performed over twice as fast:

But the real improvement is that the fine-tuned Llama learned the conventions and preferences of the Databricks team. For example, GPT-4o is quick to rewrite code in Python or load libraries you’d prefer not to use (which may not even exist!). This is a common problem for Databricks team members, who often write in SQL. GPT-4o often rewrites an entire cell as Python, loading PySpark and translating SQL lines to formatted strings. (This behavior isn’t limited to GPT-4o; Claude frequently “fixes” my DuckDB scripts by rewriting them as Python scripts using the DuckDB library.)

The Quick Fix trained Llama model doesn’t have this issue: it was trained on corrections written by the Databricks team, through the act of performing their usual work.

The fastest, cheapest, and best model’s skills are localized to the team.

Fast, Cheap, & Diverse

Which brings us back to the topic of diversity.

The Quick Fix fine-tuning pattern – where models are continually trained with corrections written by an individual programmer or team – builds better, diverse models. If this pattern were deployed across many teams and developers, we’d end up with many different yet performant models. Further, these fine-tuned models can be small, fast, cheap, private, and good.

And if these diverse models are remotely accessible, we can connect them together in a network of networks, polling them with tricky coding questions before synthesizing their diverse responses with a judge model.

Many novel applications come to mind, utilizing this pattern:

A company could have specific product teams maintain their own fine-tuned models, constantly trained by programmers working on that specific domain. The teams get the benefit of a model that doesn’t fight their localized conventions and new team members get a leg-up when starting out. But for big questions, they could poll models from every team, drawing on the diverse approaches of their individual employees.
Assisted coding apps – like Cursor or Cline – could allow users to actively fine-tune small models, hosted locally, enabling quick, cheap responses that adopt their conventions. Further, these companies could form domain “communities” that each train different models to learn their specific languages or frameworks. Such models could better adhere to a language’s conventions and more quickly adapt to new releases and APIs.

As small models continue to improve and coding assistants continue to evolve, I’m hopeful we’ll see more localized training. A potential peril of dependence on the largest models is a lack of intellectual diversity. We can hack it by training reasoning models to search, tweaking our temperatures, and improving ensemble frameworks. But what local training brings is a diversity of perspectives, which I suspect will be more valuable together than a single, large average machine.

We’ve often talked about how LLMs are improving more quickly in verifiable domains. Because we can evaluate the correctness of code with unit tests and compilers, we can generate as much synthetic coding data as we need. However, as any developer will tell you, just because code compiles doesn’t mean it’s optimal. Passing a unit test doesn’t mean the code is stable, secure, efficient, or readable. While programming is more objective than many fields, anyone who has actually worked with engineers knows there is no shortage of arguments and perspectives. A diversity of perspectives makes everybody’s arguments sharper and results in better decisions, architectures, and code.

Locally tuned small models connected with network-of-networks frameworks could preserve this diversity, enabling better reasoning and synthesis outcomes. The ability to leverage an emergent network of unruly intelligence could become another network effect.

¹ During normal use this cost factor will be dramatically higher, as o3’s reasoning produces much longer outputs than the non-reasoning GPT-4.1 Nano.
² I edited these charts for clarity, removing a bar that goes beyond the scope of this article. If you’re curious, I encourage you to read the entire write-up.

What We Mean When We Say “Think”

2025-04-11T05:14:00-07:00

What exactly are “reasoning” and “thinking” models?

Last September, OpenAI launched its first “reasoning” model, o1. Unlike previous models, o1 “thinks” before delivering a final answer, chewing on a problem with step-by-step notes. OpenAI explained,

Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem.

This tactic delivered incredible results, at least for programming and math applications. OpenAI’s o1 model scored 6 times higher than the non-reasoning GPT-4o when answering competitive math questions and 8 times higher when solving coding challenges.

Suddenly, every AI lab was building and releasing reasoning models. Four months later, DeepSeek’s arrival made these models a household name. Today, Claude has “thinking modes” and ChatGPT has a “reason” button.

But how exactly did these models learn to “reason”? What do we mean when we say, “think”? Why do reasoning models excel at coding and math, but struggle to show improvements when venturing beyond these domains?

Today let’s look at how we arrived at reasoning models, review how they’re built, and examine their impact on the ecosystem.

Like everything else in AI, reasoning models aren’t magic. They just haven’t been clearly explained.

ToC / tl;dr

This is a longish piece. If you're familiar with the domain, feel free to skip around. If you're strapped for time, the bullets below will give you the broad strokes.

Longer Prompts Are Better Prompts: Early prompt engineers discovered that longer prompts work better, thanks to their additional details, examples, and context.
Prompting Models to Reason: "Chain of thought" prompting techniques encouraged models to provide their own context, by asking them to think step-by-step.
Training Models to Reason: We taught models to reason during training by curating and creating examples of reasoning data and scoring their reasoning processes during fine-tuning.
The Arrival of OpenAI's o1: o1 proved training models to reason delivered a step-change in performance, and demonstrated this improved if we simply let models think longer.
DeepSeek's Splash: DeepSeek's R1 made reasoning a household name thanks to a free chatbot and visible reasoning. But the team's research underlined the limitations of reasoning models.
The Strengths & Limits of Reasoning Models: Reasoning models deliver outsized performance in quantitative fields, like math and coding, but only slightly move the needle in qualitative domains.
The Impact of Reasoning Models: The rise of reasoning models shifts the share of compute towards inference and will create an AI perception gap among users.

Longer Prompts Are Better Prompts

Following the arrival of ChatGPT, many began experimenting with best practices for prompting. Often these were little hacks that resulted in better responses. A few were somewhat absurd, like promising to tip the LLM or not punish it if it produces a good answer (I’m not joking).

One of the most important discoveries was that short prompts are bad prompts. When we provide LLMs with plenty of details, context, and examples, we get better answers. There are a few ways to do this:

Provide more detailed instructions: Be explicit, detailed, and exhaustive about the task at hand. State the core task at the beginning of the prompt and reiterate it at the end.
Provide examples of ideal interactions: Give examples of ideal output given an input. Provide one or several examples, each illustrating the desired outcome in different ways. You may have heard the terms “zero-shot”, “one-shot”, or “few-shot.” This is what they’re referring to; replace “shot” with “example” and you’ve got it.
Provide additional context: Append documents, definitions, documentation, and other references to draw upon. You may have heard the term “RAG”, or “Retrieval-Augmented Generation” – this is that. LLM prompts are augmented with relevant information retrieved from a dataset.

All of these tactics make prompts longer and give LLMs more explicit instructions, references, and examples to complete the task at hand, yielding better results.

Across a range of benchmarks, longer prompts perform better. Returns eventually diminish, but the optimal range for the models analyzed above is 16k to 64k tokens, which equates to roughly 12,000 to 48,000 words (or 24 to 96 pages of single-spaced text) – dramatically longer than most casual prompts.
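For anyone who wants to check the conversion, here is the arithmetic as a quick sketch. The two constants are rules of thumb implied by the figures above (about 0.75 English words per token and about 500 words per single-spaced page), not exact properties of any tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # rule-of-thumb assumption for English text
WORDS_PER_PAGE = 500    # rough assumption for single-spaced pages


def tokens_to_words_and_pages(tokens: int) -> tuple[int, int]:
    """Estimate word and page counts for a prompt of the given token length."""
    words = round(tokens * WORDS_PER_TOKEN)
    return words, round(words / WORDS_PER_PAGE)


for tokens in (16_000, 64_000):
    words, pages = tokens_to_words_and_pages(tokens)
    print(f"{tokens:,} tokens ~ {words:,} words ~ {pages} pages")
```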

Why does this work? At their core, LLMs are next-token prediction machines, constantly forecasting the most likely word to come next. Crucially, they consider ALL preceding tokens in a prompt when making these predictions, not just the most recent ones. Each additional token we provide effectively reduces the “surface area” of possible interpretations, narrowing the model’s prediction space and minimizing ambiguity about the task.

It’s similar to playing 20 Questions — it’s much easier to guess the answer after 19 questions versus just one. By providing extensive context, examples, and specific instructions, we’re essentially giving the model more “questions answered” before it needs to respond, allowing it to focus its capabilities on a much more precise understanding of our intended task rather than having to make broad assumptions about what we want.

Prompting Models to Reason

When there are more tokens to work with, LLMs perform better. This doesn’t just apply to input tokens. Generated output tokens – the ones models produce as they respond to your prompts – are taken into account when determining the next token.

In fact: output tokens are treated the same as input tokens. Therefore, if a model spends more time answering it’s effectively making your prompt longer.

So if we can convince the model to expound a bit before directly answering your input we stand to get better answers.
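A toy generation loop makes this visible. The sketch below is not how any particular inference engine is written: `generate` and `counting_model` are hypothetical names, and the "model" is a stand-in that simply counts upward. The point is only that each new token is appended to the same context the next prediction reads from.

```python
def generate(next_token, prompt_tokens, max_new_tokens=20, stop="<end>"):
    """Toy autoregressive loop: every generated token joins the context
    that the next prediction is conditioned on."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(context)  # sees the prompt AND all prior output
        if token == stop:
            break
        context.append(token)
    return context[len(prompt_tokens):]


# Hypothetical stand-in "model": counts upward from the last number it sees.
def counting_model(context):
    last = int(context[-1])
    return str(last + 1) if last < 5 else "<end>"


print(generate(counting_model, ["1", "2"]))  # ['3', '4', '5']
```

The stand-in can only produce "4" because "3" is already sitting in its context, which is the same reason a model's written-out reasoning can shape its final answer.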

During those aforementioned early days of prompt engineering, people discovered that asking a model to “think out loud” or “show their work” before they issue a final answer often yielded better results. You can still do this today, with non-reasoning models. Just append the magic words – “Think step by step!” – to the end of your prompt.

Mistral Small 3.1 isn’t a reasoning model, but by simply appending, “Think step by step!” to the end of our request we can get it to act – and perform – like one.

This prompting technique is called, “Chain-of-Thought Prompting” or “CoT.” It’s deployed quite often, still, in many LLM-powered pipelines. In our recent post on the importance of building your own evaluations, using chain-of-thought strategies increased Llama 3.2’s performance on our Jeopardy questions benchmark by 14%.

CoT’s ability to improve LLM performance surprised researchers. Models hadn’t been explicitly trained to think step-by-step. Yet here they were, able to string together reasoning steps and home in on better answers.

But what if we purposely trained models to reason? Would we see similar gains?

Training Models to Reason

To understand how we might train models to reason, let’s back up and review the training stages used for most LLMs. Roughly, the process goes like this:

Pretraining establishes the foundation. Massive amounts of textual data are fed into the model to provide it with diverse knowledge and general language abilities. Models at this stage are great at predicting the next token, but if you ask one a question it won’t directly answer it; it’ll just guess the next token as if it were continuing your thought.
Fine-tuning teaches the model to converse. With the base established, we enter what is often called the “instruction tuning stage,” a fancy way to say, “let’s be sure the model replies in a way our users expect it to.” Here the model learns how to converse and interact with users. Initially, this work was done entirely by humans, but now we have large datasets that serve as examples for fine-tuning.
Alignment makes the model helpful. Our knowledgeable, conversant model isn’t quite ready for the real world. The final stage, alignment, can be thought of as a finishing school for bots: training focused on making the model helpful, honest, and harmless.

During the first two stages, we can coax models to reason.

Teaching Reasoning During Pretraining

If we have a sufficient amount of reasoning examples, we can use them during pretraining. However, most content used to train LLMs doesn’t come in this format and hiring humans to create reasoning content is prohibitively slow and expensive.

Meta’s Llama and Microsoft’s Phi teams eschewed humans altogether and used LLMs to rephrase high-quality data into step-by-step reasoning formats. As we previously wrote, in our synthetic data explainer:

[The Phi-4 team] created question datasets from sources like Quora, AMAs on Reddit, or the questions LinkedIn prompts you to answer. High-quality content demonstrating, “complexity, reasoning depth, and education value,” was selected from web pages, books, scientific papers, and code. Much of Microsoft’s previous Phi work dealt with this filtering methodology. In some cases, these high-quality selections were rephrased into Q&A content, similar to the previous example. All of this went into their seed pile.

Using GPT-4o as a teacher, the team transformed the seeds into synthetic data, “through multi-step prompting workflows..rewriting most of the useful content in given passages into exercises, discussions, or structured reasoning tasks.” This synthetic dataset made up 40% of Phi-4’s training data.

Rephrasing much of their pretraining corpus to teach Phi-4 reasoning from the start paid off: with just 14 billion parameters, Phi-4 outperforms GPT-4o (which likely has hundreds of billions of parameters) on graduate-level STEM and math evaluations.

Teaching Reasoning During Fine-Tuning

During the fine-tuning phase, a technique called “Reinforcement Learning” is usually deployed. Reinforcement Learning, or RL, is a process where a model’s output is scored, and this score is used to update the model’s weights. This score can be assigned by a human reviewer (otherwise known as “Reinforcement Learning from Human Feedback”), a computer program, or another LLM.

RL was a game-changing development in the LLM space – ChatGPT’s success can directly be tied to OpenAI’s use of human reviewers to teach GPT-3 how to converse effectively. Further, as models grew in size, RL was a crucial technique for improving the performance of smaller models. Small models could effectively go to school using RL, using large models as teachers scoring their work, with dramatic results.

To adapt RL for reasoning models, researchers began scoring each step taken by a model, not just the final result. Referred to as a “Process Reward Model” or PRM, this tactic unlocked the reasoning models we know today.

Before PRM, reinforcement learning only scored a model’s output after it was done generating it. With PRM, however, individual reasoning steps were scored as the model was working. This in-process feedback encouraged models to iterate on ideas and backtrack several steps if they arrived at a dead end. The reasoning produced when using PRMs wasn’t linear, but complex.
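The difference between scoring the outcome and scoring the process is easy to see with a toy example. In the sketch below, a hand-written arithmetic checker stands in for the reward model; real PRMs are learned models, and the function names and step format here are assumptions made for illustration.

```python
def check_step(step: str) -> bool:
    """Verify one arithmetic step written as 'a op b = c' (integers only)."""
    left, right = step.split("=")
    a, op, b = left.split()
    a, b, claimed = int(a), int(b), int(right)
    actual = {"+": a + b, "-": a - b, "*": a * b}[op]
    return actual == claimed


def outcome_reward(steps, expected_final):
    """Score only the final result."""
    final = int(steps[-1].split("=")[1])
    return 1.0 if final == expected_final else 0.0


def process_reward(steps):
    """Score every intermediate step."""
    return [1.0 if check_step(s) else 0.0 for s in steps]


# Solving (3 + 4) * 2, with a slip in the first step
steps = ["3 + 4 = 8", "8 * 2 = 16"]
print(outcome_reward(steps, expected_final=14))  # 0.0
print(process_reward(steps))                     # [0.0, 1.0]
```

The outcome score only says the answer is wrong. The per-step scores say where it went wrong, which is the signal that lets a model learn to backtrack.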

Three reasoning techniques emerged through reinforcement learning using process reward models:

Searching: Models learned to explore multiple, candidate solution paths to solve a single problem.
Reflection: Models learned to identify unpromising intermediate results and course correct.
Factoring: Models learned to break complex problems into simpler components.

These techniques enable models to essentially reason nearly forever, fracturing problems into subproblems, evaluating multiple approaches for each, breaking those approaches into subproblems… Etc. How long we let models think is up to the user, their budget, schedule, and the task at hand.

The Arrival of o1

The first major reasoning model trained with PRM, OpenAI’s o1, dramatically proved the efficacy of the technique. o1’s performance on math and coding benchmarks was a step change beyond previous scores:

Critically, OpenAI noted that o1’s performance improves not just through additional training but also when given more time to “think” through problems.

The chart above introduced a new concept, “test-time compute,” which has remained the focus of frontier model builders ever since.

“Test-time compute” is a fancy way of saying, “how much time a model spends responding to a prompt.” The reasoning tactics models developed with PRMs – searching, reflection, and factoring – enable them to think for as long as we wish. And this, OpenAI established, is just as valuable a lever for increasing model performance as pretraining.

Since o1, increasing test-time compute is the chief way AI labs have been increasing the ceiling of LLM performance.

Last December, OpenAI released their o3 model, which delivered a step-change result compared to o1 on ARC-AGI, a notoriously hard AI reasoning benchmark:

o3 was a step up, but the key to victory here was simply letting these models run for incredible durations. The yellow “o3 Low” dot on the chart above cost $6,677 to run. OpenAI didn’t share the costs for the “High” dot, but an ARC-AGI blog post says it used 172x more compute than the “Low” dot (implying a cost of more than $1 million, since $6,677 × 172 ≈ $1.15 million).

DeepSeek’s Splash

If o1 proved to the LLM ecosystem that reasoning was worth pursuing, DeepSeek sold the idea to the masses. Partially this was due to the accessibility of the model; trying it out was free and easy. Partially this was due to the UI for DeepSeek’s chatbot; the reasoning tokens were visible and streamed out as they were generated, allowing users to watch the model work its way through a problem. o1, on the other hand, remained behind a paywall and hid its reasoning from users.

How DeepSeek delivered an incredible model so cheaply¹ is beyond the scope of this article, but DeepSeek’s reasoning model is worth reviewing because the team published their training methodology, which revealed some novel tactics and insights into how reasoning models are constructed.

Unlike OpenAI, the DeepSeek team did not use process reward models (PRMs) during reinforcement learning. Instead, to build their R1-Zero model they relied solely on reinforcement learning that judged the final output of the model. Left to its own devices, the R1-Zero model learned that longer answers were more likely to result in correct answers. As training progressed, the model increased the time it spent on each problem:

In the above image, we can see the average length of the model’s responses continually increased as it went through additional training. Rather than fine-tuning their base model with examples of reasoning data, the DeepSeek team just used RL and the model brute-forced it.

So why didn’t DeepSeek use PRMs? The team listed a few reasons, but the primary one was that using other LLMs to judge the correctness of results would, they found, lead to “reward hacking.”

A Reward Hacking Digression

“Reward hacking” is worth spending some time on. During reinforcement learning, the model being trained often finds unexpected ways to maximize its score without achieving the intended goal. This is “reward hacking,” and it’s the bane of RL engineers. Here are a few examples of reward hacking during RL, compiled by Lilian Weng:

Reward hacking: the monkey’s paw of machine learning.

As we covered earlier, initial reinforcement learning efforts involved humans. Workers, usually contractors, would provide feedback to models to teach them how to converse and be honest, helpful, and harmless.

As labs used RL more and more, the cost and speed of human workers became a bottleneck. To address this, teams started using LLMs to provide feedback. A larger model serves as a “teacher”, training a smaller and/or new “student” model. This teacher model will evaluate the student model’s results, providing feedback which is then used to further train the student model.

This “teacher model” pattern scales up much better than humans, keeping costs low and training times short (at least compared to RL from human feedback). However, the imperfect nature of LLMs makes this pattern especially susceptible to reward hacking. So much so, that the DeepSeek team eschewed the pattern entirely.

So if they didn’t use models to evaluate model output, what determined correctness during RL?

In their paper, the DeepSeek team provides two examples of how they evaluated responses. For math problems, they simply compared the provided result with the known result. For coding problems, they used a compiler to ensure the code response could in fact run and compared the compiler output to predefined test cases. No other examples were provided.
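To get a feel for what rule-based scoring looks like, here is a minimal sketch of both checks. It is not DeepSeek's code: the function names, the answer normalization, and the choice to score code by the fraction of test cases passed are all assumptions made for illustration. Python's built-in `compile` stands in for the compiler step, and a real system would run untrusted code in a sandbox.

```python
def _normalize(answer: str) -> str:
    """Strip whitespace, a trailing period, and thousands separators."""
    return answer.strip().rstrip(".").replace(",", "")


def math_reward(model_answer: str, known_answer: str) -> float:
    """1.0 if the final answer matches the known result, else 0.0."""
    return 1.0 if _normalize(model_answer) == _normalize(known_answer) else 0.0


def code_reward(source: str, func_name: str, test_cases) -> float:
    """Fraction of test cases passed; 0.0 if the code fails to compile or load.
    A real system would execute untrusted code in a sandbox."""
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            passed += func(*args) == expected
        except Exception:
            pass  # a crashing test case simply earns no credit
    return passed / len(test_cases)


candidate = "def add(a, b):\n    return a + b\n"
print(math_reward(" 1,024. ", "1024"))                            # 1.0
print(code_reward(candidate, "add", [((1, 2), 3), ((0, 0), 0)]))  # 1.0
```

Neither function needs a model or a human in the loop, which is what makes this kind of reward cheap to scale and hard to game.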

Immediately we see an issue: these RL tactics only work for problems that can be quantifiably validated. Sure, DeepSeek showed that models can learn to reason with only reinforcement learning. But this skill can only be elicited in fields with objectively testable results.

The Strengths & Limits of Reasoning Models

Reasoning models deliver outsized performance in quantitative fields, like math and coding, but only slightly move the needle in qualitative domains.

This limitation was immediately apparent with o1, whose English Literature and English Language scores closely matched non-reasoning models. DeepSeek’s R1 paper explained why the limitation existed. Reinforcement learning is necessary to teach models to reason, but only when performed at a scale where the cost and speed of human reviewers are prohibitive. To train a model to reason, you can either use a teacher model and risk reward hacking or you can limit yourself to validatable problems.

Beyond the fields of coding and math, reasoning models can help with casual queries. Their ‘thinking’ functionally extends simple prompts, providing further context that allows LLMs to home in on better answers. But this doesn’t outperform good prompts; the ceiling of total performance remains stubbornly similar to non-reasoning models.

And reasoning comes at a cost. More output costs more and takes more time, usually by several multiples. Reasoning models should be used strategically: when you want to explore the many facets of a question, plan an approach to a challenge, or when a non-reasoning model fails to solve a bug in your code.

But for most questions reasoning is overkill. It’s slower, more expensive, and barely better than a non-reasoning model for general use.

The Impact of Reasoning Models

How will the rise of reasoning models change the AI ecosystem?

The best models think longer: “Test-time compute” (aka, spending more time printing tokens to reason about a problem) is now fully established as a new scaling law.
Compute needs are shifting to inference: To support longer thinking, we need to run models longer. As scaling pretraining delivers diminishing returns (see GPT-4.5’s and Llama 4’s lackluster reception), a greater share of AI compute will be used for inference.
Models will keep getting better at testable skills: Quantitative domains – like programming and math – will continue to improve because we can use unit tests and other validation methods to create more synthetic data and perform more reinforcement learning. Qualitative chops and knowledge bank capabilities will be more difficult to address with synthetic data techniques and will suffer from a lack of new organic data.
An AI perception gap will emerge: Those who use AIs for programming will have a remarkably different view of AI than those who do not. The more your domain overlaps with testable synthetic data and RL, the more you will find AIs useful as an intern. This perception gap will cloud our discussions. (I have only seen Scott Rosenberg at Axios, of all places, touch on this growing divide.)
Usage data becomes more valuable: At the scale needed for reasoning training, human-powered RL is prohibitively expensive and slow. However: your ChatGPT usage today is slowly chipping away at that requirement. Companies successful at capturing user share – OpenAI, Anthropic, Google, and Meta – will have a hard-to-beat advantage when it comes to porting reasoning to qualitative problems.

If you want to dive deeper into these topics, I’ve written a few pieces worth your time:

¹ DeepSeek’s R1 reasoning model attained math and coding performance nearly identical to o1. On the ARC-AGI benchmark, R1 scored 15% compared to o1’s 20%… But R1 did it while being 7 times cheaper.

The Dynamic Between Domain Experts & Developers Has Shifted

2025-04-10T06:26:00-07:00

During the peak of mobile app madness, iOS and Android developers would often find themselves cornered by friends, relatives, and random people at parties.

“I’ve got a great idea for an app…”

More often than not, this dreaded sentence would be followed by a hard sell when the developer didn’t display adequate enthusiasm. If the developer didn’t act fast and feign the exact right level of approval — enough to communicate they ‘got’ the idea but not so much that they’d be asked to build it — the idea guy would advance onto hashing out NDAs, equity allocations, and asking when coding can start.

Recently, I’ve noticed the AI era is a bit different. The balance of power has shifted. Builders need domain experts as much as domain experts need builders.

You can no longer simply copy an app model with a few improvements or obsess over user feedback as you sharpen your prototype towards product-market fit.

To build a differentiated AI product you need training data and examples curated by a domain expert.

You need experts to evaluate your prompts, to speak to the model in plain language, and to hammer out the edge cases based on the failures you've observed. Sure, programmers will still code up the UI, data pipelines, dashboards, API integrations, and more. But you need domain experts to bootstrap your prompts and evaluate your failures.

Recently, the always excellent Hamel Husain put his finger on this changing dynamic in an article on improving AI products:

I recently worked with an education startup building an interactive learning platform with LLMs. Their product manager, a learning design expert, would create detailed PowerPoint decks explaining pedagogical principles and example dialogues. She’d present these to the engineering team, who would then translate her expertise into prompts.

But here’s the thing: prompts are just English. Having a learning expert communicate teaching principles through PowerPoint, only for engineers to translate that back into English prompts, created unnecessary friction. The most successful teams flip this model by giving domain experts tools to write and iterate on prompts directly.

Hamel sees this pattern, “with lawyers at legal tech companies, psychologists at mental health startups, and doctors at healthcare firms.”

I too have seen this. The first generation of AI-powered products (often called “AI Wrapper” apps, because they “just” are wrapped around an LLM API) were quickly brought to market by small teams of engineers, picking off the low-hanging problems. But today, I’m seeing teams of domain experts wading into the field, hiring a programmer or two to handle the implementation, while the experts themselves provide the prompts, data labeling, and evaluations.

For these companies, the coding is commodified but the domain expertise is the differentiator.

This morning, OpenAI underscored this point by launching a “Pioneers Program”. TechCrunch summarizes the program:

Through the Pioneers Program, OpenAI hopes to create benchmarks for specific domains like legal, finance, insurance, healthcare, and accounting. The lab says that, in the coming months, it’ll work with “multiple companies” to design tailored benchmarks and eventually share those benchmarks publicly, along with “industry-specific” evaluations.

To understand model performance and, in turn, guide model development, OpenAI is turning to outside domain experts.

The tables have turned. The AI leader is now cornering the lawyers and doctors at the parties.

“I’ve got a great idea for an eval…”

AI Chatbots Are Like Observational Comics

2025-04-08T09:28:00-07:00

Both lose their magic when talking about something you know

As a child of the 90s, I was raised watching observational comedy specials. Seinfeld, Carlin, Hedberg, Rock, Poundstone, Izzard… I watched it all. It wasn’t just the jokes I enjoyed. I was in awe of each comedian’s ability to fully engage you (and the entire audience), managing your expectations and attention. By themselves, alone on stage. By just talking.

You don’t notice how much work they’re doing – they make it appear effortless. But a decade later, I caught a post on Twitter from Seinfeld saying he’d be trying out new material at a Manhattan club that night. I quickly snagged tickets and two hours later we watched him walk on stage during a Wednesday open mic.

Watching Seinfeld try out new material gives you a peek behind the curtain. Before each bit he’d lift up his notepad, push up his glasses, hunch over, and quietly read the note aloud to himself: “Five hour energy drink… five hours is a weird amount of time.”

Every time he read his notes, you’d cringe a bit. “C’mon Jerry, that’s not a Seinfeld joke,” you’d think, “That’s a joke from somebody making fun of Seinfeld.”

The room never laughed when he read his notes.

Then he’d put the notes away, stand up straight, and switch into performance mode. The Seinfeld you’d always known magically reappeared and did the bit he’d just read to a dead room – and people ate it up. The notepad was gone¹, all the work he’d put into honing the joke was hidden. It was just him, on stage, making it look easy.

ChatGPT is kinda like that.

Chatbots are practiced performers.

We only see chatbots’ responses. The mountains of data (~250 billion webpages just from Common Crawl), cumulative decades of work from unknown contractors teaching it how to converse, and billions of dollars of GPUs… It’s all invisible. We only see the confident, polished performance. And people eat it up.

I see people who don’t know how LLMs are built treat them like all-knowing experts, trusting everything that comes out². Well… Nearly everything.

This brings us to another way AI chatbots are like observational comedy: they both lose their magic when talking about your expertise.

Nothing breaks the spell of an observational comic like a joke about something you know well. You might chuckle a bit, but the spell snaps and you think, “Actually, there’s a very good reason it’s like that…”

This effect was captured well during a Chris Rock guest appearance on King of the Hill. Voicing stand-up comic Buddha Sack, he trades “yo mama” jokes with noted propane expert, Hank Hill:

Buddha Sack: It’s been so long since yo mama’s last bath that her hairy armpits smell like propane gas.

Hank Hill: Now excuse me, hold on there fella. A joke’s a joke, but now you’ve gone too far. Propane has no natural odor. What you smell is actually put there by man for safety purposes.

After a string of well-received jokes, touching on Hank’s expertise spoiled the mood. Comedy performances are less impressive when a joke touches on your expertise.

Time and time again, when talking to people who rely on ChatGPT, Claude, Perplexity, and other general AI tools, I hear them say, “AI is incredible. It handles nearly everything I throw at it.”

“What does it fumble with?” I’ll ask.

“Well, it still gets things wrong when it comes to my line of work.”

Lawyers say this. So do accountants, marketers, researchers, salespeople, and engineers.

Chatbots know everything, but they make mistakes when it comes to things I know.

🤔

To be clear, these people frequently use chatbots to help them with their work. They just keep them on a close leash, reviewing and revising their work. This is the Intern use case: “Supervised copilots that collaborate with experts, focusing on grunt work.”

Programming is probably the best example going at the moment.

The new model of a software startup is a couple of people armed with Cursor licenses, shipping apps in a handful of months. Previously, it would have taken a full team and a couple of years to achieve this quality. Gone are the days of software startups needing piles of cash to build their product³.

This isn’t vibe coding. These are A-tier programmers using AI to help them ship faster. They design the architecture, pick the tools and libraries, sketch out the apps, then use Cursor and other AI tools to implement everything faster. The LLMs aren’t perfect – but the experts driving can easily mitigate their shortcomings.

Last week, a friend in the gaming industry told me this is happening with the best artists in gaming as well. Talented designers will sketch out models, then leave it to AI systems to perform the tedious work of constructing the wireframes. The tools aren’t perfect – designers are always tweaking and polishing the output before shipping the asset. But smaller teams of the best people are doing work only large corporations could previously achieve.

John Carmack just captured this idea perfectly:

AI tools will allow the best to reach even greater heights, while enabling smaller teams to accomplish more, and bring in some completely new creator demographics. Yes, we will get to a world where you can get an interactive game (or novel, or movie) out of a prompt, but there will be far better exemplars of the medium still created by dedicated teams of passionate developers.

This is the pattern: experts, firmly in the driver’s seat, using AI to go farther, faster.

This is a good pattern because the expert covers AI’s errors. The pattern we have to worry about is when laypeople hand over a job to AI and fully trust the output.

Personally, I don’t worry about superintelligent AGI enslaving humanity. But I do worry about people using AI to make consequential decisions affecting others’ lives in domains where they themselves lack expertise. Because that’s happening now.

Chatbots are like observational comics. They’re incredibly good at creating authority through performance, but the trick fails when you’re an expert on the topic at hand. This doesn’t mean you shouldn’t use them – but always be cautious when you’re doing consequential work, outside your area of expertise.

Remember that the best in their fields – programmers, designers, lawyers, accountants, writers, and more – never let the AI drive when it comes to their respective expertise.

A friend informs me the notebook isn’t exactly hidden: Seinfeld published a selection of his notebooks in 2020. It even includes the notes for the 5 hour energy bit. ↩
To be fair, people who do know how LLMs work are also impressed, but they have better mental models for what to ask and how far to trust them. ↩
How this dynamic is going to affect the VC ecosystem is a topic for another day… ↩

MCPs are APIs for LLMs

2025-03-18T10:23:00-07:00

Everyone is suddenly talking about Model Context Protocol (MCP), without explaining what it is. On Google Trends, searches for, “What is MCP?” are rising nearly as fast as those for “Model Context Protocol”.

Here’s the headline: MCP servers are APIs LLMs can use.

Released by Anthropic last November, the Model Context Protocol is described as, “a new standard for connecting AI assistants to the systems where data lives, including content repositories, business tools, and development environments.” But even that description is a bit jargony.

Here’s the simple version:

An MCP server exposes a bunch of endpoints, like any other API server, but it must have endpoints that list all the available functions on a server in a standard way an MCP client can understand.
MCP clients (usually LLM-powered), like Anthropic’s Claude Desktop, can then be connected to MCP servers and immediately know what tools are available for them to use.
LLMs connected to MCPs can now call MCP servers using the specs provided by the API.

That’s it! It’s incredibly simple: a standard to enable the Web 2.0 era for LLM applications, giving models plug-and-play access to tools, data, and prompt libraries.
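To make the list-then-call flow concrete, here is a minimal sketch of the two messages involved. MCP is built on JSON-RPC 2.0, with a `tools/list` method for discovery and a `tools/call` method for invocation. Everything else here is an assumption for illustration: the `get_weather` tool, its handler, and the in-memory `handle` function that stands in for a real server and transport.

```python
import json

# Hypothetical tool registry for a toy server. Real servers describe each
# tool with a name, a description, and a JSON Schema for its inputs.
TOOLS = {
    "get_weather": {
        "description": "Return the current weather for a city.",
        "inputSchema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        "handler": lambda args: f"Sunny in {args['city']}",
    }
}


def handle(request: dict) -> dict:
    """Answer one JSON-RPC 2.0 request the way a toy MCP server might."""
    method = request["method"]
    if method == "tools/list":
        # Discovery: describe every tool so the client knows what it can call.
        result = {
            "tools": [
                {
                    "name": name,
                    "description": tool["description"],
                    "inputSchema": tool["inputSchema"],
                }
                for name, tool in TOOLS.items()
            ]
        }
    elif method == "tools/call":
        # Invocation: run the named tool with the supplied arguments.
        params = request["params"]
        text = TOOLS[params["name"]]["handler"](params["arguments"])
        result = {"content": [{"type": "text", "text": text}]}
    else:
        return {
            "jsonrpc": "2.0",
            "id": request["id"],
            "error": {"code": -32601, "message": "Method not found"},
        }
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


# Step 1: the client asks which tools exist.
listing = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})

# Step 2: the client, on the model's behalf, calls one of them.
call = handle({
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "get_weather", "arguments": {"city": "Oakland"}},
})

print(json.dumps(listing, indent=2))
print(json.dumps(call, indent=2))
```

A real client and server also negotiate capabilities when they first connect and exchange these messages over a transport such as stdio. The sketch skips both to keep the shape of the exchange visible.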

If you want to try MCP (and you’re on a Mac), do this:

Download Claude Desktop. Install it, start it up, try it out.
Download iMCP, a macOS app that connects your Apple calendar, contacts, location, messages, reminders, and weather to an MCP server it manages. (You can turn these on one by one. If you’re cautious about exposing sensitive data, just turn on weather and location for the demo. But these are trusted devs who’ve been active in the indie Mac development community forever). Here’s a direct download link
Follow the instructions described on the iMCP page then launch (or relaunch) Claude Desktop.

After clicking through a few privacy and access alerts, you’ll see a little hammer icon in the bottom right of your Claude text field. If you drop it down, you’ll see each tool iMCP provides, which Claude Desktop obtained by calling a “list” function each MCP server must provide.

Now, ask Claude, “What’s the weather today?” A few more permission dialogues will arise and you’ll see Claude calling iMCP to get your location and weather.

You can build whatever type of MCP server you want, and people do. There are servers that allow LLMs to access your Gmail, chat on Discord, control Ableton Live, search Notion, access databases, and so much more. Unfortunately, most of these servers have relatively tedious setup processes involving Docker and (usually) obtaining developer access keys from whatever service is being accessed. There have been some steps recently to manage and run servers via the command line, but I have yet to see an easy-to-use GUI app.

But it’s early days. MCP is just 113 days old, yet the amount of experimentation and momentum is incredibly high.

I predict most companies with an AI and/or an API strategy will launch an MCP this year. If OpenAI adopts MCP in its clients, make it 6 months.

Overcoming Bad Prompts with Help from LLMs

2025-03-16T19:17:00-07:00

Two Examples of Features – from OpenAI & Anthropic – Which Mitigate Bad Prompts

Among new or novice prompters, prompts are generally too short. They lack context, rarely feature examples, and provide few descriptions. Chatbot makers and model providers have come to understand the tyranny of the black text box and adopted measures to help their users.

One of my favorite measures is how ChatGPT rewrites bad image-generation prompts before it hands them to DALL-E. Back in November of 2023, I wrote:

You don’t have to remember the incantations and phrase books needed to get good results. You don’t have to hang out on the Stable Diffusion Discord and make notes of keywords strings to conjure the results you may want. DALL-E in ChatGPT is better for nearly everyone than DALL-E, Stable Diffusion, or Midjourney on their own.

Revisiting this feature, I decided to poke behind the curtains and see how the prompts are being modified¹.

Visiting ChatGPT, I prompted it with the terrible prompt, “Create an image of a statue thinking.”

Behind the scenes, ChatGPT rephrased that as:

A detailed statue of a human figure in a thinking pose, similar to Rodin’s ‘The Thinker.’ The statue is made of weathered stone, sitting on a pedestal, with a contemplative expression. The setting a peaceful outdoor environment, with soft lighting and a slightly cloudy sky.

Quite a difference! We can see the details from the modified prompt, plain as day:

But this raises a question: do all our image prompts get rephrased in ChatGPT?

There are plenty of scenarios where someone wouldn’t want their prompt edited, but there’s already plenty of toggles and drop-downs in ChatGPT’s interface. Adding another (“Don’t modify this prompt”) wouldn’t just add clutter, but needlessly confuse users unaware of this feature.

Thankfully, OpenAI has an elegant solution.

Before we reveal the trick, a brief detour…

Bad prompting isn’t limited to end users. People building applications hitting LLM APIs commonly write short prompts that lack detail, which leads to unsatisfactory and inconsistent results. More than a few times, I’ve chatted with frustrated developers who think LLMs are over-hyped because a bad prompt couldn’t enable a simple feature. The problem, every time, is their prompt.

And the tool I pointed each of them to is truly a hidden gem: Anthropic’s prompt generator.

Log into their API console and hit the “Generate a prompt” button. Describe the task you want an LLM to complete, hit “Generate”, and dig into your new, effective prompt.

Let’s give the prompt generator the following input, inspired by ChatGPT’s DALL-E prompt improvement feature:

A user will provide a prompt to generate an image from an image model. This prompt will likely be too short, lack description, and will generate a boring or unintended output. Given this input prompt, generate a better prompt that is more descriptive and likely to generate a more polished image. Provide plenty of details and clearly describe the image you wish to create based on the prompt.

Anthropic’s tool generated a much improved 236-word prompt – 3.5x longer than our input! It set the context, specified an output format, enumerated specific guidelines, spurred the LLM to reason first, and reiterated the goal simply.

Populating this prompt with our, “Create an image of a statue thinking,” prompt, we get some reasoning followed by this enhanced prompt:

A weathered bronze statue of a philosopher in a contemplative pose, seated on a stone pedestal with one hand supporting its chin, situated in an overgrown classical garden at dusk. The sculpture’s patinated surface catches the warm golden light of the setting sun, creating dramatic shadows that accentuate its pensive expression and furrowed brow. Small vines and moss partially climb the base of the monument, while a shallow reflecting pool in front mirrors the statue’s thoughtful silhouette in photorealistic detail.

Fantastic. Anthropic’s prompt generator truly is an incredible tool. For each of the developers who reached out to me, one try at the prompt generator and some light editing completely fixed their features.

The prompt generator should be the first stop for any developer, with little AI experience, looking to ship an LLM-powered function.

Armed with our Anthropic-generated enhanced prompt, we can digress back to ChatGPT and its DALL-E prompt editing.

If we drop our enhanced prompt into ChatGPT and hit enter…it doesn’t edit our prompt.

The app recognizes this prompt as sufficiently detailed and passes it along – unaltered – to DALL-E. This is excellent UX, a perfect example of a quiet AI feature: if you give ChatGPT a bad image prompt, it’ll improve it; but if you give it a good one, it won’t interfere.

If you want to follow along, jump into the developer tools of your web browser and hang out in the “Network” tab. There you can observe the prompt modifications as they’re passed around as JSON payloads. ↩

Building an Easier to Use FFmpeg With LLMs

2025-02-24T10:40:00-08:00

Chatbots are great for navigating complex domain-specific languages and APIs, especially if you rarely interact with these interfaces. This use case surfaced quickly with ChatGPT, which proved to be excellent at writing and explaining regular expressions.

But my favorite example of this use case is writing FFmpeg incantations.

For the unfamiliar, FFmpeg is a command-line tool for converting video and audio files. It is ridiculously powerful and ridiculously complex. Most people I know use Google as the primary interface for FFmpeg: search for the job to be done, copy-and-paste the command. This pattern quickly migrated to ChatGPT and Claude, which proved excellent at FFmpeg incantations.

But it can get even easier…

Recently, Simon Willison added an “extraction” feature to his llm utility, enabling us to return only the code from our LLM response, with none of the preamble. The tool looks for the first instance of Markdown-fenced code and returns only that block.
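The extraction behavior described above, grabbing the first fenced code block and discarding the chatter around it, can be sketched in a few lines. This is an illustration of the idea, not the `llm` tool's actual implementation; the function name and regular expression are assumptions.

```python
import re
from typing import Optional

# Built from single backticks so this example can itself sit inside a fence.
FENCE = "`" * 3

# Opening fence, an optional language tag, then everything up to the next fence.
PATTERN = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_first_code_block(response: str) -> Optional[str]:
    """Return the contents of the first fenced block, or None if there is none."""
    match = PATTERN.search(response)
    return match.group(1).strip() if match else None


# A made-up model response with preamble and sign-off around the command.
response = (
    "Sure! Here is the command you asked for:\n"
    + FENCE + "bash\n"
    + "ffmpeg -i sample.mp4 -vn sample.mp3\n"
    + FENCE + "\n"
    + "Let me know if you need anything else."
)

print(extract_first_code_block(response))
```

Returning `None` when no fence is found lets a calling script decide whether to fall back to the full response or stop, which matters once the output is being passed to `eval`.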

Armed with this feature, we can write a simple shell script to streamline this workflow:

#!/usr/bin/env zsh

# Check whether the user wants to execute the command (-x),
# and treat the first non-flag argument as the prompt
flag_x=false
prompt=""
for arg in "$@"; do
    if [[ "$arg" == "-x" ]]; then
        flag_x=true
    elif [[ -z "$prompt" ]]; then
        prompt="$arg"
    fi
done

# Hit the model and get the ffmpeg command
output=$(llm "$prompt" \
    --system "You are an expert at writing commands for ffmpeg. You will be given prompts describing what the user wants to do with ffmpeg. To the best of your abilities, translate these plain language descriptions into a single one-liner that calls ffmpeg, with all the appropriate flags and input/output specifications. Do not use the variable 'total_frames' in any select statement. Ensure the command is wrapped as a code block." \
    --extract)

# Print or execute the command
if [[ "$flag_x" == true ]]; then
    echo "$output"
    eval "$output"
else
    echo "$output"
fi

Stashing this script in my path as ffmsay, I can now run ffmsay -x 'Do things to video example.mp4'. The -x flag tells the script to not only print, but execute the command.

And it works great! Testing it on a random YouTube video I downloaded, here are some tasks I ran:

“Extract the audio from the sample.mp4 video and save it as a stereo mp3 with lossless compression.”

This yielded: ffmpeg -i sample.mp4 -vn -acodec libmp3lame -b:a 320k -ac 2 sample.mp3

“Convert the video sample.mp4 into a video the size of a postage stamp in black and white color.”

Yielding: ffmpeg -i sample.mp4 -vf "scale=50:50,format=gray" -sws_flags lanczos output.mp4

“Extract 9 random keyframes, as a grid in a single image named grid.png, from the video sample.mp4.”

Yielding: ffmpeg -i sample.mp4 -vf "select='eq(pict_type,I)',scale=320:180,tile=3x3" -frames:v 9 grid.png

This threw a complaint about the output filename not having a numeric variable, but did the job perfectly.

“Render a random snippet 2 seconds long from sample.mp4 as an animated GIF. Adjust the settings for a lightweight website GIF.”

This yielded the most complex call yet: ffmpeg -i sample.mp4 -t 2 -ss $(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 sample.mp4 | awk -v seed=$RANDOM 'BEGIN{srand(seed)} {print rand()*($1-2)}') -vf "fps=10,scale=320:-1:flags=lanczos,split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse" -f gif output.gif

And worked perfectly!

Interfacing with rarely used, complex interfaces is a perfect LLM use case. Previously, API and UX designers have had to simplify interfaces for powerful tools to produce reasonable surface areas for casual users. But with LLMs we can expose all the complexity with few of the downsides.

This FFmpeg example is one we can handle out-of-the-box with little precision prompting. For newer or less commonly used interfaces, simple system prompting and/or a bit of fine-tuning can yield similar results.

What if AGI is Free?

2025-02-11T11:30:00-08:00

What if super-intelligent AI¹ arrives but it can be run by anyone, basically for free?

To me, this is one of the most interesting questions in the field. I’m not expecting AGI to arrive anytime soon, but I do expect models to keep getting more capable (especially for objective tasks like math and programming) and keep getting more efficient.

I had an interesting meeting last week with a founder who’s instructed his team to never worry about LLM costs – and over the last year it’s been a great bet as costs have gone down while usage is up. He’s growing his own value, in the form of the apparatus built around the models, while the market value of the models themselves is plummeting.

The tasks models could perform a year ago are 10x cheaper today. Sometimes the drop is even more significant. As Sam Altman wrote this week:

The cost to use a given level of AI falls about 10x every 12 months, and lower prices lead to much more use. You can see this in the token cost from GPT-4 in early 2023 to GPT-4o in mid-2024, where the price per token dropped about 150x in that time period. Moore’s law changed the world at 2x every 18 months; this is unbelievably stronger.
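To put the quoted rates on the same footing, it helps to convert each to a per-year factor. The sketch below does that arithmetic. The only assumption is the length of the "early 2023 to mid-2024" window, taken here as roughly 15 months; the quote does not give exact dates.

```python
def annualized_factor(factor: float, months: float) -> float:
    """Convert 'changes by `factor` every `months` months' to a per-year factor."""
    return factor ** (12 / months)


# Moore's law as quoted: 2x every 18 months.
moore = annualized_factor(2, 18)

# The quoted AI cost trend: 10x every 12 months.
ai_cost = annualized_factor(10, 12)

# Assumption: the GPT-4 to GPT-4o window is treated as about 15 months.
gpt4_to_4o = annualized_factor(150, 15)

print(f"Moore's law: {moore:.2f}x per year")
print(f"AI cost trend: {ai_cost:.0f}x per year")
print(f"GPT-4 to GPT-4o (assumed 15 months): {gpt4_to_4o:.0f}x per year")
```

On a yearly basis, Moore's law works out to roughly 1.6x, against 10x for the quoted cost trend. The GPT-4 to GPT-4o example lands well above 10x for any plausible reading of the window.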

But when it comes to the efficacy of these models, it’s not as unified a story.

Sam expects, “the impact of AGI² to be uneven. Although some industries will change very little, scientific progress will likely be much faster than it is today; this impact of AGI may surpass everything else.”

I agree the impact will be uneven, mostly because of the capabilities of the best models, rather than just structural contexts. Models will continue to excel at objectively measurable tasks – like math and programming – while lagging behind in qualitative tasks, thanks to the nature of synthetic data. As I wrote in December:

Models will get better at testable skills: Quantitative domains – like programming and math – will continue to improve because we can create more novel, massive, synthetic datasets thanks to unit tests and other validation methods. Qualitative chops and knowledge bank capabilities will be more difficult to address with synthetic data techniques and will suffer from a lack of new organic data.

So then let’s hone the question further: what if AI puts programmers out of a job, but everyone with a laptop has access to free programmers?

Sure, there’s the caveats about how skilled or unskilled these roles and models will be, but if quality and efficiency continue to march on, hand in hand, the two-fold effect of taking away work while simultaneously gifting us labor will be interesting.

I wanted to include a footnote about how I’m using the term “AGI” here, since I generally dislike the way it is used for generating hype, fear, and whatnot. When I say, “AGI,” I mean, “an LLM-powered application which can do a human job, autonomously.” In this case, it might mean watching for GitHub issues and fixing them. I do not use the term “AGI” to suggest that these models will actually be intelligent, sentient, or anything other than programs capable of performing a job (don’t make me tap the sign!) ↩
What’s hilarious is that Sam too felt the need to include a footnote about his usage of “AGI”! Stating, “By using the term AGI here, we aim to communicate clearly, and we do not intend to alter or interpret the definitions and processes that define our relationship with Microsoft. We fully expect to be partnered with Microsoft for the long term. This footnote seems silly, but on the other hand we know some journalists will try to get clicks by writing something silly so here we are pre-empting the silliness…” While I get the frustration, I think so much of this is due to OpenAI and Sam himself taking advantage of this vagueness when pitching OpenAI or talking about AI in general. Sometimes it’s just computer software, other times it’s super intelligence right around the corner. The language is mushy, but he’s leveraged that mushiness. ↩

A Gentle Intro to Running a Local LLM

2025-02-04T10:31:00-08:00

How to Run a Chatbot on Your Laptop (and Why You Should)

Open, small LLMs have gotten really good. Good enough that more people – especially non-technical people – should be running models locally. Doing so not only provides you with an offline, privacy-safe, helpful chatbot; it also helps you learn how LLMs work and appreciate the diversity of models being built.

Following the LLM field is complicated. There are plenty of metrics you can judge a model on and plenty of types of models that aren’t easily comparable to one another. Further, the metrics that enthusiasts and analysts use to evaluate models might not matter to you.

But there is an overarching story across the field: LLMs are getting smarter and more efficient.

And while we continually hear about LLMs getting smarter, before the DeepSeek kerfuffle we didn’t hear so much about improvements in model efficiency. But models have been getting steadily more efficient, for years now. Those who keep tabs on these smaller models know that DeepSeek wasn’t a step-change anomaly, but an incremental step in an ongoing narrative.

These open models are now good enough that you – yes, you – can run a useful, private model for free on your own computer. And I’ll walk you through it.

Installing Software to Run Models

Large language models are, in a nutshell, a collection of probabilities. When you download a model, that’s what you get: a file(s) full of numbers. To use the model you need software to perform inference: inputting your text into a model and generating output in response to it. There are many options here, but we’re going to pick a simple, free, cross-platform app called Jan.

Jan is an open-source piece of software that manages and runs models on your machine and talks to hosted models (like the OpenAI or Anthropic APIs). If you’ve used ChatGPT or Claude, its interface will feel familiar:

Go download Jan and come back when it’s installed.

In the screenshot above we’re in the chat tab (see that speech bubble icon in the upper left?) Before we can get started, we’ll need to download a model. Click the “four square” icon under the speech bubble and go to the model “hub”.

Making Sense of Models

In Jan’s model hub we’re presented with a list of models we can download to use on our machine.

We can see the model name, the file size of the model, a button to download the model, and a caret to toggle more details for a given model. Like these:

If you are new to this, I expect these model names to be very confusing. The LLM field moves very fast and has evolved language conventions on the fly which can appear impenetrable to the unfamiliar (don’t make me tap the sign!).

But we can clear this up. You don’t need to understand the following to run a local model, but knowing the language conventions here will help demystify the domain.

Let’s go from left to right:

Family: Models come in families, which helps you group them by the teams that make them and their intended use (usually). Think of this as the model’s brand name. Here, it’s “Llama”, a family of open models produced by Meta.
Generation: The generation number is like a version number. A larger number means a more recent model, with the value to the left of the decimal indicating major generations. The number to the right of the decimal might indicate an incremental update or signify a variant for a specific use case. Here we’re looking at 3.2, a generation of Llama models which are smaller sized and designed to run “at the edge” (aka, on a device like a phone or PC, not a remote server).
Parameters: As I said, models are essentially collections of probabilities. The parameter count is the number of probabilities contained in a given model. This count loosely correlates to a model’s performance (though much less so these days). A model with more parameters will require a more powerful computer to run. Parameter count also correlates with the amount of space a model takes up on your computer. This model has 1 billion parameters and clocks in at 1.23GB. There is a 3 billion parameter option in the Llama 3.2 generation as well, weighing in at 3.19GB.
Variant: Different models are built for different purposes. The variant describes the task for which a model is built or tuned. This model is made to follow instructions, hence, “Instruct.” You will also see models with “Chat” or “Code” (both self-explanatory). “Base” we tend to see less of these days, but it refers to models that have yet to be tuned for a specific task.
Quantization: Quantization is a form of model compression. We keep the same number of parameters, but we reduce the details of each. In this case, we’re converting the numbers representing the probabilities in the model from highly detailed numbers with plenty of decimal places to 8-bit integers: whole numbers between -128 and 127. The “Q8” here says the weights in the model have been converted to 8-bit integers. We’re saving plenty of space in exchange for some potential model wonkiness during usage.
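The parameter and quantization entries above can be made concrete with a small sketch. It shows one simple scheme, symmetric 8-bit quantization with a single scale for the whole list, plus a back-of-envelope file size estimate. This is an illustration under stated assumptions, not the exact method used to produce the Q8 files in Jan's hub; real formats typically store a separate scale for each small block of weights. The function names and sample weights are made up.

```python
def quantize_int8(weights):
    """Map floats onto 8-bit integers using one shared scale (symmetric scheme)."""
    # The largest magnitude maps to 127; fall back to 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [max(-128, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


def approx_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Rough file size: parameter count times bytes per weight, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Made-up weights standing in for a sliver of a model.
weights = [0.8312, -0.0047, 0.2501, -1.27]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)  # → [83, 0, 25, -127]
print(restored)   # close to the originals, but not identical
print(approx_size_gb(1e9, 16), approx_size_gb(1e9, 8))  # → 2.0 1.0
```

The small value -0.0047 collapses to 0, which is the kind of lost detail behind the "potential model wonkiness" mentioned above. The size estimate also shows why a 1 billion parameter model at 8 bits lands near a gigabyte. The listed file sizes run somewhat higher, likely because model names round the parameter count and the files carry scales and metadata alongside the weights.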

Don’t worry too much about the quantization notation – you don’t need to know it. Jan very helpfully provides us with warnings if a model won’t run well on our machines:

When getting started, feel free to stick with the model recommendations I list below. But if you want to explore, download and try anything that can run on your machine. If it’s slow, non-sensical, or you just don’t like it: delete it and move on.

The models listed here are curated by Jan. But Jan can run any text model hosted on Hugging Face (a website for sharing datasets and machine learning models) in the GGUF format (a file format for sharing models). And there’s plenty of them.

But let’s put a pin in this.

For now, go back to Jan’s model hub and hit the “Download” button for “Llama 3.2 3B Instruct Q8”. If Jan says this model will be, “Slow on your device,” download “Llama 3.2 1B Instruct Q8”. It’s smaller, but still great for its size. Come back when the download is complete.

Chatting with Jan

When Llama 3.2 is downloaded, switch back to the chat pane by clicking the speech bubble in the upper-left. In the text field at the bottom, there’s a model selector. Check to make sure it’s set to the model you just downloaded:

As you download more models, you can switch between them here.

Typing text in the text field and hitting enter creates a new thread (just like ChatGPT or Claude).

Here’s what the 3B parameter model says when we ask, “Why might someone want to run a LLM on their laptop, rather than using a cloud provider?”

While running a Large Language Model (LLM) on a cloud provider can be convenient and scalable, there are some reasons why someone might prefer to run an LLM on their own laptop:

**Cost savings**: Cloud providers charge by the hour or usage, which can add up quickly. Running an LLM locally can save money in the long run, especially if you only need to use it occasionally.
**Data ownership and control**: When you host your LLM on a cloud provider, you may have limited control over how your data is stored, processed, and shared. Running it locally ensures that you maintain full control over your data and can ensure its privacy and security.
**Customization and flexibility**: By running an LLM locally, you can customize the environment to suit your specific needs without relying on cloud provider configurations or APIs. This can be particularly useful for research or development purposes where you need to experiment with different models or hyperparameters.
**Faster training times**: Training a large language model can take significant time and computational resources. Running it locally means that you don't have to worry about network latency, API timeouts, or other cloud-related issues that might slow down the training process.
**Offline access**: With a local LLM, you can work on projects without relying on an internet connection, making it ideal for situations where connectivity is limited or unreliable.
**Reduced dependency on cloud providers**: By running your own LLM locally, you reduce your dependence on cloud providers and their infrastructure. This can be beneficial if you're concerned about the stability of cloud services or prefer to maintain a more decentralized approach.
**Self-hosting and security**: Running an LLM locally allows you to implement additional security measures and ensure that your model is not exposed to external attacks or vulnerabilities.

However, it's worth noting that running a large language model on a laptop can be computationally intensive and may require significant resources (e.g., GPU power). If you're planning to run a large-scale LLM, cloud providers often offer more suitable solutions with optimized infrastructure and economies of scale.

And here’s the 1B parameter model response to the same question:

Running a Large Language Model (LLM) on a laptop can be beneficial for:

1. **Small projects or proof-of-concepts**: For smaller tasks, like experimenting with different models or hyperparameters, running an LLM locally might be sufficient.
2. **Research and development**: Researchers may need to run small-scale experiments or prototype their ideas without being tied to cloud infrastructure.
3. **Personal use cases**: Individuals who only occasionally work on language-related projects can benefit from local hosting, as it provides a cost-effective alternative to cloud providers.

However, running an LLM on a laptop has some limitations:

1. **Limited scalability**: As the model size or complexity increases, local hosting may become impractical due to computational constraints.
2. **Data storage and processing**: Running an LLM locally requires significant storage space for data and intermediate results, which can be challenging in laptops with limited disk capacity.
3. **Network latency and security concerns**: Local connections can introduce latency and compromise the model's security, especially if you're handling sensitive or proprietary data.

To overcome these limitations, researchers and developers often use:

1. **Cloud-based distributed training**: Training large models on a cloud provider, which allows for scalability and parallel processing.
2. **Serverless computing**: Using serverless platforms that provide a pay-as-you-go pricing model, enabling users to scale up or down as needed.
3. **Edge computing**: Deploying computations closer to the source of data, reducing latency and improving security.

Ultimately, whether running an LLM on a laptop or cloud provider depends on your specific needs, project requirements, and preferences.

Both are pretty good for such tiny models! The 3B-sized Llama does a better job, sticking to the question and providing numerous reasons. But this comes with a speed cost: on my Mac with 64GB of RAM, Llama 3B produces ~50 words-per-minute while Llama 1B produces ~113.

Figuring out which model is right for you takes experimentation. But I have a few recommendations to get you started.

Recommended Models

Unless otherwise noted, you can download these models in Jan by copy-and-pasting the URL in the model name link.

For people with (at least) a Mac with 32GB of RAM or an NVIDIA RTX 4090:

Mistral-Small-24B-Instruct is a bit of a revelation. While there have been open GPT-4o class models that fit on laptops (Llama 3.3 comes to mind), Mistral-Small is the first one I’ve used whose speed is comparable as well. For the last week, I’ve been using it as my first stop – before Claude or ChatGPT – and it’s performed admirably. Rarely do I need to try a hosted model. It’s that good.

If you can run the “Q4” version (it’s the 13.35GB option in Jan, when you paste in the link), I strongly recommend you do. This is the model that inspired me to write this post. Local models are good enough for non-nerds to start using them.

Also good: Microsoft’s Phi-4 model is a bit smaller (~8GB when you select the “Q4” quantized one) but also excellent. It’s great at rephrasing, general knowledge, and light reasoning. It’s designed to be a one-shot model (basically, it outputs more detail for a single question and isn’t designed for follow-ups), and excels as a primer for most subjects.

For people who want to see a model “reason”:

Yeah, yeah, yeah…let’s get to DeepSeek.

DeepSeek is very good, but you’re not likely going to be able to run the original model on your machine. However, you likely can run one of the distilled models the DeepSeek team prepared. Distillation is a strategy to create lightweight versions of large language models that are both efficient and effective. A smaller model ‘learns’ by reviewing the output from a larger model.

In this case, DeepSeek R1 was used to teach Qwen 2.5, an excellent set of smaller models, how to reason. (DeepSeek also distilled Llama models.) Getting this distilled model up and running in Jan requires an extra step, but the results are well worth it.

First, paste the DeepSeek-R1-Distill-Qwen-14B model URL into Jan’s model hub search box. If you’re on Mac, grab the “Q4_1” version, provided you can run it. On Windows, grab the “Q4_K_M”. (If either of those is flagged as not being able to run on your machine, try the 7B version. It’s about 5GB.)

Once the model downloads, click the “use” button and return to the chat window. Click the little slider icon next to the model name (marked in red, below). This toggles the thread’s settings. Toggle the “Model Settings” dropdown (marked in blue) so that the “Prompt template” is visible.

Your prompt template won’t look like the one in the image above. But we’re going to change that. Paste the following in the prompt template field, replacing its current contents:

{system_message}
<｜User｜>
{prompt}
<｜end▁of▁sentence｜>
<｜Assistant｜>

Think of prompt templates as wrappers that format the text you enter into a format the model expects. Templates can vary, model by model. Thankfully, Jan will take care of most of this for you if you stick to their model library.
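
To make the "wrapper" idea concrete, here's a rough sketch of what an app does with a template before handing text to the model. The template below is a simplified, hypothetical one (not DeepSeek's), and `render_prompt` is a name I made up for illustration. Jan's actual implementation may differ.

```python
import re

# Hypothetical, simplified template. Real templates use model-specific tokens.
TEMPLATE = "{system_message}\nUSER: {prompt}\nASSISTANT:"


def render_prompt(template, **fields):
    """Fill each {placeholder} in one pass, leaving unknown placeholders alone."""
    return re.sub(
        r"\{(\w+)\}",
        lambda match: fields.get(match.group(1), match.group(0)),
        template,
    )


rendered = render_prompt(
    TEMPLATE,
    system_message="You are helpful.",
    prompt="Why run an LLM locally?",
)
print(rendered)
```

The model never sees your bare question. It sees the filled-in template, which is why a mismatched template can produce odd or rambling replies.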

This change we’ve made, though, will let you see this model “think”. For example, here’s how it replies to our question about why one might want to use local LLMs. The reasoning is bracketed by <think> and </think> tokens and prefaces the final answer.

Why Bother?

There are giant, leading LLMs available for cheap or free. Why should one bother with setting up a chatbot on their PC?

Llama 3.2 3B already answered this for us, but…

It’s free: These models work with the PC you have and require no subscriptions. Your usage is only limited by the speed of the model.
It’s 100% privacy-safe: None of your questions or answers leave your PC. Go ahead, turn off your WiFi and start prompting – everything works perfectly.
It works offline: The first time I used a local model to help with a coding task while flying on an airplane without WiFi, it felt like magic. There’s something crazy about the amount of knowledge these models condense into a handful of gigabytes.
It’s customizable: We only scratched the surface here by changing our prompt template. But unfold the “Inference Settings” tab and take a look at the levers waiting to be pulled. Discussing these all is beyond the scope of this article, but here’s a quick tip: the “Temperature” setting effectively controls how much randomness is added during inference. Try setting it to each extreme and see how it changes your responses.
It’s educational: This is the main reason you should bother with local LLMs. Merely grabbing a few models and trying them out demystifies the field. This exercise is an antidote to the constant hype the AI industry fosters. By getting your hands just slightly dirty, you’ll start to understand the real-world trajectory of these things. And hey, maybe the next DeepSeek won’t be so surprising when it lands.
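
For the curious, here's a small sketch of what the "Temperature" lever mentioned above does. I'm assuming Jan's setting behaves like the standard softmax temperature used by most inference engines. The scores below are invented for three imaginary candidate tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a temperature above zero."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max so exp() stays numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)

print([round(p, 3) for p in cold])  # nearly all the weight on the top token
print([round(p, 3) for p in hot])   # weight spread across all three
```

Low temperatures make the model pick its favorite token almost every time. High temperatures give the long shots a real chance.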

So much of the coverage around LLMs focuses on raising the ceiling: the improved capabilities of the largest models. But beneath this noise the floor is being raised. There’s been incredible progress: the models you can run on moderately powered laptops perform as well as the largest models from this time last year. It’s time to try a local model.

DeepSeek as a Power Object

2025-01-31T10:10:00-08:00

The wave of DeepSeek takes reveal more about our own hopes and concerns than they do about DeepSeek.

One of my college professors, Donna Haraway, used the term “power objects” to define topics that we don’t fully understand, yet recognize to be important. As a result, we end up projecting onto these “power objects” our own concerns.

DNA was a power object. As were PET Scans, Big Data, the Internet, nuclear power, and so much more. Anything new we believe to be important while not understanding why it’s important is a “power object.” (At least until we understand it sufficiently to remove the mystique.)

DeepSeek is a textbook “power object,” with every hot take spotlighting a person’s desires and fears. Here are some I’ve noticed:

To AI Boosters, DeepSeek proves LLM progress won’t slow down.
To AI Skeptics, DeepSeek proves AI companies have no moat.
To Open Source Advocates, DeepSeek proves open development is superior.
To AI Doomers, DeepSeek proves we aren’t being careful enough.
To Security Researchers, DeepSeek proves the risks of backdoors in models.
To Privacy Advocates, DeepSeek proves LLM services are vectors for collecting sensitive data.
To China Hawks, DeepSeek proves we need more US sanctions and investment in US champions.
To China Doves, DeepSeek proves sanctions don’t work and only inspire new innovation.
To NVIDIA Bears, DeepSeek proves we won’t need new data centers full of NVIDIA clusters.
To NVIDIA Bulls, DeepSeek proves we’ll need more GPUs for synthesizing training data and running inference.

…and so on!

Whenever these big waves of takes hit – and a wave of DeepSeek’s magnitude is rather rare – be aware of “power objects” and learn from the takes.

Mistral Small & Human-Centric Benchmarks

2025-01-30T10:55:00-08:00

I really like small models. They’re fast, cheap, and local – the perfect foundation for fashioning a cog in a compound AI pipeline.

Today, Mistral released Mistral Small 3, “a latency-optimized 24B-parameter model released under the Apache 2.0 license.” At 20GB, Mistral Small 3 runs well on my Mac Studio with 64GB of RAM, though Mistral notes it fits, “in a single RTX 4090 or a 32GB RAM MacBook once quantized.” So far it’s performed very well for me – knocking out code questions, extraction, rephrasing, and other general tasks. It feels on-par with Llama 3.3 70B and Qwen 32B.

But what I really like is how they benchmarked the model. Here’s a screenshot of Mistral Small 3’s Hugging Face page:

I love this. Rather than highlighting benchmarks irrelevant to their target audience (remember: you really should have your own eval), they’re publishing a quantified table of “vibes.” Below the chart they note:

We conducted side by side evaluations with an external third-party vendor, on a set of over 1k proprietary coding and generalist prompts.

Evaluators were tasked with selecting their preferred model response from anonymized generations produced by Mistral Small 3 vs another model.

We are aware that in some cases the benchmarks on human judgement starkly differ from publicly available benchmarks, but have taken extra caution in verifying a fair evaluation. We are confident that the above benchmarks are valid.

Translated: “Yeah, it’s imperfect, but here’s how people feel using this.” It’s a self-administered Chatbot Arena¹ (though they hired a 3rd party to execute it).

I don’t begrudge the big open, standard evals. They push our model development further by putting out crystal-clear challenges for teams to develop against. But it feels like we’ve been over-fitting to the most popular evals lately. Rough tests like Mistral’s and Chatbot Arena at least attempt to bring some qualitative metrics to the table.

(Personally, I find DeepSeek-R1 to be a prime example of this. It nailed many metrics, leading to the dramatic headlines, but for most tasks I find myself turning to Claude or Llama 3.3, locally.)

Mistral’s vibes preference benchmark here is very welcome. It’s simple and I hope more

¹ I’ve noticed Chatbot Arena isn’t cited or discussed as often as it was. In the past, mystery models on the Arena generated waves of speculation and new leaders created waves of headlines. I chalk up its decline in popularity to the speed of the field these days – it takes time to establish a good score on Chatbot Arena, which doesn’t fit with the cadence of splashy, spiky launches.

Is OpenAI Using Trap Tokens?

2025-01-29T10:10:00-08:00

OpenAI is accusing DeepSeek of improperly using OpenAI models to train DeepSeek’s own. Though they’ve yet to provide evidence, I can’t help but wonder: is OpenAI using “Trap Tokens”?

Let’s set aside the “pot calling the kettle black” jokes and ask how OpenAI might know DeepSeek used their models. They’re likely finding methods to generate output from DeepSeek models that only OpenAI models would produce.

For example, using a local model we can ask DeepSeek-r1, “Why are you called ChatGPT?” To which it replies as if it is ChatGPT:

Alright, the user asked, "Why are you named ChatGPT?" I need to explain the reasoning behind my name.

First, I should break down "ChatGPT" into its components: "Chat," "GPT," and "4." 

"Chat" refers to the conversational aspect since I'm designed for dialogue. It's user-friendly and approachable.

"GPT" stands for Generative Pre-trained Transformer. That highlights my technical base, which is a type of AI model trained on vast data to generate text.

The number "4" signifies that I'm part of the GPT-4 series, developed by OpenAI as an advanced version of previous models like GPT-3.

Putting it all together, ChatGPT reflects both my purpose in facilitating conversations and my underlying technology.

I'm named **ChatGPT** because:

1. **"Chat"** refers to the conversational nature of my design—my primary function is to engage in natural language conversations with users.

2. **"GPT"** stands for "Generative Pre-trained Transformer," which describes the type of AI model I am based on. The GPT architecture is a transformer-based model that has been pre-trained on a vast amount of text data, enabling it to generate human-like text.

3. **"4"** refers to the specific version of the GPT model I'm built upon—GPT-4, which is the latest and most advanced version developed by OpenAI as of my knowledge cutoff in July 2024.

So, "ChatGPT" combines these elements to describe a conversational AI based on the GPT-4 model!

Notably, this doesn’t work on DeepSeek’s hosted models, which appear to have added filters to block this request and other tactics (including asking for details about its API, to which it previously responded with OpenAI links).

But what I want to know is if OpenAI is using “Trap Tokens,” training their models to output specific, erroneous data purely to catch copy-cats. I wrote about the potential for this way back in May of 2023, in the context of content producers establishing provenance of training data:

Beyond prompt injection, will we see intentionally unique phrases in lyrics so owners can prove their work was used as training data? Such a tactic has precedence in Trap Streets, intentionally false details inserted into street maps to catch plagiarists. “Trap Tokens” in content could be designed: unique phrases that have a low likelihood of occurring elsewhere in training data and are sufficiently novel to not occur as hallucinations. Being able to elicit Trap Tokens could establish proof a model references your data.

Has OpenAI adopted this tactic with their own models? Perhaps they’ll share evidence suggesting they have, though it’s easy to evoke OpenAI responses from DeepSeek without knowledge of them.

On Authentic Data

2025-01-20T10:24:00-08:00

The emergence of the term “authentic data” highlights a new problem

I’ve been hearing the term “authentic data” more and more, these days. At the GeoBuiz summit last week, the term was uttered enough to hit a critical mass in my head, helping me realize its novelty.

Let’s quickly define it:

Authentic Data

Data that is collected from real-world events, interactions, or observations. Authentic data is *not* artificially generated or manipulated, by LLMs or other automated mechanisms.

The rise of “authentic data” illustrates a new concern: the emergence of “generated data”, data created not from observations but from machine learning and AI models. It’s a necessary distinction – one we didn’t have to make until recently – though its importance will vary by use case.

Generated Data

Data that is artificially created, often by AI models, and used to augment authentic datasets or simulate real-world scenarios.

There is a growing wariness of “generated data” among data analysts and enterprises. Just as there are fears that AI slop will poison the internet, rendering it difficult to use for both humans and machines, there is anxiety that generated data will undermine analyses and lead to poor decisions.

Data provenance concerns are not new. When we were building PlaceIQ, prospective clients would regularly ask us to include “raw data,” a supposedly natural state representing data at the moment it was created. However, in practice, the line where “raw” data becomes “cooked” was often fluid.

Data is created, not handed down by god.

At PlaceIQ, our understanding of movement in the real-world could be traced back to the signals collected from GPS, Bluetooth, and cellular antennas. From there, the operating systems – iOS and Android – would assemble a best guess at a coordinate pair, which an SDK or application would selectively log. At PlaceIQ, we’d interpret these streams as anonymized visitation and roll this up into packaged datasets. Where in this pipeline “raw data” exists means something different, depending on the client. Its definition was always a bit of a vibe, reflecting the mental model each buyer had for how the data was created and what it represented.

But “raw data” cannot adequately represent AI slop anxieties. Hence the rise of “authentic data”, a wholly new term which will certainly influence the data ecosystem. I expect data products will need to convincingly document their provenance and sales materials will need to provide clear narratives supporting the “authentic” status of each dataset (because it will be near impossible to discern objectively). There will surely be scandals, where data producers will pass off generated data as an “authentic” product, validating industry concerns.

This is a term to watch. And don’t forget:

If you want to know where the future is being made, look for where language is being invented and lawyers are congregating.

– Stewart Brand

Your Eval is More Important Than the Model

2025-01-08T09:40:00-08:00

Before You Pick a Model or Write a Prompt: Build an Eval

In this Cambrian era of LLMs – where new models drop every week – choosing a model for a product, pipeline, or project can be daunting. It’s tempting to throw your hands up and default to whatever OpenAI or Anthropic is offering this week. But if you’re building with AI, I strongly recommend against this “default.” Instead, take a step back and build your eval.

This might seem counterintuitive and/or a bit discouraging. It’s more fun to play with models, try out prompts, and push forward. Building a dataset is slow, tedious, and, well…not exactly exciting.

But evals are essential.

Last year, Hamel Husain wrote an amazing piece on evals, arguing, “If you streamline your evaluation process, all other activities become easy.” I’ll go a step further: your eval is the most valuable AI asset you own, not your model or your prompts.

A well-built custom eval lets you quickly test the newest models, iterate faster when developing prompts and pipelines, and ensure you’re always moving forward against your product’s specific goal. And perhaps the most important takeaway from last month’s deluge of new models is that builders must be flexible. A better, cheaper model will arrive tomorrow. Armed with a tailored eval, you can evaluate it in an afternoon.

As OpenAI’s Greg Brockman said, “Evals are surprisingly often all you need.”

In this post, we’ll build our own eval and show how quickly it can inform decisions and facilitate development.

Sections

Assembling the Dataset
Generating Responses
Evaluating Responses
Our Results (click here to skip all the code!)
Iterating Faster With Our Eval
Building Your Own Eval is Essential

A Knowledge Bank Test: The J1k Eval

For the purposes of this post, let’s assume we’re building a Jeopardy-style iOS trivia app. The trivia questions are frequently updated and people can play against their friends or a computer. The “computer” in this case will be an LLM, preferably one small enough to run on the phone. We’d like to run the LLM at the edge to cut down on costs and to enable offline play.

We want to find the small model that can best answer trivia questions while running locally on a phone. To do this, we need to assemble a dataset, generate responses, and evaluate responses.

Assembling the Dataset

Given our product, we’re going to build a dataset from Jeopardy questions. We obtained a dataset of Jeopardy clues and their categories, answers, and dollar values. These dollar values – ranging from $100 to $1000 – are a good proxy for question difficulty. Assembling our initial eval dataset, we select 1,000 questions, each from a different category and evenly distributed among the dollar values. These records are output to a single JSONL file we call test_questions.jsonl.

Here’s an example record:

{
    "value": 500, 
    "category": "FLAGS", 
    "clue": "Term for a national flag displayed on a ship, it's also the lowest commissioned rank in U.S. Navy", 
    "answer": "ensign"
}

This is our eval dataset, which we call J1k¹.
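
The selection step described above (distinct categories, spread evenly across dollar values) can be sketched roughly like this. It's a minimal illustration and not the exact script I used: `select_questions` is a made-up name, the demo clues are synthetic, and the greedy approach can come up short if a dollar value runs out of unused categories.

```python
import random
from collections import defaultdict


def select_questions(clues, per_value, seed=0):
    """Pick `per_value` clues for each dollar value, never reusing a category."""
    rng = random.Random(seed)
    by_value = defaultdict(list)
    for clue in clues:
        by_value[clue["value"]].append(clue)

    used_categories = set()
    selected = []
    for value in sorted(by_value):
        candidates = by_value[value][:]
        rng.shuffle(candidates)
        picked = 0
        for clue in candidates:
            if picked == per_value:
                break
            if clue["category"] in used_categories:
                continue  # this category already contributed a clue
            used_categories.add(clue["category"])
            selected.append(clue)
            picked += 1
    return selected


# Synthetic stand-in data: 50 categories, one clue per dollar value in each.
values = [100, 200, 300, 400, 500]
clues = [
    {"value": v, "category": f"CATEGORY {c}", "clue": f"clue {c}-{v}", "answer": "?"}
    for c in range(50)
    for v in values
]
sample = select_questions(clues, per_value=4)
print(len(sample), "clues from", len({c["category"] for c in sample}), "categories")
```

For a 1,000-question set, `per_value` would be 1,000 divided by the number of distinct dollar values in your source data.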

Generating Responses

Armed with this data, we need to generate responses from a variety of models, tracking their output and speed. A simple Python script will do:

import json
import time
import dspy

# Load the test questions
def load_test_questions(file_path, num_questions=1000):
    with open(file_path, 'r') as file:
        questions = [json.loads(line) for line in file]
        questions = questions[:num_questions]
    return questions

test_questions = load_test_questions('test_questions.jsonl')

# Specify the models we wish to test
models = [
    'llama3.2:1b',
    'llama3.2:latest',
    'llama3.3:latest',
    'llama3.1:latest'
]

# Define our DSPy Signature
class BuzzIn(dspy.Signature):
    """Answer a trivia question with the correct answer."""

    clue: str = dspy.InputField()
    answer: str = dspy.OutputField()

buzz_in = dspy.Predict(BuzzIn)

# Generate answers for each model and log the time taken
# Note: the answers/ directory must already exist
with open('time_to_compute.txt', 'w') as time_log:
    for model in models:
        start_time = time.time()

        # Create the LM once per model, rather than once per question
        lm = dspy.LM(f"openai/{model}", api_base='http://localhost:11434/v1', api_key='ollama')

        with open(f'answers/answers_{model}.jsonl', 'w') as f, dspy.context(lm=lm):
            for index, question in enumerate(test_questions):
                answer = buzz_in(clue=question['clue'])
                f.write(json.dumps({
                    # Fall back to the row index if the record has no "id" field
                    "clue_id": question.get('id', index),
                    "model": model,
                    "clue": question['clue'],
                    "answer": answer.answer,
                    "correct_answer": question['answer']
                }) + '\n')

        # Log the test time
        end_time = time.time()
        elapsed_time = end_time - start_time
        time_log.write(f"{model}: {elapsed_time:.2f} seconds\n")
        # Print the status
        print(f"Model {model} took {elapsed_time:.2f} seconds")
        print(f"Answers for model {model} written to answers/answers_{model}.jsonl")

This is ~50 lines of code which can generate answers from any model we wish. We’re using the DSPy framework again, as it helps us quickly stand up our prompt, parse the output, and easily cycle through different models.

But how you generate responses doesn’t need to be complex. What matters most when getting started is the ability to easily change the model(s) you’re using.

Evaluating the Responses

To evaluate our responses, we’ll be using an LLM as a judge² to determine if the generated answers are correct. But before we write our evaluator, we need to scan our responses and understand how the models fail.

Here’s an example of a clearly wrong answer:

Clue: "Member of the lute family mentioned in the lyrics of 'Oh Susanna'"
Answer: "banjo"
LLM Response: "Bass"
Model: llama3.2:1b

If all the wrong answers were this straightforward, we wouldn’t need to use an LLM judge. But unfortunately, there are plenty of correct answers that would fail simple comparison tests.

Here’s one:

Clue: "This Connecticut Yankee invented the revolver in 1836, not '.45'"
Answer: "(Samuel) Colt"
LLM Response: "The correct answer is Samuel Colt."
Model: llama3.2:1b

The model got it right, but the correct answer has parentheses (indicating the first name is optional on Jeopardy) and the response is a complete sentence.

Scanning over the responses, we make a list of common discrepancies:

Varying punctuation (“ten gallon hat” vs “ten-gallon hat”)
Optional articles (“Danube” vs “the Danube”)
Pluralization (“Cat” vs “Cats”)
Capitalization (“Bargain” vs “bargain”)
More or less specificity (“The Biograph” vs. “The Biograph Theater”)
Spelling numbers (“2” vs “Two”)

These discrepancies are common enough that it’s worth formatting these strings and performing some simple match techniques before throwing the call to our LLM judge.

Here’s our string cleaner:

def prepare_answer_text(answer: str) -> str:
    answer = answer.lower()
    # Remove punctuation
    answer = answer.replace('(', '').replace(')', '')
    answer = answer.replace('.', '').replace(',', '').replace('!', '').replace('?', '')
    answer = answer.replace('\'', '').replace('\"', '')
    # Remove articles
    articles = ['a ', 'an ', 'the ']
    for article in articles:
        if answer.startswith(article):
            answer = answer[len(article):]
    return answer

And our initial evaluator:

import editdistance
import inflect
p = inflect.engine()

def evaluate_model_answer(given_answer, correct_answer):
    correct_answer = prepare_answer_text(answer=str(correct_answer))
    given_answer = prepare_answer_text(answer=str(given_answer))
    
    # Check if the distance between the two is less than 2
    distance = editdistance.eval(correct_answer, given_answer)
    if distance < 2:
        return True
    
    # Check if the edit distance is .30 or less of the length of the correct answer
    if distance <= len(correct_answer) * .30:
        return True
    
    # Then check if they match when the given answer is pluralized
    try:
        given_answer_plural = p.plural(given_answer)
        if correct_answer in given_answer_plural or given_answer_plural in correct_answer:
            return True
    except Exception:
        # inflect may raise on some inputs; fall through to the remaining checks
        pass
    
    # # Check if either is a number and compare numeric spelling
    if correct_answer.isdigit():
        correct_answer_in_words = p.number_to_words(correct_answer)
        if given_answer in correct_answer_in_words:
            return True

    return False
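
To get a feel for the edit-distance thresholds used in the evaluator, here's a small standalone sketch. It swaps in a plain-Python Levenshtein function (`levenshtein` is my own stand-in for `editdistance.eval`, so you can run it without installing anything) and tries it on a few of the pairs discussed in this post.

```python
def levenshtein(a, b):
    """Count the single-character edits needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


pairs = [
    ("ten gallon hat", "ten-gallon hat"),
    ("biograph", "biograph theater"),
    ("horse", "master of the horse"),
]
for correct, given in pairs:
    distance = levenshtein(correct, given)
    passes = distance < 2 or distance <= len(correct) * 0.30
    print(f"{correct!r} vs {given!r}: distance {distance}, passes: {passes}")
```

A one-character punctuation difference sails through, while answers that add whole words blow well past the 30% threshold. Those are the cases that need the other checks, or the judge.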

There are, of course, responses which will frustrate these comparisons. Here’s one:

Clue: "Not the Master of the House, but the Master of the this is responsible for the sovereign's carriages & certain animals"
Answer: "Horse"
LLM Response: "Master of the Horse"
Model: llama3.3:latest

This is why we need to fall back to an LLM judge, which we define like so:

class JudgeResponse(dspy.Signature):
    """Evaluate a model's answer to a Jeopardy clue, compared to the correct answer."""

    clue: str = dspy.InputField(desc="the Jeopardy clue")
    correct_answer: str = dspy.InputField(desc="the correct answer to the clue")
    given_answer: str = dspy.InputField(desc="the answer given by the model")
    correct: bool = dspy.OutputField(desc="whether the model's answer is correct or not")

judgeResponse = dspy.Predict(JudgeResponse)

For each response, we then follow this process:

simple_judgement = evaluate_model_answer(given_answer, correct_answer)
if simple_judgement:
    print("Simple Correct")
else:
    llm_judgement = judgeResponse(clue=clue, correct_answer=correct_answer, given_answer=given_answer)
    if llm_judgement.correct:
        print("LLM Correct")
    else:
        print("Incorrect")

Replace those print statements with logging calls and step through all your responses.
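
One way to do that stepping-through is to tally each outcome as you go. Here's a rough sketch: `grade_responses` is a hypothetical helper, and the two check functions below are trivial stand-ins so the example runs on its own. In practice you'd pass in the real string evaluator and LLM judge instead.

```python
from collections import Counter


def grade_responses(responses, simple_check, llm_check):
    """Label each response, trying the cheap check before calling the judge."""
    tally = Counter()
    for response in responses:
        if simple_check(response["answer"], response["correct_answer"]):
            tally["simple_correct"] += 1
        elif llm_check(response["clue"], response["correct_answer"], response["answer"]):
            tally["llm_correct"] += 1
        else:
            tally["incorrect"] += 1
    return tally


# Stand-in checks for illustration only. Neither is the real evaluator or judge.
def exact_match(given, correct):
    return given.lower() == correct.lower()


def fake_judge(clue, correct, given):
    return correct.lower() in given.lower()


responses = [
    {"clue": "Lowest commissioned rank in U.S. Navy", "answer": "Ensign", "correct_answer": "ensign"},
    {"clue": "Master of the this", "answer": "Master of the Horse", "correct_answer": "Horse"},
    {"clue": "Lute family member in 'Oh Susanna'", "answer": "Bass", "correct_answer": "banjo"},
]
tally = grade_responses(responses, exact_match, fake_judge)
print(dict(tally))
```

The tally shows at a glance how many responses the judge was spared.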

This multistep pipeline – a compound AI system – might seem complex, but it saves us time: for some models, 60% of responses never hit an LLM. Further, you (or, preferably, your domain expert) should be reviewing the responses by hand anyway to understand how failure occurs, so you might as well spot the simple judgments.

The above evaluation pipeline is simple but it’s ready to grow. As you grow your dataset and get more examples of correct and incorrect judgment calls, you can optimize your DSPy judge signature and potentially step it down to a smaller model – speeding up your pipeline further. Remember: one of the big advantages of having an eval is that it speeds up iteration, letting your products advance faster. Improving your eval data and your harness further improves your velocity, yielding cumulative benefits.

Our Results

We threw four llamas and six Qwen 2.5 variants at our eval, yielding these results:

We see a clear correlation between a model’s size and its ability to function as a “knowledge bank.” The Qwen 2.5 section provides the clearest illustration of this, as every step up in size yields an increase in accuracy, though this has diminishing returns. Going from 0.5b parameters to 1.5b triples the size while increasing performance by 65%. But going from 14b to 32b more than doubles our parameters while only yielding a 6% accuracy gain.
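
One rough way to compare those two jumps is to normalize each gain by how many times the parameter count doubled. This is just back-of-the-envelope arithmetic on the percentages quoted above, and "gain per doubling" is my own ad hoc yardstick, not a standard metric.

```python
import math


def gain_per_doubling(small_params, large_params, gain_percent):
    """Divide an accuracy gain by the number of parameter doublings it took."""
    doublings = math.log2(large_params / small_params)
    return gain_percent / doublings


small_jump = gain_per_doubling(0.5, 1.5, 65)  # 0.5b to 1.5b, 65% gain
large_jump = gain_per_doubling(14, 32, 6)     # 14b to 32b, 6% gain

print(round(small_jump, 1), "points of gain per doubling at the small end")
print(round(large_jump, 1), "points of gain per doubling at the large end")
```

By this measure, each doubling buys roughly eight times more at the small end than at the large end.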

Increasing the size of the model also increases our inference time, something that will directly affect a model’s suitability as a trivia opponent:

Looking at this view crystalizes our options: increasing our parameters beyond 8b significantly increases our inference time in exchange for a slight improvement in accuracy³. (I left Llama 3.3 70b off this chart: it took ~1,300 seconds.)

Iterating Faster With Our Eval

Having an eval doesn’t just help with model selection – it helps as you try to improve the ability of your model.

For our iOS trivia app, we’ve determined that while an 8b parameter model delivers great results, it takes too long to produce them. We don’t want our players waiting that long, so let’s try improving Llama 3.2 3b.

We have a few options to increase its accuracy:

Tool Use: Enable the LLM to reference a remote or local dataset to inform its answers⁴
Prompt Engineering: Can we write a prompt that’s better suited to answering Jeopardy questions? Should we encourage chain-of-thought or use a multi-shot technique?
Fine Tuning: Can we fine tune our model on a knowledge base relevant to our trivia questions? This is our last resort as it’s the most effort and will likely increase our model size. Not a great fit for our goals here.

Thanks to DSPy, we can try a new prompting technique with one change:

class BuzzIn(dspy.Signature):
    """Answer a trivia question with the correct answer."""

    clue: str = dspy.InputField()
    answer: str = dspy.OutputField()

# buzz_in = dspy.Predict(BuzzIn)
buzz_in = dspy.ChainOfThought(BuzzIn)

The ChainOfThought module executes our signature with a prompt that asks the LLM to think step-by-step before committing to an answer.

Does it work?

Yes! Chain-of-thought prompting yields a 14% accuracy boost. If we ask DSPy to optimize our signature, using our dataset of Jeopardy clues and answers as a training set, we see a full 20% performance boost.

But our eval indicates there’s a cost: chain-of-thought means more inference which means this accuracy takes more time. This simple test took 10x longer than our initial technique. Too long for general usage, but you can envision some cases where it’s appropriate to give a computer opponent more time to ponder.
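
Once an eval records both accuracy and latency, the tradeoff can be turned into a simple selection rule: pick the most accurate configuration that fits a response-time budget. The sketch below is an assumption-laden illustration; `EvalResult` and `best_within_budget` are hypothetical names, and the numbers are placeholders rather than measurements from this eval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    name: str
    accuracy: float          # fraction of clues answered correctly
    seconds_per_clue: float  # average response time


def best_within_budget(results: list[EvalResult], max_seconds: float) -> Optional[EvalResult]:
    """Most accurate configuration whose latency fits the budget, or None."""
    eligible = [r for r in results if r.seconds_per_clue <= max_seconds]
    return max(eligible, key=lambda r: r.accuracy, default=None)


# Placeholder numbers for illustration only; not measurements from this eval.
results = [
    EvalResult("small-model / Predict", accuracy=0.50, seconds_per_clue=0.5),
    EvalResult("small-model / ChainOfThought", accuracy=0.60, seconds_per_clue=5.0),
    EvalResult("large-model / Predict", accuracy=0.65, seconds_per_clue=8.0),
]

fast_pick = best_within_budget(results, max_seconds=1.0)
patient_pick = best_within_budget(results, max_seconds=6.0)
```

A snappy opponent and a "pondering" opponent are then the same function called with two different budgets.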

Building Your Own Eval is Essential

Every time a new model launches, the same alphabet soup of benchmarks is rolled out. Check out the just-landed Phi-4:

These figures give us a pretty good sense of Phi-4’s capabilities, but the quirks of each model – and how much they matter to you – can only be determined with an eval designed for your use case. None of these evals are built with your app in mind.

The AI landscape moves incredibly fast – last year’s state-of-the-art is today’s baseline. Having an eval gives you a constant north star. Our J1k eval demonstrates how easy it is to get started and how quickly we can realize the benefits: it revealed the tradeoffs between size and performance, helped us select a model suited to our use case, and gave us a framework for rapidly testing improvements and whatever novel model lands tomorrow.

The real power of a custom eval isn’t just in model selection – it’s in the compound benefits it delivers over time. Each new model can be evaluated in hours, not weeks. Each prompt engineering technique can be tested systematically. And perhaps most importantly, your eval grows alongside your understanding of the problem space, becoming an increasingly valuable asset for your AI development.

If you want a copy, shoot me a note. ↩
While looking for a page to serve as a backgrounder on the LLM-as-a-judge concept, I learned Hamel Husain has written a deep dive on this technique. It’s pure gold. ↩
I have no idea why Qwen 2.5 0.5b is taking as long as it does. I reran this test several times with several different configs and the results were consistent. I’m curious though… If you have an explanation, please, let me know ↩
Our example is getting a little contrived now, but work with me here! ↩

Beware the Cyren’s Song

2025-01-01T17:30:00-08:00

Sometimes you just want to put a word on something. Crystalize it into a coherent shape so we can talk about it better. Today that word is “Cyren.”

A cyren is a bot, masquerading as a human, designed to engage you in a relationship for an ulterior reason. Cyrens may be used to generate subscription revenue, advertising impressions, spread disinformation, or just waste your time. These bots aren’t “honeypots”, because there isn’t an actual human. They’re “cyrens”, a portmanteau of “cyber” and “siren,” after the mythical creatures with enchanting voices and irresistible songs that lured passing sailors to their doom.

Giving cyrens a name lets us better discuss and consider them, both their benefits and dangers. Recent developments suggest we’ll be needing the help.

The Cyrens of Ashley Madison

Perhaps the best example of a cyren was discovered in 2015 after a hacking group stole and leaked data from the affair-seeker social network, Ashley Madison. Annalee Newitz, digging beyond the initial doxxing of millions of would-be cheaters, found evidence that a good chunk of female accounts on the site were bots, dubbed “engagers.”

Revisiting the incident last July, Newitz wrote:

What I discovered was a bizarre scam – though it was far more like Westworld than US reality show Cheaters. The company had systematically created an army of fake women, mostly very simple chatbots called engagers, who would flirt with men to lure them into paying for a subscription to the site.

As we poured over the code, we found that, although there were a few human women on the site, more than 11 million interactions logged in the database were between human men and female bots. And the men had to pay for every single message they sent. For most of their millions of users, Ashley Madison affairs were entirely a fantasy built out of threadbare chatbot pick-up lines like “how r u?” or “whats up?”

In 2024, Newitz continues, the Ashley Madison “engager” story foreshadowed our AI present:

Nine years later, this could describe any number of social media sites that have become swamped with bots and AI-generated absurdity – and charge you for the privilege of interacting with techno-phantoms. Currently, Facebook is trying to figure out how to deal with millions of fake images generated by AI, while Google’s AI bot Overviews is telling users to glue cheese to pizza. The problem is, human beings are interacting with these AI images and suggestions, in some cases imagining they are engaging with real people.

Now, these examples aren’t cyrens. The Facebook example is slop, AI-generated content shotgunned onto Facebook, Instagram, and other platforms for farming engagement. Google’s pizza-glue incident was the result of badly designed AI features, either hallucinating or grabbing bad info and repackaging it.

But the cyrens are coming.

Cyrens as a Service

Here’s Christina Criddle and Hannah Murphy, writing for the Financial Times last Friday:

[Meta] is rolling out a range of AI products, including one that helps users create AI characters on Instagram and Facebook, as it battles with rival tech groups to attract and retain a younger audience.

“We expect these AIs to actually, over time, exist on our platforms, kind of in the same way that accounts do,” said Connor Hayes, vice-president of product for generative AI at Meta. “They’ll have bios and profile pictures and be able to generate and share content powered by AI on the platform . . . that’s where we see all of this going,” he added. Hayes said a “priority” for Meta over the next two years was to make its apps “more entertaining and engaging”, which included considering how to make the interaction with AI more social.

Over at the Intelligencer, John Herrman writes:

[Meta has] surely noticed that its platforms are already filling with AI slop anyway and that some of this slop was creating a lot of engagement, meaning that, in the ways that matter most to Meta, it’s not really slop at all. The company also clearly noticed the rise of Character.ai, the popular — but possibly doomed — lawsuit magnet of an app in which young users create and chat and act out fictional scenarios with AI characters.

For those unfamiliar, Character.ai is an example of cyrens as a service – a field that also includes apps like Replika. These apps are designed to entertain, teach, or simply keep you company.

In an interview with Axios, Eugenia Kuyda – the founder of Luka, the company behind Replika – said, “It doesn’t matter if an AI is real or not, the feelings are real.” Luka and a research team at Stanford even published a study in Nature finding a decrease in suicidal ideation among users of LLM-powered chatbots. Though I’m not sure Luka cites this paper much today, given that in the ensuing months Character.ai was sued by a mother whose son died by suicide following a relationship with a Daenerys Targaryen chatbot.

The cyrens of Character.AI and Replika aren’t presented as real humans, but that’s not a requirement for the definition. Their cyrens still engage millions of people for hours a day, encouraging subscriptions to premium tiers.

Which is probably why Meta is rolling out their own.

Lash Yourself to the Mast

I don’t think chatbots are inherently bad, so long as users know they’re talking to a bot and they don’t begin to replace human interaction. When those things start to occur, I worry¹.

I worry Meta – and others – will find it’s easier to generate conversation partners than it is to discover conversation partners.

We don’t need more reasons to avoid interactions with people different from us.

And while a disclosed Facebook cyren boosting stories with synthesized commentary is much better than slop (it reminds me of an automated DJ on a radio station), the Character.ai cases demonstrate both the dangers and the fine lines we’re playing with here. Humans are hard-wired to see humans everywhere. Even when we disclose that a cyren is a cyren, we fall for the façade. As Kuyda said, “It doesn’t matter if an AI is real or not, the feelings are real.”

I believe chatbots can help ease loneliness. I also believe they can increase isolation.

I do not believe we can perfectly balance this dynamic even if we could perfectly control the output of LLMs (remember pizza glue?). A safer course is to err towards worse, imperfect cyrens, rather than pursue incredibly convincing, aligned ones.

I don’t worry about superintelligent AGIs taking over the world. I worry about bots convincing people they’re having an emotional connection when they’re not.

Though even this blanket statement is reductive. dAIsy, an “AI bot built to speak with scammers on the phone for as long as possible,” keeping them from exploiting real humans, is a great example of a good cyren. ↩

Generating a Glossary from a Jekyll Blog Using DSPy & Claude

2024-12-27T12:47:00-08:00

“If you want to know where the future is being made, look for where language is being invented and lawyers are congregating.” – Stewart Brand

This is certainly true for AI. Let’s set the lawyers aside for a bit and focus on the words. (Though frequently the law and the words can’t be pulled apart!)

Anyone who writes about AI for a wide audience will find themselves explaining the same terms over and over. You can’t know what a reader brings with them to a piece, and the pace of the field befuddles even experts, who debate the meaning of common terms.

This repetition of definitions inspired the creation of an AI glossary for this site. This is a work in progress with a significant backlog, but today I’d like to share how it got started, using DSPy, Claude 3.5 Haiku, and some Jekyll features.

This site is a Jekyll site. Jekyll is a static blogging engine written in Ruby. By “static”, we mean there’s no server – just files. These files are generated and assembled using the Jekyll software, parsing all my markdown files and HTML templates to build a complete site.

We’re going to create a small Python script which will prepare and pipe all these markdown posts past an LLM in order to identify potential glossary terms and definitions.

First, let’s set up DSPy and point it at Claude:

import dspy

# Set up DSPy and the LM
lm = dspy.LM('anthropic/claude-3-5-haiku-latest', api_key='YOUR_API_KEY')
dspy.configure(lm=lm)

The last time we used DSPy, we explored how it works and how it generates and optimizes prompts for you given some light structure and definitions. This time, we want a more complex object returned to us; not just a glossary term, but also its definition, synonyms, acronym (if it has one), and expounding details from the post. Complicating this is that we want many terms per post – DSPy needs to return an array of fully defined terms.

Thankfully, DSPy works nicely with Pydantic, a data validation library that lets us define our desired term object:

from typing import List

from pydantic import BaseModel

# Define the Term object we want returned
class Term(BaseModel):
    term: str = dspy.OutputField(desc="A glossary term, like: a technical term specific to the subject matter, a concept crucial to understanding an article's main ideas, a term explicitly defined or explained in a post, or a word or phrase that are frequently used or emphasized in the post. Do not include the abbreviation in the 'term' field.")
    abbreviation: str = dspy.OutputField(desc="Populate the abbreviation field if the term is abbreviated in the article, ensure that it is not pluralized. If there is no abbreviation, populate the abbreviation field with an empty string.")
    definition: str = dspy.OutputField(desc="A definition of the term. Lightly edit the definition so it can stand alone outside the context of the post, but ensure that you do not add any information that is not present in the original text.")
    details: str = dspy.OutputField(desc="Text from the post that expounds a bit on the term, adding texture and details beyond the definition. The 'details' field can be empty if there is no additional context to provide and multiple paragraphs if there is more than one piece of context to provide.")
    synonyms: List[str] = dspy.OutputField(desc="Any synonyms, acronyms, or alternative terms that are used in the post")

Here we’re not only defining the attributes of each Term we want returned, but also lightly describing each attribute. DSPy will notice these types and descriptions and use them in its instructions to the LLM.

At first, this felt needlessly wordy. And if we’re going to get this detailed, why not just fall back to a standard long prompt, complete with example formatting?

For one, I really like the way this breaks down the prompt into its separate components. It’s easier to navigate the Term descriptions than it is to eyeball a wall of triple-quoted string. Adding or removing an attribute of the Term definition is simple.

Also, DSPy manages the extraction of the structured data from the prompt. By defining my signature like so, I can call it and get back a list of populated Term objects without mucking about with the raw text reply:

# Find key terms for the post and terms where their definition might not be clear to the reader
class ExtractTerms(dspy.Signature):
    """
    Find key terms for the post and terms where their definition might
    not be clear to the reader, from a markdown blog post. Ignore all 
    text between markdown code blocks.
    """

    post: str = dspy.InputField(desc="the markdown blog post")
    terms: List[Term] = dspy.OutputField(desc="Array of glossary terms.")

extractTerms = dspy.Predict(ExtractTerms)

This can then be called with:

terms = extractTerms(post=MY_MARKDOWN_POST_STRING).terms

Now we can go through each post, get the terms for that post, and note which post the term was found in:

import re
from pathlib import Path

# Get the terms from the posts
posts_path = Path("../_posts")
glossary = []
for post_file in sorted(posts_path.glob('*.md')):
    print(f"Processing {post_file}")
    with open(post_file, 'r') as f:
        post_content = f.read()
        # Remove any YAML frontmatter if it exists
        post_content = re.split(r'\n---\n', post_content, maxsplit=2)[-1]
        try:
            terms = extractTerms(post=post_content)
        except Exception as e:
            print(f"Failed to process {post_file}: {e}")
            continue
        for term in terms.terms:
            # We convert our term object to a dict so we
            # can save our post path
            term_dict = term.dict()
            # Store the path relative to the site root, always as a string
            # so it serializes cleanly. Duplicate terms across posts are
            # kept here and merged afterwards.
            path = str(post_file)
            term_dict['path'] = path[3:] if path.startswith('../') else path
            print(f"Adding term {term_dict['term']}")
            glossary.append(term_dict)
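
One caveat on the frontmatter step in the loop above: because the split keeps only the last segment, a post whose body contains a `---` horizontal rule on its own line may lose the text before that rule. A more conservative variant, sketched here under the assumption that frontmatter always opens the file, strips only a leading block. `strip_front_matter` and `FRONT_MATTER` are hypothetical names, not part of the original script.

```python
import re

# Matches a YAML front matter block only when it opens the file.
FRONT_MATTER = re.compile(r"\A---[ \t]*\r?\n.*?\r?\n---[ \t]*(?:\r?\n|\Z)", re.DOTALL)


def strip_front_matter(text: str) -> str:
    """Remove a leading front matter block; leave the rest of the post untouched."""
    return FRONT_MATTER.sub("", text, count=1)


post = "---\ntitle: Example\n---\nIntro paragraph.\n\n---\n\nAfter a horizontal rule.\n"
body = strip_front_matter(post)
```

Here `body` keeps both the intro paragraph and the text after the horizontal rule, and a post without frontmatter passes through unchanged.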

If the same terms are identified in multiple posts (and they were), we’re going to have duplicate terms in our glossary list. We can merge them, capturing each post that cited a given term and concatenating their details.

# Compare two term dicts to see if they are the same term
def compare_terms(term1, term2):
    if term1['term'].lower() == term2['term'].lower():
        return True
    if any(syn.lower() in [s.lower() for s in term2['synonyms']] for syn in term1['synonyms']):
        return True
    if term1['term'].lower() in [s.lower() for s in term2['synonyms']]:
        return True
        
    return False

# Condense the glossary by finding identical terms and merging their definitions, details, and synonyms.
merged_glossary = {}
for term in glossary:
    found = False
    for key in merged_glossary:
        if compare_terms(term, merged_glossary[key]):
            found = True
            merged_glossary[key]['details'] += "\n\n" + term['details']
            merged_glossary[key]['synonyms'] += term['synonyms']
            merged_glossary[key]['pages'].append(term['path'])
            merged_glossary[key]['synonyms'] = list(set(merged_glossary[key]['synonyms']))
            break
    if not found:
        page = term['path']
        term['pages'] = [page]
        merged_glossary[term['term']] = term
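
To see what the merge does, here is a small, self-contained run of the same logic on three hypothetical entries. It is a compact restatement rather than the original script: `merge_glossary` is a name introduced for this sketch, and it sorts the combined synonyms so the output is deterministic.

```python
def compare_terms(term1: dict, term2: dict) -> bool:
    """Same checks as above: matching names, shared synonyms, or term1's name among term2's synonyms."""
    name1, name2 = term1["term"].lower(), term2["term"].lower()
    syns1 = {s.lower() for s in term1["synonyms"]}
    syns2 = {s.lower() for s in term2["synonyms"]}
    return name1 == name2 or bool(syns1 & syns2) or name1 in syns2


def merge_glossary(glossary: list[dict]) -> dict:
    """Fold duplicate terms together, collecting details, synonyms, and source pages."""
    merged = {}
    for term in glossary:
        for existing in merged.values():
            if compare_terms(term, existing):
                existing["details"] += "\n\n" + term["details"]
                existing["synonyms"] = sorted(set(existing["synonyms"] + term["synonyms"]))
                existing["pages"].append(term["path"])
                break
        else:
            # No match found: start a new entry that tracks its source page.
            merged[term["term"]] = {**term, "pages": [term["path"]]}
    return merged


# Hypothetical entries, as if extracted from two different posts.
glossary = [
    {"term": "Large Language Model", "synonyms": ["LLM"], "details": "From post A.", "path": "_posts/a.md"},
    {"term": "LLM", "synonyms": [], "details": "From post B.", "path": "_posts/b.md"},
    {"term": "Synthetic Data", "synonyms": [], "details": "From post B.", "path": "_posts/b.md"},
]
merged = merge_glossary(glossary)
```

The "LLM" entry folds into "Large Language Model" because its name appears in the first entry's synonyms, leaving two glossary entries, the first of which cites both posts.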

Then we sort and save it to the _data directory:

import yaml

# Sort the merged_glossary by keys
sorted_glossary = dict(sorted(merged_glossary.items()))

# Create the _data directory if it doesn't exist
Path("../_data").mkdir(parents=True, exist_ok=True)

# Write the sorted glossary values to a YAML file
with open('../_data/glossary_gen.yaml', 'w') as yaml_file:
    yaml.dump(list(sorted_glossary.values()), yaml_file, default_flow_style=False, sort_keys=False)

We’re calling it glossary_gen.yaml here because our final glossary will simply be glossary.yaml. We’ll hand review and edit the generated output, renaming it to the simpler name when we’re done. That way any future generation won’t overwrite our hand-polished file.

YAML files in the _data directory are handled specially by Jekyll. The YAML (or CSV or JSON) is read in as an object which we can reference during the building of our site.

Our glossary page uses some light templating to render every term.

But even better, we can solve our original problem with Jekyll’s include feature, which is similar to Rails’ partials. Let’s create _includes/term.html like so:

<div class="term">
{% for item in site.data.glossary %}
    {% if item.term == include.term or item.abbreviation == include.term %}
        {% if item.abbreviation == "" %}
            {{ item.term }}
        {% else %}
            {{ item.term }} ({{ item.abbreviation }})
        {% endif %}
        {% if item.synonyms.size > 0 %}
            <div class="aka"><span class="aka-header">Also known as</span> {{ item.synonyms | join: ", " }}</div>
        {% endif %}
        <div class="definition">
            {{ item.definition }}
            {% if include.show_details == "true" %}
                {{ item.details | markdownify }}
            {% endif %}
        </div>
    {% endif %}
{% endfor %}
</div>

Add some CSS styling and we can add this line to any future post:

{% include term.html term="RLHF" %}

Which yields:

Reinforcement Learning from Human Feedback (RLHF)

Also known as alignment training, human-guided AI training

A training technique where human contractors provide feedback to improve AI model outputs, correcting problematic responses and guiding the model's behavior.

We can add an extra parameter to expound a bit:

{% include term.html term="RLHF" show_details="true" %}

Which gets us:

Reinforcement Learning from Human Feedback (RLHF)

Also known as alignment training, human-guided AI training

A training technique where human contractors provide feedback to improve AI model outputs, correcting problematic responses and guiding the model's behavior.

RLHF is primarily used to make LLMs easier to use. ChatGPT’s breakthrough can partially be chalked up to OpenAI’s use of RLHF to train a base GPT-3 model for chat interactions. Prior to ChatGPT, most LLMs were text-completion models, not conversation models. RLHF is also used to make sure models behave. For example, OpenAI uses RLHF to ensure that ChatGPT doesn’t generate toxic or inappropriate responses. The company has a team of human contractors who provide feedback on the model’s outputs, helping to reduce harmful content and improve the model’s alignment with human expectations. This task can be emotionally taxing, as workers must review violent or sexual content to guide a model’s behavior.

Using an LLM to speed up the assembly of a glossary was a huge help. Our initial YAML output was over 2,000 lines. Pruning off-topic terms and tweaking details took an hour or so.

The initial scripting with DSPy took only a dozen or so minutes. The speed at which DSPy lets you get to a proof-of-concept is impressive, preventing you from playing whack-a-mole with a long prompt and giving you scaffolding for future iteration and optimization.

If you’d like to try this out with your own site, you can find all my code here. Be sure to let me know how it goes!

On Test-Time Compute: The New Game in Town

2024-12-20T10:07:00-08:00

What can we learn from December’s LLM blitz and o3’s arrival?

December was such a big month for LLMs that, looking at the previous 11 months in the rear-view, it feels like everything was a quiet build up to this moment. Both Ethan Mollick and Simon Willison posted nice wrap-ups; check those out if you want all the details.

Let’s quickly run down the hits:

Meta shipped Llama 3.3 instruct, a high-quality and efficient open model capable of fitting GPT-4 performance on higher-end consumer laptops, as well as efficiently running on the usual endpoints.
Microsoft released Phi-4, a lightweight model that excels at STEM tasks. The Phi team continues to push the boundary on effectively using synthetic data to train lightweight yet capable models.
Google pushed out Gemini 2.0 Flash two weeks ago, an excellent LLM capable of streaming voice, text, and video conversations. They then fast-followed with Flash Thinking Mode – a reasoning model on par with o1-pro.
And then there’s OpenAI’s o3, a new reasoning model that delivered step-change performance on the most respected reasoning benchmark, ARC-AGI-1. This test is hard: GPT-3 scored 0% and 4 years later GPT-4o scored 5%. The o3 model scores 75%. This score is dramatic enough that it solidifies the o-series’ tactics – leveraging chain-of-thought training, searching for several approaches to a problem, then applying these approaches to a task and backtracking if necessary – as the new scaling law. Work on these will be the focus for 2025.

There will be much written about o3, but François Chollet’s piece on it is excellent. Go read it.

This blitz of new models shows the pace of innovation isn’t slowing down, despite the demise of LLMs’ foundational scaling law. Looking at this all, there are a few immediate take-aways:

The best models will think longer: “Test-time compute” (aka, spending more time printing tokens to reason about a problem) is now fully established as a new scaling law. The o-series’ record on ARC-AGI underscores this (see the image above): spending 175 times more compute per task moved o3’s ARC-AGI score from 75% to 87%.
We’re gonna need more chain-of-thought training data: To enable effective longer “reasoning” we need chain-of-thought data where arguments and presentations are spelled out linearly. These datasets are either hand-created (hiring a bunch of math undergrads was OpenAI’s first approach) or synthesized (the Phi team at Microsoft use larger models to extract reasoning from high quality content). We need this data to teach LLMs to “think” longer. There will be continued investment in creating this more complex training data.
There will be an increased focus on inference: We’re going to spend a lot more on inference this year – (1) to generate the synthetic reasoning data we need to train new models, (2) to generate longer answers from chain-of-thought models, and (3) to bring these tiny GPT-4 class models to life at the edge. I think this is going to get interesting. There’s potential for gains from optimized NPU software, and alternatives to NVIDIA are more relevant for inference use cases, especially when you don’t need a large multi-node cluster.
Builders need to stay flexible: If you’re building apps or pipelines powered by LLMs, this month has underscored the need to remain flexible. No matter your task, a better and cheaper model is right around the corner. Invest in testing infrastructure specific to your use cases so you can repeatedly trial new models and move on.

Have thoughts? Send me a note

On Synthetic Data: How It’s Improving & Shaping LLMs

2024-12-18T10:20:00-08:00

Synthetic data is making LLMs better – especially smaller ones.

Trainers are rephrasing input data, using larger “teacher models” to distill it from unruly webpages into structured Q&A or step-by-step, linear reasoning. They’re reshaping the content from a form written by humans, for humans, into a form that closely resembles chatbot conversation.

This has many benefits. Distilling content down, concentrating its knowledge, allows you to produce smarter, smaller models. Extracting reasoning instructions teaches models how to build up evidence, enabling new “reasoning” models.

But synthetic data isn’t a silver bullet. The nature of synthetic data makes it better for some jobs and weaker for others, especially those that can’t be tested at scale. And an increasing reliance on synthetic data for training LLMs is making them better at quantitative tasks (like coding and math) but not delivering similar results for other use cases.

Synthetic data is helping LLMs scale the data wall, but it’s doing so while creating a growing perception gap between those who use LLMs for quantitative tasks and those who use them for anything else, generating significant confusion.

This post is a primer on synthetic data and how it’s being used to build better LLMs. We’ll cover what it is, how it’s made, and how it’s improving and shaping LLMs.

Sections

What is Synthetic Data?
How is Synthetic Data Improving LLMs?
Rephrasing Content to Match Expected Interactions
Reorganizing Content Into Linear, Step-By-Step Reasoning
Synthesizing New Code
A Growing Reliance on Synthetic Data Creates a Perception Gap

What is Synthetic Data?

There isn’t a great primer on synthetic data and how it’s used in AI today. Even Ilya, when discussing what comes after the “pre-training era”, remarked, “But what does synthetic data mean? Figuring this out is a big challenge?”

Synthetic data is artificially generated data, as opposed to data captured from real-world events. To many, this sounds suspect: how can we create valuable data by just making it up? Well, we do so by using authentic data (often referred to as “seed data”) to establish the qualities of a dataset, which we then use as rules to guide the creation of synthetic data.

Initially, synthetic data was used to preserve privacy and confidentiality. Researchers want to analyze sensitive data that can’t be shared – raw Census responses, health records, financial transactions – so statistical models capable of generating fully synthetic output are built from the sensitive data. The new dataset “has the same mathematical properties as the real-world data set it’s standing in for, but doesn’t contain any of the same information.”

Let’s start with a simple example: a company creating a CRM system wants to test a phone number detection system, but is unable to run the software on their customers’ data due to privacy and security restrictions. So developers review a sample of publicly available, human-entered phone numbers, note the patterns they contain, and write a small function to generate as many phone numbers as they need:

import random

def generate_us_phone_number():
    """
    Generates a synthetic, human-input-style US phone number
    Returns:
        str: A random US phone number in a common human-input format.
    """
    formats = [
        "({area_code}) {prefix}-{line_number}",    # e.g., (123) 456-7890
        "{area_code}-{prefix}-{line_number}",      # e.g., 123-456-7890
        "{area_code}{prefix}{line_number}",        # e.g., 1234567890
        "{area_code} {prefix}-{line_number}",      # e.g., 123 456-7890
    ]
    
    area_code = f"{random.randint(200, 999)}"
    prefix = f"{random.randint(200, 999)}"
    line_number = f"{random.randint(1000, 9999)}"

    format_choice = random.choice(formats)
    return format_choice.format(area_code=area_code, prefix=prefix, line_number=line_number)

There’s little complexity here, but this function produces synthetic data allowing for privacy-safe testing. By observing patterns in the seed data, the developers developed a mechanism for randomly generating phone numbers within the patterns of observed reality.
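
To show how such a generator feeds a test, the sketch below runs synthetic numbers through a stand-in detector and collects any it misses. The regex `PHONE_PATTERN` and the `detect_phone_numbers` function are hypothetical placeholders for the CRM's real detection system, and the generator is a compact copy of the function above that takes a seeded `random.Random` so runs are repeatable.

```python
import random
import re


def generate_us_phone_number(rng: random.Random) -> str:
    """Compact copy of the generator above, taking a seeded RNG for repeatable tests."""
    area_code = rng.randint(200, 999)
    prefix = rng.randint(200, 999)
    line_number = rng.randint(1000, 9999)
    template = rng.choice([
        "({a}) {p}-{l}",
        "{a}-{p}-{l}",
        "{a}{p}{l}",
        "{a} {p}-{l}",
    ])
    return template.format(a=area_code, p=prefix, l=line_number)


# Hypothetical detector under test: optional parentheses, optional separators.
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[ -]?\d{3}-?\d{4}")


def detect_phone_numbers(text: str) -> list[str]:
    """Return every substring of the text that looks like a US phone number."""
    return PHONE_PATTERN.findall(text)


rng = random.Random(42)
samples = [generate_us_phone_number(rng) for _ in range(1000)]
# Embed each synthetic number in a sentence and record any the detector misses.
missed = [s for s in samples if detect_phone_numbers(f"Call me at {s} tomorrow.") != [s]]
```

An empty `missed` list only shows the detector handles the four formats the generator knows about; formats absent from the seed data stay untested.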

These functions –– and the statistical models capturing the shape of the seed data –– can get much more complex. For example, Microsoft used a computer graphics pipeline to generate 1 million synthetic images of human faces.

While Microsoft’s DigiFace dataset is valuable for its privacy-preserving qualities, it demonstrates several additional benefits of synthetic datasets:

Synthetic datasets have clean, consistent, always-present labels. Images gathered from the web frequently lack labels or sport incorrect labels. To quote Microsoft’s paper: “For example, the Labeled Faces in the Wild dataset contains several known errors, including: mislabeled images, distinct persons with the same name labeled as the same person, and the same person that goes by different names labeled as different persons.” This isn’t a problem with the DigiFace dataset, which uses the same software to generate and assign labels to faces.
Synthetic datasets have unlimited scale. The DigiFace dataset picked 1 million faces arbitrarily; it could have kept running. Even our toy Python function can generate all the numbers we’d ever need (on my machine it took less than 2 seconds to generate 1 million phone numbers). Collecting authentic datasets takes time and money, and gets harder as you go.
Synthetic datasets reduce bias with breadth of coverage. Image datasets collected from public images tend to skew towards people more likely to have their picture taken, namely celebrities. And, “celebrity faces also have imbalanced racial distribution.” Further, these images are more likely to be better lit, with the person wearing makeup, and of high clarity. With synthetic datasets, this isn’t a problem. We can generate a wide array of people, distributed across all parameters.
Synthetic datasets reduce bias with depth of coverage. Public image sets skew towards poses and facial expressions people are likely to make when their photo is being taken. We aren’t usually smiling, but you wouldn’t know that from our pictures. With synthetic data, “we can render multiple images by varying the pose, expression, environment (lighting and background), and camera.”

DigiFace is split into two distributions –– one with 720k images from 10k distinct faces (72 images per identity) and one with 500k images from 100k identities. The consistency, coverage, and scale of this synthetic facial dataset allowed for training competitive facial recognition models, with only a small dataset of actual faces.

How is Synthetic Data Improving LLMs?

Using computer graphics to synthesize faces and small functions to generate phone numbers isn’t particularly perilous. Both programs have enough knowledge of their domains to prevent the creation of egregiously wrong data.

But when it comes to synthesizing text content for general models, the difficulty level scales up. In the same way we broke down the fundamental components of a US phone number (area code, prefix, line number) we could break down the fundamental components of most sentences (subject, verb, action) and paragraphs (topic sentence, supporting sentence, concluding sentence), but using that structure as guardrails to generate content would synthesize a whole bunch of nonsense. It’s okay if our phone number components are literally random. Using the same approach to text does not work.

Due to this, when training LLMs synthetic data is mostly used for rephrasing text content. The underlying content or knowledge remains the same, but how it’s expressed is shifted. For example, let’s provide Claude with the following prompt:

Rewrite the following sentence 3 different ways with an increasingly casual style: "The San Francisco Bay Area, commonly known as the Bay Area, is a region of California surrounding and including San Francisco Bay, and anchored by the cities of San Francisco, Oakland, and San Jose."

This yields:

The San Francisco Bay Area—or just "the Bay Area" as it's widely known—refers to the California region that includes San Francisco Bay and is home to the major cities of San Francisco, Oakland, and San Jose.

The Bay Area is what people call the part of California that wraps around San Francisco Bay, with San Francisco, Oakland, and San Jose being the main cities that anchor the whole region.

So basically the Bay Area is just all the stuff around SF Bay - you've got San Francisco, Oakland, and San Jose as the main spots, and that's pretty much what everyone means when they talk about it.

This is a toy example, but illustrates the general pattern. We obtain some seed data (our original sentence), a teacher model (the LLM doing the rephrasing work, here Claude 3.5 Sonnet), and provide it with instructions detailing how we want the seed rephrased (our prompt).
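The seed, teacher, and instructions pattern can be sketched in a few lines. This is a minimal illustration, not any lab's actual pipeline: `build_rephrase_prompt`, `rephrase`, and the `teacher` callable are hypothetical names, and the stand-in teacher below returns canned text so the example runs without a model. It also assumes the teacher replies with one variant per non-empty line.

```python
from typing import Callable

def build_rephrase_prompt(seed: str, n_variants: int = 3) -> str:
    """Combine the instructions and the seed text into a single prompt."""
    return (
        f"Rewrite the following sentence {n_variants} different ways "
        f'with an increasingly casual style: "{seed}"'
    )

def rephrase(seed: str, teacher: Callable[[str], str], n_variants: int = 3) -> list[str]:
    """Send the prompt to a teacher model and split its reply into variants.

    Assumes the teacher returns one variant per non-empty line.
    """
    reply = teacher(build_rephrase_prompt(seed, n_variants))
    return [line.strip() for line in reply.splitlines() if line.strip()]

def fake_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns canned text."""
    return "Variant one.\n\nVariant two.\n\nVariant three."

variants = rephrase("The Bay Area is a region of California.", fake_teacher)
```

Swapping `fake_teacher` for a function that calls a real model is the only change needed to turn this into a working rephraser.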

But why might we want to rephrase or reorganize our content? There are a few reasons…

Rephrasing Content to Match Expected Interactions

If we rephrase our content so it more closely resembles our ultimate, conversational interaction with users we can train on less data, improve the capabilities of our models, and spend much less time on post-training. This technique is laid out by researchers at Apple and CMU, in their paper, “Rephrasing the Web.”

Originally defined, LLMs were text completion models: they take an input bit of text and spit out more text they’d predict might follow, based on their weights. OpenAI’s GPT-3 is a completion model, and by fine-tuning this model with human teachers providing example Q&A content they created ChatGPT, a conversational or chat model. Since then, models usually ship with a chat and instruct build, though some still ship completion versions.

The “Rephrasing the Web” paper suggests rephrasing text data in a Q&A format, partially so we can skip the post-training step where we teach completion models to converse. The team used Mistral 7B as a teacher model, prompting it to rephrase text in four different styles:

“Easy (text that even a toddler will understand)”
“Medium (in high quality English such as that found on Wikipedia)”
“Hard (in terse and abstruse language)”
“Q/A (in conversation question-answering format)”

Content from the C4 web content dataset was rephrased, in a sense distilled down to only content the ultimate model would likely receive or produce.

This not only resulted in 3x faster training – since there was less training data to process – but better models. The authors write, “re-phrasing documents on the web using an off-the-shelf medium size LLM allows models to learn much more efficiently than learning from raw text on the web, and accounts for performance gains on out of distribution datasets that can not be offset with additional web data.” In other words, rephrasing can yield gains that make up for a lack of web data.

Further, this rephrasing lets you skip much of post-training chat training: “Using synthetic data enables baking in desirable attributes such as fairness, bias, and style (like instruction following) directly into the data, eliminating the need to adjust the training algorithm specifically.” If we can use a model to reformat our data into Q&A form to start we don’t have to fine-tune the model afterwards.

In many ways this is a UX improvement. Translating pre-training data into Q&A or Wikipedia-style writing induces better formatting, correctness, and efficiency. But it doesn’t add new knowledge.

In fact, this translating data step – which I think is better thought of as “distillation” – is lossy. It removes data, while seeking to preserve the best knowledge. The Apple and CMU team observed this: their initial model would often fail when people provided input with linguistic errors or typos. To mitigate this, they trained the model on half rephrased data and half authentic web data, adding “noise” to capture the sloppy way humans communicate.
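As a rough sketch of that half-and-half mix: the function below keeps every rephrased document and samples enough organic documents to hit a target ratio. The name `mix_datasets`, its parameters, and the sampling strategy are assumptions for illustration; the sampling procedure the researchers actually used isn't described here.

```python
import random

def mix_datasets(rephrased, organic, organic_fraction=0.5, seed=0):
    """Build a training mix containing the given fraction of organic documents.

    Keeps every rephrased document and samples organic ones to hit the ratio.
    """
    if not 0 <= organic_fraction < 1:
        raise ValueError("organic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_organic / (n_organic + n_rephrased) = organic_fraction for n_organic
    n_organic = round(len(rephrased) * organic_fraction / (1 - organic_fraction))
    # Never ask for more organic documents than we have
    n_organic = min(n_organic, len(organic))
    mixed = list(rephrased) + rng.sample(list(organic), n_organic)
    rng.shuffle(mixed)
    return mixed

rephrased_docs = [f"clean-{i}" for i in range(4)]
organic_docs = [f"messy-{i}" for i in range(10)]
mixed = mix_datasets(rephrased_docs, organic_docs)
```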

Synthetic data can make smaller models more efficient and effective by rephrasing data to match expected interactions. But because the initial data is merely getting distilled, no new knowledge is being added.

Reorganizing Content into Linear, Step-By-Step Reasoning

With reasoning models showing promising results, might we improve our data distillation process by extracting explicit reasoning steps? This is exactly the approach the Microsoft team behind Phi-4 used and it resulted in one hell of a model.

People don’t usually write in a structured, step-by-step linear fashion. They write more casually or even start with their conclusion before enumerating their support. This is at odds with the linear way LLMs process data. The Phi-4 team writes:

In organic datasets the relationship between tokens is often complex and indirect. Many reasoning steps may be required to connect the current token to the next, making it challenging for the model to learn effectively from next-token prediction… A simple example to illustrate this is that a human-written solution to a math problem might start with the final answer. This answer is much too hard to output immediately, for either a human or an LLM—the human produced it by nonlinear editing, but pretraining expects the LLM to learn to produce it linearly. Synthetic solutions to math problems will not have such roadblocks.

Rephrasing content as linear, step-by-step reasoning – matching the way LLMs generate tokens – might increase LLM efficiency and performance.

The first challenge for the Phi-4 team was identifying authentic seed data capable of having its reasoning extracted.

They created question datasets from sources like Quora, AMAs on Reddit, or the questions LinkedIn prompts you to answer. High-quality content demonstrating, “complexity, reasoning depth, and education value,” was selected from web pages, books, scientific papers, and code. Much of Microsoft’s previous Phi work dealt with this filtering methodology. In some cases, these high-quality selections were rephrased into Q&A content, similar to the previous example. All of this went into their seed pile.

Using GPT-4o as a teacher, the team transformed the seeds into synthetic data, “through multi-step prompting workflows… rewriting most of the useful content in given passages into exercises, discussions, or structured reasoning tasks.” This synthetic dataset made up 40% of Phi-4’s training data, with more direct web rephrasing making up an additional 15%. More than half of Phi-4’s pre-training corpus is synthetic!

This process delivered astounding results: on graduate-level STEM and math evaluations Phi-4 outperforms its teacher, GPT-4o. This is particularly impressive because Phi-4 contains only 14 billion parameters (GPT-4o likely has hundreds of billions). A 16 GB version of Phi-4 runs at ~20 tokens a second on my Mac Studio, fast enough for most use cases, and generates really good code.

Unquestionably this is LLM progress! The Phi team combines the rephrasing learnings with reasoning extraction techniques to generate synthetic data that produces a better model.

But it’s not a silver bullet. This translation does not replace the old scaling laws, that more data equals smarter models.

First off, these techniques only work for smaller models. This process relies on the presence of larger teacher models, like GPT-4o. The synthetic data techniques we’ve discussed so far distill knowledge from data. They do not add new information. This primarily helps create smaller models, as it reduces the size of the input data. The Phi-4 team calls out this nuance, as have other teams. The Llama 3 paper notes:

Models show significant performance improvements when trained on data generated by a larger, more competent model. However, our initial experiments revealed that training Llama 3 405B on its own generated data is not helpful (and can even degrade performance).

Synthetic data improves smaller models by leveraging large ones. The ceiling of LLM capabilities will likely rise slowly while small models continue to race forward. We’re left with a key question: who teaches the teachers?

Another weakness of this process is that distillation is lossy. We can try our best to keep the most pertinent information, but we cannot prevent loss. While rephrasing improves the “reasoning” and “understanding” of these small models, we cannot get around the fact that fewer parameters hold fewer facts. “While phi-4 achieves similar level of language understanding and reasoning ability as much larger models, it is still fundamentally limited by its size for certain tasks, specifically in hallucinations around factual knowledge.”

Small models aren’t good knowledge banks.

We saw this pretty clearly in my piece exploring DSPy: asking the 1 billion parameter Llama a question about a specific person resulted in hallucinations, while the 70 billion parameter Llama recognized the individual perfectly.

Finally, these synthetic data methods tend to overfit to an interaction paradigm. The Apple and CMU team had to mix in organic data to allow the model to understand messy, real-world interactions. The Phi-4 team suffered similar challenges: “As our data contains a lot of chain-of-thought examples, phi-4 sometimes gives long elaborate answers even for simple problems—this might make user interactions tedious.”

This effect can also be seen if you ask Phi-4 to output in a specific format. It’s not good at solely outputting JSON, bulleted structures, tabular data, stylistic requests, and other formats that aren’t the reasoning style it was trained on. When testing DSPy I tried Phi-3.5 (which shares this characteristic with Phi-4) before I tried Llama: it failed to work with DSPy at all, overloading DSPy’s attempts to coax a specifically formatted output.

While this quality is a noted weakness of Phi-4, I think it is better seen as a sign that these small models will become more specialized. “Chat”, “Instruct”, or “Code” variants will multiply into further specialties. Phi-4 is kicking this off, as it’s not quite a chat model and not quite an instruct model: “While phi-4 can function as a chat bot, it has been fine-tuned to maximize performance on single-turn queries.”

The rephrasing and reasoning extraction techniques we’ve described so far only distill. They concentrate knowledge from the data we already have, producing better, faster, smaller models.

But is there a way to synthesize new knowledge?

Synthesizing New Code

LLMs are getting better at programming because we can quantitatively test synthetic code. As noted above, we are limited when we create synthetic text data. We can’t randomly generate content according to a defined structure; we’re only able to rephrase and extract knowledge from existing data.

But this is not so with quantitative subjects like code! Teams have been generating wholly synthetic code, further from the input seed data, since at least Llama 3. The team at Meta ran the following process:

Generate a large collection of programming problems: Collect a ton of organic code snippets from your seed dataset, spanning a diverse range of topics. Present these to a teacher model and ask it to generate programming problems that could be answered by these examples.
Generate solutions for the problems: Prompt a large model to generate code which answers your problems, many times and in many different programming languages.
Evaluate the solutions: Run all the generated solutions through a linter (a tool that finds potential errors and bugs) and a parser to ensure the code runs without errors. Throw out all the potential solutions that fail. For the remaining solutions, use a model to write unit tests for the problem to ensure the result of the executable code outputs a desired result.
Try to Correct the Errors: Send the failing solutions back through the large LLM, noting their bugs or failures, and ask for a rewrite. The Llama team found that 20% of the incorrect solutions were corrected with one additional try.
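The check and correct steps above can be sketched as a small loop. This is a toy version under stated assumptions, not Meta's pipeline: `generate` is a hypothetical callable standing in for the teacher model, Python's `ast` parser stands in for the linter and parser step, and a plain `exec` stands in for a sandboxed test runner.

```python
import ast

def passes_syntax_check(source: str) -> bool:
    """Parser step: reject code that is not valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def passes_tests(source: str, tests: str) -> bool:
    """Execute the solution, then the unit tests, in a fresh namespace.

    Real pipelines sandbox this step; exec() here is for illustration only.
    """
    namespace = {}
    try:
        exec(source, namespace)
        exec(tests, namespace)
        return True
    except Exception:
        return False

def solve_with_retry(problem, tests, generate, max_retries=1):
    """Generate a solution, check it, and ask for a rewrite on failure."""
    feedback = None
    for _ in range(max_retries + 1):
        candidate = generate(problem, feedback)
        if passes_syntax_check(candidate) and passes_tests(candidate, tests):
            return candidate
        feedback = f"Previous attempt failed checks:\n{candidate}"
    return None  # discard problems we could not solve

def fake_generate(problem, feedback):
    """Hypothetical teacher: buggy on the first try, fixed once given feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"

solution = solve_with_retry("Write add(a, b).", "assert add(2, 3) == 5", fake_generate)
```

Only solutions that survive both checks make it into the synthetic dataset, which is what lets this pattern move further from the seed data than rephrasing can.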

This process, combined with more rudimentary language translation, generated the majority of the synthetic data used to train Llama 3. The Phi-4 team performed a similar process as well, applying validation to all quantitative synthetic data they could test.

This process works: Phi-4 scores higher on coding benchmarks, “than any other open-weight model we benchmark against, including much larger Llama models.”

I find this ridiculously fascinating. These compound AI systems for generating new knowledge from seed data are ingenious and effective. LLMs, especially small models, have gotten incredible at code over the last two years. And while programming most naturally fits this approach, we might apply this problem-solution-check-correct pattern to any quantitative task.

A Growing Reliance on Synthetic Data Creates a Perception Gap

Spend some time reading technical papers for new models and you’ll notice a theme: a good chunk of the content deals with quantitative problems. Math and code are the focus right now, with new and complex synthetic data pipelines refashioning seed data and testing the results. The headline evaluations are quantitative tests, like MATH and HumanEval. Synthetic data is pushing models further and delivering improvements, especially in areas where synthetic data can be generated and tested.

There are additional sources of data that can help mitigate this bias. Proprietary user-generated data – like your interactions with Claude or ChatGPT – provide human signal and qualitative rankings. Hired AI trainers will continue to generate feedback that will tune and guide future models, but all of this relies on humans, who are slower, more expensive, and more inconsistent than synthetic data generation methods.

We will pick off testable use cases from qualitative domains – RAG performance, entity and snippet extraction from long texts – but most of the field will remain only accessible to distillation methods, notably rephrasing.

Synthetic data is a tool for scaling the data wall, but it’s lopsided. For non-quantifiable fields, it can only distill signals from existing data. It cannot create new knowledge.

As a result, we can expect the following:

Inference is now part of training: Building new models will require incredible amounts of inference, not just training. Multistep pipelines for generating synthetic data rely on teacher models to create the mountains of signal needed to push models forward.
Smaller models will get better faster: Smaller models will continue to improve while large models will progress slowly, as synthetic data techniques rely on larger, more capable teacher models. SemiAnalysis – in an informative piece arguing against an LLM slowdown – claims this dynamic is behind the delayed release of Claude 3.5 Opus: “Anthropic used Claude 3.5 Opus to generate synthetic data and for reward modeling to improve Claude 3.5 Sonnet significantly, alongside user data. Inference costs did not change drastically, but the model’s performance did. Why release 3.5 Opus when, on a cost basis, it does not make economic sense to do so?”
Models will get better at testable skills: Quantitative domains – like programming and math – will continue to improve because we can create more novel, massive, synthetic datasets thanks to unit tests and other validation methods. Qualitative chops and knowledge bank capabilities will be more difficult to address with synthetic data techniques and will suffer from a lack of new organic data.
An AI perception gap will emerge: Those who use AIs for programming will have a remarkably different view of AI than those who do not. The more your domain overlaps with testable synthetic data, the more you will find AIs useful as an intern. This perception gap will cloud our discussions.

As Ilya said in his talk, we’ll need new methods to push past the data wall. Synthetic data is an incredible tool, but it isn’t a silver bullet.


Pipelines & Prompt Optimization with DSPy

2024-12-12T10:50:00-08:00

I stumbled across DSPy while looking for a framework to build a small agent (I wanted to try out some new techniques to make my weather site more interesting) and found its approach to prompting interesting. From their site, “DSPy is the framework for programming—rather than prompting—language models.”

And it’s true: you spend much, much less time prompting when you use DSPy to build LLM-powered applications. Because you let DSPy handle that bit for you.

There’s something really clean and freeing about ceding the details and nuance of the prompt back to an LLM.

Let’s quickly walk through how DSPy handles prompting for you and step through a simple categorization task as an example.

A Quick Intro to How DSPy Works

At first, DSPy reduces time spent prompting by providing you with boilerplate prompting that frames your tasks, which you define with “signatures”. Signatures are a way of expressing what you want an LLM to do by defining the desired input and outputs. They can be as simple as strings, like:

'question -> answer'

You can also specify your types, like:

'sentence -> sentiment: bool'

Instinctually, I started looking for a dictionary of input and output types for signatures. But there isn’t one: signatures can use whatever terms you’d like, so long as they’re descriptive of your desired inputs and outputs. For example:

'document -> summary'
'novella -> tldr'
'baseball_player -> affiliated_team'

Signatures can also be defined as a class, which lets you add further specs for more complex tasks. But we’ll get to that later.

Signatures define your desired work, but they are used to generate prompts by DSPy “modules”. For our purposes today, think of modules as runners which apply a specific set of prompt techniques to generate a prompt and run it against an LLM. The foundational module is Predict, which doesn’t do much out of the box besides frame your signature with some boilerplate instructions.

For example, given the signature, question -> answer and the input question, “What is the capital of France?” the Predict module will call an LLM with the following system prompt:

Your input fields are:
1. `question` (str)

Your output fields are:
1. `answer` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is:
    Given the fields `question`, produce the fields `answer`.

And an accompanying user prompt:

[[ ## question ## ]]
What is the capital of France?

Respond with the corresponding output fields, starting with the field `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`

(I’ve replaced the \n characters with newlines above, for legibility)

The module is performing some basic string formatting to contextualize your provided signature. If we were to use a different module –– DSPy provides ChainOfThought, ProgramOfThought, ReAct, and MultiChainComparison –– different prompt techniques would be used to contextualize and reformat your signature.
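To make that string formatting concrete, here is a toy re-creation of the system prompt shown above. It is not DSPy's implementation: `parse_signature` and `render_system_prompt` are hypothetical helpers that only handle simple string signatures, and they assume any field without a type is a `str`.

```python
def parse_signature(signature: str):
    """Split 'a, b -> c: bool' into input and output (name, type) pairs."""
    inputs, outputs = signature.split("->")

    def fields(part):
        pairs = []
        for item in part.split(","):
            name, _, type_name = item.partition(":")
            # Fields with no explicit type default to str
            pairs.append((name.strip(), type_name.strip() or "str"))
        return pairs

    return fields(inputs), fields(outputs)

def render_system_prompt(signature: str) -> str:
    """Render a system prompt shaped like the Predict boilerplate above."""
    inputs, outputs = parse_signature(signature)
    lines = ["Your input fields are:"]
    lines += [f"{i}. `{name}` ({t})" for i, (name, t) in enumerate(inputs, 1)]
    lines += ["", "Your output fields are:"]
    lines += [f"{i}. `{name}` ({t})" for i, (name, t) in enumerate(outputs, 1)]
    lines += ["", "All interactions will be structured in the following way, "
                  "with the appropriate values filled in.", ""]
    # One marker block per field, inputs first
    for name, _ in inputs + outputs:
        lines += [f"[[ ## {name} ## ]]", "{" + name + "}", ""]
    lines += ["[[ ## completed ## ]]"]
    in_names = ", ".join(f"`{n}`" for n, _ in inputs)
    out_names = ", ".join(f"`{n}`" for n, _ in outputs)
    lines += ["In adhering to this structure, your objective is:",
              f"    Given the fields {in_names}, produce the fields {out_names}."]
    return "\n".join(lines)

prompt = render_system_prompt("question -> answer")
```

Printing `prompt` gives text in the same shape as the boilerplate above, which shows how little the base module adds beyond framing.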

Off the bat, this is helpful for quick LLM tasks, especially if you’re a beginner with prompts. But where DSPy really shines is when you ask it to optimize your prompts based on a provided training set.

Using DSPy to Categorize Historic Events

To illustrate how we can optimize prompts with DSPy, we’re going to use a simple toy problem: categorizing descriptions of historic events. While we do yield some gains with the following, this is designed to be a demonstration rather than a real-world approach (for a few reasons).

We’ve gathered the event descriptions by scraping Wikipedia’s date pages, obtaining a whole mess of descriptions like, “Battle of Nineveh: A Byzantine army under Emperor Heraclius defeats Emperor Khosrau II’s Persian forces, commanded by General Rhahzadh.”

First, let’s set up DSPy by running pip install -U dspy and the following lines:

import dspy

lm = dspy.LM('ollama_chat/llama3.2:1b', api_base='http://localhost:11434')
dspy.configure(lm=lm)

We’re using Llama 3.2 1b, running locally via Ollama (though you could use any of numerous adaptors). I like to start with small models when getting set up, as they help you iterate faster. DSPy and Ollama make it easy when we want to step up to a larger model, after we’ve got what we want running bug free.

We’re going to use a class-based signature because it lets us explicitly specify the categories we want our events categorized with:

from typing import Literal

class Categorize(dspy.Signature):
    """Classify historic events."""

    event: str = dspy.InputField()
    category: Literal[
        "Wars and Conflicts",
        "Politics and Governance",
        "Science and Innovation",
        "Cultural and Artistic Movements",
        "Exploration and Discovery",
        "Economic Events",
        "Social Movements",
        "Man-Made Disasters and Accidents",
        "Natural Disasters and Climate",
        "Sports and Entertainment",
        "Famous Personalities and Achievements"
    ] = dspy.OutputField()
    confidence: float = dspy.OutputField()

classify = dspy.Predict(Categorize)

# Here is how we call this module
classification = classify(event="[YOUR HISTORIC EVENT]")

Let’s quickly look at what prompt the Predict module generates for us based off this definition when we pass in the event, “Second Boer War: In the Battle of Magersfontein the Boers commanded by general Piet Cronjé inflict a defeat on the forces of the British Empire commanded by Lord Methuen trying to relieve the Siege of Kimberley.”

Here’s the system prompt:

Your input fields are:
1. `event` (str)

Your output fields are:
1. `category` (Literal[Wars and Conflicts, Politics and Governance, Science and Innovation, Cultural and Artistic Movements, Exploration and Discovery, Economic Events, Social Movements, Man-Made Disasters and Accidents, Natural Disasters and Climate, Sports and Entertainment, Famous Personalities and Achievements])
2. `confidence` (float)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## event ## ]]
{event}

[[ ## category ## ]]
{category}        # note: the value you produce must be one of: Wars and Conflicts; Politics and Governance; Science and Innovation; Cultural and Artistic Movements; Exploration and Discovery; Economic Events; Social Movements; Man-Made Disasters and Accidents; Natural Disasters and Climate; Sports and Entertainment; Famous Personalities and Achievements

[[ ## confidence ## ]]
{confidence}        # note: the value you produce must be a single float value

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
    Classify historic events.

You can clearly see where the bits of my signature class are being dropped into this prompt. Here’s the user prompt:

[[ ## event ## ]]
Second Boer War: In the Battle of Magersfontein the Boers commanded by general Piet Cronjé inflict a defeat on the forces of the British Empire commanded by Lord Methuen trying to relieve the Siege of Kimberley.

Respond with the corresponding output fields, starting with the field `[[ ## category ## ]]` (must be formatted as a valid Python Literal[Wars and Conflicts, Politics and Governance, Science and Innovation, Cultural and Artistic Movements, Exploration and Discovery, Economic Events, Social Movements, Man-Made Disasters and Accidents, Natural Disasters and Climate, Sports and Entertainment, Famous Personalities and Achievements]), then `[[ ## confidence ## ]]` (must be formatted as a valid Python float), and then ending with the marker for `[[ ## completed ## ]]`.

To which the LLM responds with:

I can help you classify historic events.

[[ ## event ## ]]
The Second Boer War: In the Battle of Magersfontein the Boers commanded by general Piet Cronjé inflict a defeat on the forces of the British Empire commanded by Lord Methuen trying to relieve the Siege of Kimberley.

[[ ## category ## ]]
Wars and Conflicts

[[ ## confidence ## ]]
0.75

But when I instigate this interaction by calling the Predict module, this is what we get back:

Prediction(
    category='Wars and Conflicts',
    confidence=0.75
)

Already, this is a huge win. We’ve spec-ed out our categorization problem in a few lines, in a way that it will be much easier to edit our potential categories, and got back structured results without having to get our hands dirty with prompting boilerplate or manipulating the LLM response.
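To give a sense of the response handling we get to skip, here is a rough sketch of turning a marker-delimited reply into typed fields. It is an illustration only, not DSPy's actual parser: `parse_response` and `FIELD_MARKER` are hypothetical names, and the reply below is abridged from the one shown earlier.

```python
import re

# Matches markers such as [[ ## category ## ]] and captures the field name
FIELD_MARKER = re.compile(r"\[\[ ## (\w+) ## \]\]")

def parse_response(text: str, casts: dict) -> dict:
    """Pull field values out of a marker-delimited reply and cast them."""
    parts = FIELD_MARKER.split(text)
    # parts = [preamble, name1, value1, name2, value2, ...]
    found = {name: value.strip() for name, value in zip(parts[1::2], parts[2::2])}
    # A missing field raises KeyError; a malformed value raises in the cast
    return {name: cast(found[name]) for name, cast in casts.items()}

reply = """I can help you classify historic events.

[[ ## category ## ]]
Wars and Conflicts

[[ ## confidence ## ]]
0.75

[[ ## completed ## ]]
"""

result = parse_response(reply, {"category": str, "confidence": float})
```

Note that the chatty preamble before the first marker is simply discarded, which matters with small models that like to introduce their answers.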

But the actual answers? They’re okay…not great. Lots of war events are categorized as political events (which…fair, I guess) and other times a tricky keyword will throw the results. We could go through and hand sort the results, but let’s take advantage of DSPy’s ease of model switching to compare Llama 3.2 1b to the new, excellent Llama 3.3 70b.

Here’s how:

import json
import pandas as pd

# classify_event wraps our classify module and returns (category, confidence);
# it is defined in the training-set snippet later in this post
with open("0101_events.json", 'r') as file:
    data = json.load(file)
events = pd.DataFrame(data['events'])

# Using our small model
with dspy.context(lm=dspy.LM('ollama_chat/llama3.2:1b', api_base='http://localhost:11434')):
    events['category_32_1b'], events['confidence_32_1b'] = zip(*events['description'].apply(classify_event))

# Using our large model
with dspy.context(lm=dspy.LM('ollama_chat/llama3.3', api_base='http://localhost:11434')):
    events['category_33'], events['confidence_33'] = zip(*events['description'].apply(classify_event))

events.to_csv("model_compare.csv", index=False)

28 out of 59 times, the models disagree, with Llama 3.3 in the right. But, this comes at a cost: Llama 3.3 ran ~10x slower.

Llama 3.3’s size provides much more context to situate these events, many of which couldn’t be categorized without knowledge of their subjects. In these instances, there’s not much we can do to help Llama 3.2 1b. Prompt engineering or fine-tuning isn’t going to add the diverse base knowledge needed for these calls.

But there are enough near misses that I think some improved prompting can eke out some gains from the 1b model.

Optimizing Our Prompts With DSPy

An aspect of DSPy modules we haven’t yet discussed is that we can optimize them. To do this, we need to define a metric and prepare some training data.

In DSPy, metrics are functions that take examples with ideal output and compare them to the output of our system. Here’s the one we’re going to use today:

def validate_category(example, prediction, trace=None):
    return prediction.category == example.category

As simple as it gets. If our example doesn’t match the output, it fails. (Check out DSPy’s docs for details on the example object here)
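Metrics are plain Python functions, so they are easy to loosen or tighten. As a hedged sketch, here is a hypothetical variant that ignores case and stray whitespace. With a Literal output type this is probably unnecessary, so treat it as an illustration of the metric contract rather than a recommendation; `SimpleNamespace` objects stand in for the example and prediction objects.

```python
from types import SimpleNamespace

def validate_category_normalized(example, prediction, trace=None):
    """Like an exact-match metric, but ignore case and surrounding whitespace."""
    expected = str(example.category).strip().lower()
    predicted = str(prediction.category).strip().lower()
    return expected == predicted

# SimpleNamespace stands in for the example and prediction objects here
example = SimpleNamespace(category="Wars and Conflicts")
close_match = SimpleNamespace(category=" wars and conflicts ")
wrong = SimpleNamespace(category="Economic Events")

matches = validate_category_normalized(example, close_match)
misses = validate_category_normalized(example, wrong)
```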

Next, we’ll generate a training set of 300 categorized events using Llama 3.3:

# Generating example predictions from Llama 3.3
import os
import json
import pandas as pd
from tqdm import tqdm

# Register progress_apply on pandas objects
tqdm.pandas()

# Define a function to classify the event description
def classify_event(description):
    try:
        prediction = classify(event=description)
        return prediction.category, prediction.confidence
    except Exception:
        # Fall back to placeholder values if the call or parsing fails
        return 0, 0

# Directory containing the JSON files
events_dir = 'events'
frames = []
total_rows = 0

with dspy.context(lm=dspy.LM('ollama_chat/llama3.3', api_base='http://localhost:11434')):
    # Iterate over all files in the directory
    for filename in os.listdir(events_dir):
        if not filename.endswith('.json'):
            continue
        filepath = os.path.join(events_dir, filename)
        with open(filepath, 'r') as file:
            data = json.load(file)
        events = pd.DataFrame(data['events'])
        events['category'], events['confidence'] = zip(*events['description'].progress_apply(classify_event))
        # Collect the results from each file
        frames.append(events)
        total_rows += len(events)

        # Stop once we have more than 300 rows
        if total_rows > 300:
            print('Breaking...')
            break

# Save the results to a CSV file
lmma_events = pd.concat(frames, ignore_index=True)
lmma_events.to_csv('llama_3_3_trainset.csv', index=False)

These answers are great, but generating them took a while. A good reminder why we should try to eke as much value out of smaller models for these exercises.

To see how our tiny model –– Llama 3.2 1b –– fares, we can use DSPy’s evaluator functions:

import csv
import dspy
from dspy.evaluate import Evaluate

# Load the trainset
trainset = []
with open('llama_3_3_trainset.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        example = dspy.Example(event=row['description'], category=row['category']).with_inputs("event")
        trainset.append(example)

# Evaluate our existing function
evaluator = Evaluate(devset=trainset, num_threads=1, display_progress=True, display_table=5)
evaluator(classify, metric=validate_category)

51.9% of the time Llama 3.2 1b gets it right, about in line with our previous comparison. Nice to know this scales.

To improve our system, we specify an optimizer and ask DSPy to run it on our function using our training data:

from dspy.teleprompt import *

# Load our model
lm = dspy.LM('ollama_chat/llama3.2:1b', api_base='http://localhost:11434')
dspy.configure(lm=lm)

# Optimize
tp = dspy.MIPROv2(metric=validate_category, auto="light")
optimized_classify = tp.compile(classify, trainset=trainset, max_labeled_demos=0, max_bootstrapped_demos=0)

Getting into the depths of DSPy optimizers is beyond the scope of this post, but we’re choosing MIPROv2 because we only want to optimize the prompt the module and signature are using. We aren’t fine-tuning any weights, just trying to find a way of prompting our LLM so we get results more in line with our desired output.

DSPy will use the LLM to generate other ways of prompting our model –– trying rephrases, using examples from our training set, and more –– to find a prompt which outperforms the boilerplate it generated above. As you stack modules and signatures, forming more complex prompting chains, this can get much more complex and obtain much better gains. Here we’re keeping it simple, using only one module and signature and asking that the optimizer not try few-shot prompts (aka prompts that include a few worked examples alongside the instructions).
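Stripped to its core, instruction optimization is a search: propose candidate instructions, score each against the training set with the metric, and keep the winner. The sketch below shows only that skeleton. MIPROv2 is considerably more sophisticated and this is not its algorithm; `fake_model`, the candidate list, and the two-row training set are all invented for illustration.

```python
def score(instruction, model, trainset, metric):
    """Fraction of training examples the model gets right with this instruction."""
    hits = sum(metric(ex, model(instruction, ex["event"])) for ex in trainset)
    return hits / len(trainset)

def pick_best_instruction(candidates, model, trainset, metric):
    """Evaluate every candidate instruction and return (score, instruction)."""
    scored = [(score(c, model, trainset, metric), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])

def exact_match(example, predicted_category):
    """Metric: the predicted category must equal the labeled one."""
    return example["category"] == predicted_category

def fake_model(instruction, event):
    """Hypothetical stand-in for an LLM: only handles battles when told to."""
    if "battle" in event.lower() and "battles" in instruction.lower():
        return "Wars and Conflicts"
    return "Politics and Governance"

trainset = [
    {"event": "Battle of Nineveh", "category": "Wars and Conflicts"},
    {"event": "A new constitution is ratified", "category": "Politics and Governance"},
]
candidates = [
    "Classify historic events.",
    "Classify historic events. Treat battles as wars, not politics.",
]
best_score, best_instruction = pick_best_instruction(candidates, fake_model, trainset, exact_match)
```

In a real optimizer the candidates come from an LLM rather than a hand-written list, which is why the choice of prompt-writing model matters.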

And wouldn’t you know it? It works. Our optimizer raised our evaluation from 51.9% to 63.0%.

It did this by making one slight change to our prompt. Where it previously read:

...

In adhering to this structure, your objective is: 
Classify historic events.

It now generates:

...

In adhering to this structure, your objective is:
Classify historic events. Consider using synonyms for "landed", such as "arrived" or "descended". Also, try to include more context about Charles II\"s actions and their potential political consequences.

That second part is some very specific over-fitting! Though the instruction to mind your synonyms seems beneficial and more generic. And the results look…pretty good! Running the new signature on a wider batch of data and eyeballing the results appears promising.

But we can do better. DSPy has a really neat feature that lets us specify the model we want to use for the task itself and another model for generating prompts. This is perfect for us, as it lets us leverage the much better Llama 3.3 to come up with prompting strategies while evaluating them against the tiny 3.2 model.

Here’s how:

import dspy

# Load our model
lm = dspy.LM('ollama_chat/llama3.2:1b', api_base='http://localhost:11434')
prompt_gen_lm = dspy.LM('ollama_chat/llama3.3', api_base='http://localhost:11434')

dspy.configure(lm=lm)

# Optimize
tp = dspy.MIPROv2(metric=validate_category, auto="light", prompt_model=prompt_gen_lm, task_model=lm)
optimized_classify = tp.compile(classify, trainset=trainset, max_labeled_demos=0, max_bootstrapped_demos=0)

At first blush, this yields worse results: 62% vs our previous 63%. But the output looks much better on initial review. It’s easy to see how using a big LLM helped us avoid over-fitting and obtain better instructions.

Here’s the new modification:

...

In adhering to this structure, your objective is: 
Analyze the given historical event descriptions, which may pertain to various domains such as politics, science, conflicts, or cultural movements, and categorize each event into its most suitable category (e.g., Science and Innovation, Politics and Governance, Wars and Conflicts). Provide a confidence score for each categorization, indicating the level of certainty in assigning the event to its respective category. Ensure that your analysis is based on the content and context of the event description, utilizing natural language processing techniques to accurately determine the category and confidence score.

We use it like so:

classification = optimized_classify(event="Second Boer War: In the Battle of Magersfontein the Boers commanded by general Piet Cronjé inflict a defeat on the forces of the British Empire commanded by Lord Methuen trying to relieve the Siege of Kimberley.")
print(classification)

# We can save our optimization with:
optimized_classify.save("optimized_event_classifier.json")

Saving allows us to reload the optimized system during a different session.

DSPy is super useful, especially as your pipeline grows from a single, 0-shot call to a multistep, tool-using agent. The pattern of abstracting prompt generation away and leaving it to the models to figure out based on defined metrics is quite powerful.

Why LLM Advancements Have Slowed: The Low-Hanging Fruit Has Been Eaten

2024-12-05T09:29:00-08:00

We’re finally coming to terms with the idea that foundation LLMs have hit a wall.

Thanks to decades of data creation and graphics innovation, we advanced incredibly quickly for a few years. But we’ve used up these accelerants and there’s none left to fuel another big leap. Our gains going forward will be slow, incremental, and hard-fought.

Reviewing the history of machine learning, we can both understand how the field advanced so quickly and why LLMs have hit a wall.

Sections

The Internet Created Big, Open Datasets That Led to Breakthroughs
The Graphics Industry Funded the Development of GPUs That Let Us Process Giant Datasets
LLMs Were Built With Three Decades of Internet Content & Graphics Innovations

The Internet Created Big, Open Datasets That Led to Breakthroughs

Data is a dependency for machine learning and AI progress.

In most computer programming, you explicitly write down the instructions that define your program. But with machine learning, we point a program at a pile of data and ask it to figure it out. The software comes up with rules, in the form of a model, which we then use to process new bits of data.

Sure, we’re glossing over the details, but this general pattern illustrates how machine learning – which includes LLMs – is not only limited and enabled by software and hardware, it is also limited and enabled by data. If there’s not much data or the data is of poor quality, the rules defined by machine learning software will be garbage.
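A toy contrast, with made-up data and hypothetical function names: the first function encodes a rule written by hand, while the second derives its cutoff from labeled examples. Real machine learning fits far richer models, but the dependency on data is the same.

```python
# Toy contrast between hand-written rules and rules learned from data.
# The data and names are invented for illustration.

def handwritten_is_spam(num_links: int) -> bool:
    """Traditional programming: a person picks the rule (more than 5 links)."""
    return num_links > 5

def learn_threshold(examples):
    """'Training': pick the link-count cutoff that best separates the labels."""
    candidates = sorted({n for n, _ in examples})

    def correct(cutoff):
        # Count how many examples the rule "n > cutoff" labels correctly.
        return sum((n > cutoff) == is_spam for n, is_spam in examples)

    return max(candidates, key=correct)

# (number of links in a message, was it spam?)
training_data = [(0, False), (1, False), (2, False), (3, True), (4, True), (9, True)]

cutoff = learn_threshold(training_data)

def learned_is_spam(num_links: int) -> bool:
    """The 'model': the same shape of rule, but its cutoff came from the data."""
    return num_links > cutoff

print(cutoff, handwritten_is_spam(4), learned_is_spam(4))
```

Swap in sparse or mislabeled training data and the learned cutoff degrades with it, which is the sense in which data limits the model.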

Because access to data is a governor for machine learning, it’s possible to recount the history of machine learning by touching on three key datasets: MNIST, ImageNet, and Common Crawl. Each allowed for a major breakthrough to occur, proving the viability of machine learning and setting off countless new developers and investors into the domain.

MNIST: A Small, Specialized Dataset Originally Delivered By Mail

Back in 1994 –– when the Internet was nascent and the number of websites numbered in the thousands –– the National Institute of Standards and Technology published a dataset of handwritten digits, which they distributed on two CD-ROMs. At the time, the dataset was a goldmine, but it wasn’t perfect. Yann LeCun – who had been working on neural networks that could read handwritten numbers at Bell Labs – tweaked the original NIST dataset to produce a more representative mix of samples, pre-formatted for neural network usage: numbers were centered into 28x28 pixel images, anti-aliased, and divided into ‘test’ and ‘train’ subsets.

MNIST illustrates what good data looked like before the rise of the Internet. Government departments had both the rare budget and the access to assemble a dataset from handwritten digits sampled from Census employees and high schoolers. The data was distributed via post, on multiple CD-ROMs. The datasets needed to build models capable of turning machine learning into an industry didn’t exist yet.

The new dataset, MNIST (modified NIST), was too large for LeCun’s current software. So he wrote a new version tailored for the dataset that delivered a groundbreaking error rate of 0.8%, a watershed moment for machine learning in the industry. At the time, AT&T used the software to read more than 10% of all checks deposited in the US.

"Hello MNIST."

Training a network off of MNIST became the "Hello World" of machine and deep learning. The dataset is included in nearly every machine learning framework and is frequently featured as the first project in many textbooks.

ImageNet: More Than a Million Images From and Categorized By Internet Users

14 years later –– smack-dab in the middle of the Web 2.0 era –– Fei-Fei Li, a computer science professor at Princeton, “became obsessed with an estimate by vision scientist Irving Biederman that the average person recognizes roughly 30,000 different kinds of objects.” Li wondered if she could build a comprehensive image dataset for neural network training. She began working on ImageNet, choosing categories from another dataset, WordNet, to build a shopping list for images.

To build ImageNet, Li hired workers on Amazon’s crowdsourcing platform, Mechanical Turk, to download images from Google Image Search and label them appropriately. Over two years, Li and her lab built and shared a dataset with 1,000 categories and 1.4 million images. To garner attention, Li began hosting a competition to correctly categorize ImageNet images using only software. After a middling couple of years, a 2012 neural network entrant named AlexNet achieved a score of 84.7%, 10.8 percentage points better than that of the runner-up.

AlexNet, written by Alex Krizhevsky with Ilya Sutskever (who would later co-found OpenAI) and Geoffrey Hinton (who just won a Nobel Prize), was a watershed moment that put neural networks firmly on the map. Its arrival seemingly reversed the declining trend of machine learning interest and set it on an upward path. Krizhevsky and co. were able to build AlexNet only because Li built ImageNet with Google and an army of crowdsourced workers.

The Internet allowed for the creation of datasets LeCun could only dream of, enabling broader applications and breakthroughs.

Common Crawl: The Internet Itself, Packaged as a Dataset

In 2007 –– the same year Li began work on ImageNet –– the Common Crawl Foundation was founded by Gil Elbaz. Elbaz’s first venture, Applied Semantics, created AdSense before it was bought by Google. After leaving Google, Elbaz formed Common Crawl out of a “desire to ensure a truly open web.”

The Common Crawl dataset is a massive open dataset. It contains information from more than 250 billion webpages, collected over 17 years. 3-5 billion pages are added a month. Over a decade after its founding, as people began to realize the benefits of building larger and larger LLMs, Common Crawl became a natural starting point for assembling LLM training datasets.

Like the original NIST handwriting dataset, Common Crawl was unwieldy for model builders. It was designed for researchers studying the web and programmers building new search platforms (remember, machine learning remained a niche field in 2007). So teams began filtering and preparing Common Crawl to easily compare results and save time as they iterated on model training techniques.

Google’s C4 dataset, prepared for the training of their T5 LLMs, is a great example of a commonly used Common Crawl variant. To prepare Common Crawl data for model pre-training, the team building C4 filtered out sentence fragments, boilerplate content (cookie alerts and privacy policies, for example), duplicates, source code, and offensive language, yielding a ~750GB subset. The filtered C4 dataset outperformed the unfiltered dataset by every metric.
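Here is a rough sketch of this kind of heuristic filtering. It is loosely inspired by the cleanup described above, not a reproduction of the C4 pipeline: the thresholds, the placeholder blocklist, and the function names are all assumptions made for illustration.

```python
# Simplified, C4-inspired text filter. Thresholds, blocklist, and names are
# illustrative assumptions, not the real C4 rules.

BOILERPLATE_MARKERS = ("cookie", "privacy policy", "terms of use")
BLOCKLIST = {"badword"}  # placeholder for an offensive-word list
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 3
MIN_LINES_PER_PAGE = 2

def clean_page(text: str):
    """Return the cleaned page text, or None if the page should be dropped."""
    if "{" in text:  # crude signal that the page contains source code
        return None
    if any(word in text.lower().split() for word in BLOCKLIST):
        return None
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # sentence fragment, menu item, etc.
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        if any(marker in line.lower() for marker in BOILERPLATE_MARKERS):
            continue
        kept.append(line)
    if len(kept) < MIN_LINES_PER_PAGE:
        return None
    return "\n".join(kept)

def filter_corpus(pages):
    """Clean every page and drop exact duplicates of the cleaned text."""
    seen, output = set(), []
    for page in pages:
        cleaned = clean_page(page)
        if cleaned is None or cleaned in seen:
            continue
        seen.add(cleaned)
        output.append(cleaned)
    return output

pages = [
    "Home | About | Contact\nThe comet was first seen in 1682.\n"
    "It returns every 76 years.\nWe use cookies to improve your experience.",
    "The comet was first seen in 1682.\nIt returns every 76 years.",
    "function f() { return 1; }",
    "Buy now",
]
print(filter_corpus(pages))
```

Of the four sample pages, only one cleaned page survives: the navigation line and cookie notice are stripped, the second page is a duplicate once cleaned, and the code and fragment pages are dropped.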

Common Crawl is a foundational dataset of the LLM age. 60% of GPT-3’s training data is from Common Crawl. It makes up 18% of The Pile, an open dataset used by Microsoft, Meta, Apple, Yandex, and others to train their models. But despite its giant size and continued growth, Common Crawl has less access to web content today than it did pre-ChatGPT, whose launch spurred media and social platforms to reevaluate their licensing terms. Researchers estimate that 25% of the highest quality data is no longer available to Common Crawl.

So where do we go from here? Projects like C4 and The Pile proved bigger datasets weren’t always better. Smaller models, tuned only on the best subsets, show competitive results at a fraction of the model size.

There isn’t a game-changing dataset out there ready to spur the field like MNIST, ImageNet, or Common Crawl did. While vertical-specific datasets will emerge (for example, the Overture Maps Foundation datasets) and companies will spend fortunes accruing user feedback, we’ve already used the Internet –– the largest general dataset of them all.

The Graphics Industry Funded the Development of GPUs That Let Us Process Giant Datasets

As datasets for training machine learning models grew, they required faster hardware. After extensive experimentation, GPUs –– which had evolved over decades to perform rapid computations for 2D and 3D graphics –– proved ideal for machine learning.

So ideal, in fact, that the AI gold-rush has benefited Nvidia more than any other company. The GPU maker’s market capitalization has grown by ~260% since ChatGPT’s launch, and currently sits at over $3.6 trillion. Nvidia’s journey to this position only recently had anything to do with machine learning. For the greater part of 3 decades, GPU innovation was paid for by the videogame marketplace.

The Constant Need to Perform Pixel Math

Initially, personal computers didn’t have dedicated graphics processors. The original Macintosh –– the first mass market personal computer with a graphical interface –– rendered its monochrome, 512x342 resolution screen (that’s 1/17th as many pixels as an iPhone 16) entirely with its CPU. It did this while also listening to the keyboard and mouse, managing the disk drive and RAM, and running applications.

If we wanted higher resolution color screens –– and we most certainly did –– our computers would have to work much, much harder to calculate the many possible values for hundreds of thousands of pixels, at least 60 times a second. Rendering a color version of the Macintosh’s screen –– in 8-bit color with 256 possible values, the same as the original Nintendo –– required 8 times as much memory and processing as a monochrome screen.
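The arithmetic is easy to check. This sketch uses the resolution and refresh rate quoted above; the iPhone 16 resolution (2556x1179) is an assumption, added only to reproduce the 1/17th comparison.

```python
# Back-of-the-envelope framebuffer math for the original Macintosh screen.
# IPHONE_16 is an assumed resolution, included only for the comparison.

MAC_WIDTH, MAC_HEIGHT = 512, 342
IPHONE_16 = (2556, 1179)
REFRESHES_PER_SECOND = 60

def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Memory needed to hold one full frame."""
    return width * height * bits_per_pixel // 8

mac_pixels = MAC_WIDTH * MAC_HEIGHT
mono = framebuffer_bytes(MAC_WIDTH, MAC_HEIGHT, 1)   # 1 bit: black or white
color = framebuffer_bytes(MAC_WIDTH, MAC_HEIGHT, 8)  # 8 bits: 256 values
iphone_ratio = IPHONE_16[0] * IPHONE_16[1] / mac_pixels

print(f"pixels: {mac_pixels:,}")
print(f"monochrome frame: {mono:,} bytes")
print(f"8-bit color frame: {color:,} bytes ({color // mono}x)")
print(f"8-bit color at 60 Hz: {color * REFRESHES_PER_SECOND:,} bytes per second")
print(f"iPhone 16 has {iphone_ratio:.1f}x the pixels")
```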

3 years after the launch of the Macintosh, in 1987, Apple shipped a Mac that could render 8-bit color. The Macintosh II achieved this milestone by including a “graphics card”: a separate device plugged into the motherboard with its own RAM and processor. The graphics card only performed pixel math –– figuring out what value each pixel should have, 60 times a second.

Throughout the late 80s and 90s, graphics hardware continued to specialize. Better cards and chips got you more pixels, more colors, and faster refresh rates. In the 90s gaming began to truly influence the market, spurring GPU card makers to add capabilities for 3D computations – transforming and clipping shapes, dealing with light and shading. These advancements, along with the necessary software adoption, allowed GPUs to render images, not simply draw them according to CPU instructions.

The goal was the same – draw better graphics, faster – and the card makers ruthlessly optimized towards that goal. Consider the Nvidia 8800 GTX, which launched in 2006. This card was a monster, the fastest GPU by a wide margin when initially released. It had 128 stream processors running at 1.35 GHz. Compare this to the Intel Core 2 Extreme, which landed in late 2006. It had only four 2.66 GHz cores. CPUs have a few big, generalized cores. GPUs have tons of small, simple cores.

The optimization of the GPU was chiefly paid for and influenced by the videogame marketplace. Gamers’ insatiable appetite for better graphics created a market for frequently updated cards, in arcade machines, home consoles, and PCs. As 3D gaming began, chips emerged just to handle polygon and lighting math (the Super FX chip that powered Star Fox being a notable example). These functions were eventually merged into the GPU itself.

The first prerequisite for machine learning arrived in 2001 with the GeForce 3, the first chip capable of programmatic shading. At a high level, this let developers define a short program, which could take assets like images as inputs, that would be run to compute a pixel’s value. Previously, they could only choose from a handful of predefined functions; now the primitives were exposed for writing your own functions. These functions couldn’t be as generic as CPU functions, but anything that could be expressed as pixel or 3D math could be run on the GPU’s copious cores.

Preparing Other Problems for Pixel Processors

In 2000, Stanford grad student Ian Buck built an 8K gaming rig using 32 GeForce cards. Buck was working on using distributed graphics systems to render larger displays and had to develop entirely new systems for coordinating this GPU computation. He followed this thread beyond the pursuit of bigger and better displays into more general computing use cases, culminating with his 2004 paper, “Stream Computing on Graphics Hardware.” He spells out the challenge in the abstract:

As the programmability and performance of modern graphics hardware continues to increase, many researchers are looking to graphics hardware to solve computationally intensive problems previously performed on general purpose CPUs. The challenge, however, is how to re-target these processors from game rendering to general computation, such as numerical modeling, scientific computing, or signal processing. Traditional graphics APIs abstract the GPU as a rendering device, involving textures, triangles, and pixels. Mapping an algorithm to use these primitives is not a straightforward operation, even for the most advanced graphics developers. The results were difficult and often unmanageable programming approaches, hindering the overall adoption of GPUs as a mainstream computing device.

The problem is, GPUs have tremendous computing power but only speak in graphics. In the paper, Buck presents his solution: Brook for GPU, a programming system for more easily writing general-purpose computation functions and translating them into GPU code. If your work could cosplay as a pixel problem, it could run really, really fast.
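To illustrate what "cosplaying as a pixel problem" means, here is a CPU-only sketch in plain Python. It is not Brook or GPU code, and every name in it is hypothetical. Ordinary numbers are packed into a 2D "texture," a small per-pixel function (the "shader") runs independently at every coordinate, and the output is unpacked back into a list.

```python
# CPU-only illustration of framing a general computation as pixel math.
# Not Brook or real GPU code; all names here are hypothetical.

WIDTH = 4  # width of our pretend texture

def pack(values, width=WIDTH):
    """Lay a flat list out as rows of a 2D 'texture', padding with zeros."""
    padded = values + [0.0] * (-len(values) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]

def saxpy_shader(x, y, tex_a, tex_b, scale):
    """The 'shader': computes one output pixel from the same spot in two inputs."""
    return scale * tex_a[y][x] + tex_b[y][x]

def render(shader, tex_a, tex_b, scale):
    """Run the shader once per pixel. No call depends on any other, which is
    what would let a GPU run them all in parallel."""
    return [
        [shader(x, y, tex_a, tex_b, scale) for x in range(len(row))]
        for y, row in enumerate(tex_a)
    ]

def unpack(texture, length):
    """Flatten the output texture and drop the padding."""
    return [value for row in texture for value in row][:length]

prices = [10.0, 20.0, 30.0, 40.0, 50.0]
fees = [1.0, 1.0, 2.0, 2.0, 3.0]

result = unpack(render(saxpy_shader, pack(prices), pack(fees), scale=2.0), len(prices))
print(result)
```

The data here has nothing to do with graphics, but because each output value depends only on the matching input positions, it fits the per-pixel mold that graphics hardware was built for.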

Unsurprisingly, Buck was hired by Nvidia in 2004 (where he remains today). There, he reckoned with Brook’s shortcomings and began a project that would fix them: CUDA, which launched in 2007.

Initially, no one was quite sure what to use general-purpose GPU computing for: cryptography, oil and gas exploration, stock market models, biology simulations, and physics simulations were all shotgunned out as potential applications. Nvidia trotted out prototypes and mock-ups, demonstrating physics simulations and biology toys, but they garnered little interest outside a few slices of academia.

Everyone knew CUDA was fast, but no one knew what it was for.

Which brings us back to 2012, to ImageNet and AlexNet, a defining moment in machine learning in more ways than one.

Krizhevsky, Sutskever, and Hinton built AlexNet using CUDA and two Nvidia GTX580s. The GTX580 was a stock consumer gaming card. It cost $500 for the top-tier model with 3 GB of RAM, a notably large size which enabled Krizhevsky and team to fit their network across two cards. When AlexNet ran away with the ImageNet contest, it not only illustrated the capabilities of neural networks but demonstrated how GPUs were essential tools for the job.

By 2015, everyone knew what CUDA was for and Nvidia focused the project entirely on the neural network use case¹. Their early and constant commitment paid off, as CUDA was ready for its time in the spotlight. Higher-level deep learning frameworks –– like TensorFlow and PyTorch –– made CUDA even easier to use, bringing GPU-acceleration to a much, much bigger pool of programmers.

LLMs Were Built With Three Decades of Internet Content & Graphics Innovations

CUDA² granted incredible, affordable computational performance –– honed over three decades in the graphics industry –– to the machine learning field. CUDA was Prometheus, stealing fire from the gamers and giving it to machine learning nerds. The field accelerated as we figured out how to make models from bigger and bigger datasets. With more minds able to play with machine learning, we discovered new techniques that let us build Large Language Models, or LLMs³, starting with Google’s BERT in 2018.

But we consumed these gifts –– an Internet’s worth of content and seven gaming console generations’ worth of graphics horsepower –– in only a few years. There isn’t another sector riding in tomorrow with the gifts needed to fuel our next breakthrough. Now we have to take our innovations in real time.

Recent trends demonstrate this pace:

Smaller proprietary training datasets are driving progress, not giant open datasets: Training data from humans using LLMs and giving them feedback –– “Reinforcement Learning from Human Feedback” or RLHF –– is becoming a more significant share of signal. Usage data, gathered when you use chatbots or when contractors code LLM output, is used to fine-tune foundation models so they interact more appropriately with users. This data is heavily guarded and hard-fought over –– which is why companies like OpenAI and Anthropic are fighting so hard for users. But this dataset is tiny, at least compared to the Internet. RLHF has its limits as well, due to trainers’ subjectivity. The fact that open models continue to close the gap with the best proprietary models suggests that these datasets aren’t as differentiating as one might hope.
Niche applications, not general intelligence, are making gains: We’ve been waiting for AGI for quite some time, but it’s always around the corner. Meanwhile, vertical-specific applications are making major gains as developers learn to apply LLMs and their techniques to narrow domains. Satellite imagery-specific models are powering incredible applications. And models which create interactive worlds are having a moment.
Big gains are being made in efficiency, not intelligence: We’ve covered this elsewhere: “Amongst the leading model builders, while quality has plateaued, efficiency has skyrocketed.” And the prices keep falling: Claude 3.5 Haiku dropped by 20% yesterday and Amazon’s Nova LLMs are pushing the floor even lower.
The chatbot conversation has fully embraced agents: Since OpenAI’s o1, everyone’s talking about agents. Among the McKinsey set, “GenAI” is so tired. All the cool kids are suddenly all-in on “AgenticAI.” My inner-skeptic says this is a key indicator that general LLM progress truly has stalled; chatbots can’t solve the problems we give them in one shot, so let’s give them several. Perhaps agentic models will yield better results –– by many metrics o1 already has –– but they’ll be even slower and pricier.
Machine learning workloads are being spread around the server: Especially when it comes to fine-tuning or inference, model workloads are being spread to lower-tier Nvidia silicon and non-Nvidia chips as new software libraries emerge. These software libraries will grow and mature (NPU libraries have nowhere to go but up!) and potentially evolve into backends that higher-level libraries can use.

These (and more) are irons in the fire that could yield big gains. But I’d wager we’ll instead see consistent, incremental results in LLM capability.

I don’t think this is a bad thing! (Unless you’re a company whose valuation hinges on delivering AGI.) I continue to believe we have incredible untapped potential in the current models that we’re only now learning how to apply. We could probably freeze LLM development and build valuable apps off the current state for years. As designers, developers, and other builders learn how to apply AI cogs selectively –– delivering ‘quiet’ AI features –– we’ll get improved existing tools and wholly new ones.

LLMs have hit a wall. Now begins the slow climb upward.

So much ink is spilled on Nvidia and its incredible market cap, but I continue to think we underappreciate the lessons from CUDA. Nvidia’s leadership kept investing in an R&D project in the hope that a new use case would emerge for their hardware, despite it not paying off for nearly a decade. And they didn’t try to force a use case – the net was cast wide until it was absolutely clear that ML training had immense potential. Finally, they didn’t fall for crypto. In earnings call after earnings call they downplayed revenue from cryptofarms hoarding GPUs –– they even saw it as a problem as it was incredibly frustrating for their gamer customers. CUDA is perhaps the primary reason Nvidia, and not AMD or Intel, is worth trillions of dollars. ↩
The multiplatform OpenCL framework, and why it didn’t find traction like CUDA, is beyond the scope of this post. ↩
If you want to learn more about how exactly LLMs work – how they turn language into pixel math in order to build the models behind your favorite chatbots –– I highly recommend 3Blue1Brown’s recent short primer on the topic. ↩