---
title: "Extreme Compression with AI: Fitting a 45 Minute Podcast into 40kbs"
date: 2023-11-07
author: Drew Breunig
description: "Can deep learning change our approach to data transmission in extreme bandwidth-constrained contexts? Let's build a proof-of-concept to reduce a podcast to less than one-tenth of a percent of its original size, then reconstitute the audio with text-to-speech."
tags: ["ai", "poc"]
url: https://www.dbreunig.com/2023/11/07/extreme-compression-with-ai.html
---

![Generated with DALL·E](/img/inside_the_stereo.jpg)

Way back in 2018, my friend [Pete Warden](https://petewarden.com/) wondered if [data compression would be machine learning's killer app](https://petewarden.com/2018/10/16/will-compression-be-machine-learnings-killer-app/). Looking back, in the age of ChatGPT, this thought seems positively quaint. Artificial intelligence boosters are talking about how AI will generate new medicines, new materials, and new content. But the idea of using AI to compress data is still a powerful one.

I touched briefly on this notion [last April](https://www.dbreunig.com/2023/04/10/the-privacy-question-and-open-ai.html):

>The only thing that isn’t big about LLMs is the filesize of the model they output. For example, one of Meta’s LLaMA models was trained on one trillion tokens and produced a final model whose size is only 3.5GB! In a sense, LLMs are a form of file compression. Importantly, this file compression is lossy. Information is lost as we move from training datasets to models. We cannot look at a parameter in a model and understand why it has the value it does because the informing data is not present.

>(Sidenote: this file compression aspect of AI is a generally under-appreciated feature! Distilling giant datasets down to tiny files (a 3.5GB LLaMA model fits easily on your smartphone!) allows you to bring capabilities previously tied to big, extensive, remote servers to your local device! This will be game-changing.)

I found myself thinking about compression yesterday while reading about [OpenAI's DevDay Announcements](https://openai.com/blog/new-models-and-developer-products-announced-at-devday). Way down at the bottom (below all the exciting stuff) is a new text-to-speech (TTS) API and an updated version of Whisper (OpenAI's speech recognition model). Neither application is particularly novel (you can [take your pick of TTS models over at Hugging Face](https://huggingface.co/tasks/text-to-speech)), but the two announcements taken together suggest a comically extreme audio compression pipeline.

With an API account and less than a dollar in credits, we can transcribe an audio file into a text file, then generate speech from that text. If we only transmit the text file over the network and run our TTS model at the edge, our bandwidth savings are *monumental*.
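In code, the whole round trip is only a few calls. Here's a rough sketch using the OpenAI Python SDK; the file names are made up, and note two real API limits: Whisper caps uploads at 25MB, and the TTS endpoint caps input at 4,096 characters per request, so a 45-minute episode needs chunking on both ends.

```python
# A minimal sketch of the compress/rehydrate round trip, assuming the OpenAI
# Python SDK (`pip install openai`) and an OPENAI_API_KEY in the environment.
# File names are illustrative.

def chunk_text(text: str, limit: int = 4096) -> list[str]:
    """Split the transcript into pieces under the TTS per-request limit."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]

def shrink(audio_path: str) -> str:
    """Transcribe the audio with Whisper; the text is all we transmit."""
    from openai import OpenAI  # imported here so chunk_text stays dependency-free
    client = OpenAI()
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

def rehydrate(text: str, out_path: str) -> None:
    """Regenerate speech at the edge, one chunk at a time."""
    from openai import OpenAI
    client = OpenAI()
    with open(out_path, "wb") as out:
        for chunk in chunk_text(text):
            resp = client.audio.speech.create(
                model="tts-1", voice="alloy", input=chunk
            )
            out.write(resp.read())  # naive MP3 concatenation; fine for playback

if __name__ == "__main__":
    transcript = shrink("automata.mp3")  # hypothetical local filename
    rehydrate(transcript, "rehydrated.mp3")
```

Splitting on raw character offsets can cut a sentence mid-word; splitting on sentence boundaries would sound better, but it keeps the sketch short.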

I ran the pipeline on an episode of my favorite podcast, [In Our Time](https://www.bbc.co.uk/programmes/b0bk1c4d), and obtained the following file sizes for each step:

<iframe title="&quot;In Our Time: Automata&quot; File Size in Bytes" aria-label="Bar chart" id="datawrapper-chart-24fG6" src="https://datawrapper.dwcdn.net/24fG6/1/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="160" data-external="1"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r=0;r<e.length;r++)if(e[r].contentWindow===a.source){var i=a.data["datawrapper-height"][t]+"px";e[r].style.height=i}}}))}();
</script>

The transcription text file is just *0.08% the size* of the original audio file.
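As a sanity check on that figure, here's the arithmetic with assumed sizes in the right ballpark (the exact byte counts are in the chart above): a 45-minute MP3 at typical podcast bitrates is on the order of 43MB, while the transcript is a few tens of kilobytes.

```python
def compression_ratio(compressed_bytes: int, original_bytes: int) -> float:
    """Compressed size expressed as a percentage of the original."""
    return 100 * compressed_bytes / original_bytes

audio_bytes = 43_000_000  # assumed original episode size, not the exact figure
text_bytes = 35_000       # assumed transcript size

print(f"{compression_ratio(text_bytes, audio_bytes):.2f}%")  # prints "0.08%"
```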

And the output doesn't sound terrible! 

Here's Melvyn Bragg in the original:

<audio controls>
  <source src="https://pub-931759a51c654585bb3041ecb61ef9ad.r2.dev/iot_automata_intro_mono.mp3" type="audio/mpeg">
Your browser does not support the audio element.
</audio>

And OpenAI's "alloy" voice in the TTS output:

<audio controls>
  <source src="https://pub-931759a51c654585bb3041ecb61ef9ad.r2.dev/tts_automata_intro_44m.mp3" type="audio/mpeg">
Your browser does not support the audio element.
</audio>

Look, I'm going to take Melvyn's cadence and voice any day of the week, but the TTS output is very, very listenable. And in extreme situations where bandwidth is severely limited and only available for short times (think: the International Space Station, Antarctic labs, cruise ships, battlefield frontlines, etc.) this kind of compression could be a game-changer.

And there's nothing to stop us from taking this further. We could deploy customized voice models at the edge, trained on specific speakers. We could add diarization and use different voices for different speakers. Rehydrating text into audio could take different approaches for different sources (like speeding up a slow speaker or emoting a flat one). 

There's a multitude of possibilities in just this one specific niche. How might we expand it to images or video? What other use cases are ripe for applying this type of extreme compression?

--------
