If you already have an app with users, how should you use LLMs?

Seasoning a dish with words.

AI hype is crazy right now.

People claiming large language models (LLMs) will replace millions of jobs. Others claim AI can predict tomorrow’s stock market from today’s headlines. Tools like AutoGPT have stirred the froth further, with “thought leaders” (who used to only talk about crypto but are suddenly positioning themselves as AI experts, for some reason) claiming these scripting-but-for-AI implementations are showing signs of true AGI (artificial general intelligence).

Frankly…it’s exhausting.

But the volumes and claims are hard to ignore.

Those of us with existing projects, products, or companies are starting to feel that dreaded twinge of FOMO. Should we scrap our roadmaps and bolt on semi-sentient customer support avatars? Should we scrap our prototypes and pivot to generative, on-demand comic books? Or should we merely play around with Stable Diffusion and ChatGPT all day when we should be working?

Maybe we should just tune it all out.

I’m going to suggest we do none of those things. Instead, app builders should dramatically limit the scope of how we envision using LLMs. We should not see AIs as nearly-human, magical APIs which can fulfill our craziest hopes and dreams. We should keep them on a short leash, rather than give them enough rope to hang ourselves.

Why Keep AIs On a Short Leash

AIs produce inconsistent output for an unpredictable price while adding unknown risks.

Let’s break down each of these in order:

Inconsistent Output

  • AI’s aren’t actually smart. They’re sophisticated guessing machines powered by a comical amount of training data. They don’t know or understand anything. They take an input, compare it to previous contexts they’ve seen, and take their best guess at a response. This can result in inconsistent and erroneous responses.
  • LLMs can only be ‘programmed’ with imprecise natural language. The dirty secret of AI is we don’t really know what’s going on in the parameters of the model itself. We can’t reach in and discretely program them to respond in specific ways. Our only recourse is to add more training data with natural language and feedback. Any linguist will tell you, language is not precise. It’s neat that you interact with OpenAI’s APIs with English sentences. But in many ways, this is a bug, not a feature.
  • As a result, there’s no guarantee your API calls to OpenAI will be correct or yield reproducible results. For example, while testing some of the use cases in this article I’d ask GPT to respond with a JSON array. For a few prompts it would wrap the JSON array in Markdown code block formatting and add comments, seemingly at random. Making the prompt more precise or rewriting the prompt using a different approach didn’t help. I ended up having to use regular expressions to extract the JSON reliably. Who knows if that will continue to work in the future!

Unpredictable Costs

  • LLMs can get very expensive at scale. This excellent guide by Chip Huyen details how these calls can stack up. A reasonable GPT-4 query might cost ~$0.60(!). A more reasonable GPT-3.5-turbo call only costs ~$0.04. That’s tiny for one-off or personal usage, but tie that to a feature that fires up for most page requests and that cost adds up fast.
  • LLMs can be frustratingly slow. You’ve likely experienced this if you’ve typed a prompt into ChatGPT. Responses aren’t instant and ChatGPT hides the total time by scrolling out each work as it lands, rather than waiting for completion. Any features you build into your app will increase latency. And it will be inconsistent, to boot!

Unknown Risks

If you’re a small shop or team with a successful app, you’re likeley breaking out in cold sweat right about now just thinking about dropping AI into your stack. LLMs can’t be tested, could balloon your costs, and expose you to unknown and exotic security and liability edge cases.

But despite everything above…I still think you should explore AI. We’re going to show off some use cases below that are powerful, fascinating, and valuable. Machine learning’s ability to make more things programmatic, the messy things which humans create addressible by computers and code, has amazing potential.

If we can mitigate the risks above – by keeping AI on a short leash – we can expand the productivity and capability of our teams and build better interfaces.

Our Short Leash

Given the above, I’ve arrived at 4 rules app developers should use when considering AI, specifically LLMs.

  1. Treat AI work like you would expensive database jobs. Calls to AIs take too long and can be expensive. Use them sparingly and only in asynchronous jobs.
  2. When possible, keep a human in the loop. AI output should be approved for broader use by a select user or even the site admin. If a task has sufficient guardrails (both in the prompt, handling of the response, and app UX), at least allow users to edit or flag bad work.
  3. Never give a user direct access to an open text box connected to an AI. Prompt injection is a bear of a problem your team won’t solve. You protect against SQL injection; protect against prompt injection. The tech just isn’t ready yet.
  4. Don’t count on any particular LLM. Don’t count on OpenAI. Don’t count on Bard. Regulatory questions are going to be hashed out over years and during that time your region might ban or limit the tool you use. Also, these things are expensive and platforms might raise their prices to an amount you can’t swallow. Expect change. Don’t build mission-critical features on an LLM, especially one you don’t control. Further, save your prompts and interactions. Your interactions (approvals, flags, etc) are valuable training datasets relevant to your domain. Bring them with you if you need to shift shops.

There are probably more than 4 rules for keeping AI on a short leash. Please give me a shout if you have any candidates!

But for now, these are our guardrails. Let’s take a look at a few valuable use cases for app makers that stay within this safe zone.

How To Use AI In Your App

For our discussion, let’s assume we have a Rails app. This discussion holds for Django, Express, React, Vue, Laraval, or whatever. I’m using Rails as an example because I do have a Rails app, which is generic enough for our purposes.

My Rails app is a hobby app called StepList I and a few others use to create, manage, and perform checklist routines. In StepList you might create a checklist (“Weekly Home Cleaning”), schedule that list (remind me to perform it every Sunday at 1 pm), perform the list, and review your past performance.

It’s a simple app, nothing fancy. Similar to many to-do list tutorials. (If you’re curious, I’m happy to keep you posted.)

The home screen and performing a routine in StepList.

The home screen and performing a routine in StepList.

So, following our Short Leash Guidelines above, how might we use LLMs, OpenAI’s APIs specifically, to ship a better app?

In my eyes, app developers should use LLMs to expand our programmatic domain, beyond the design and function of the app, into our app’s content.

To use our Rails example, we add a new domain and language into our kit:

  • Postgres for our database
  • Ruby for our server-side logic
  • ERB for templating our HTML
  • CSS for styling
  • Javascript for client-side interactivity
  • Prompts for content manipulation

Given this framework and our Short Leash Guidelines, below are 4 example AI use cases (using my StepList app as an example) divided into Tooling and Content Management categories.

(All of the examples below are available as functioning proofs-of-concept on Github.)

Practical AI Use Cases For Apps: Tooling

Database Seeding with AI Faker

When building your app it’s nice to have fake, but realistic, data populating your development database. This helps ensure your app’s design and function accomodate a wide variety of content and catches bugs and edge cases before real users arrive. A popular tool for generating fake data is Faker. With Faker you can generate piles of names, email addresses, websites, text, and so much more. We use Faker to develop StepList. It creates plenty of user profiles, team names, and (of course) boilerplate text for our checklists.

But there’s a problem: Faker doesn’t specifically generate checklists or even titles for posts. We’re left having to either write our own lists by hand for loading or generate nonsensical Lorem ipsum strings. Neither is a great option. Our own lists will take up our own time and be biased toward how we imagine our app will be used. While nonsense strings won’t accurately represent list content at all.

For fake content needs, GPT-3 works wonderfully. In the following script we:

  1. Provide example titles of checklists to give the LLM an example of good output.
  2. Call GPT-3.5 with the prompt, “Write 50 more strings like the following strings, formatted in a JSON array with a separate element for each new string,” followed by our list of examples.
  3. Parse the JSON output into an array of strings.

For this example, I’ll show you the code. (The rest can be found on Github)

Here’re the first 5 responses I received just now:

  • How to make a perfect omelette
  • Planning a road trip across the country
  • A daily meditation practice
  • Steps for organizing your closet
  • Practicing your public speaking skills

Now we can populate our app with as much fake content as we want, in a format that is less biased by our own expectations with nearly no effort. We can reduce this bias further, as we iterate, by either dialing up the ‘temperature’ parameter in our GPT call or by feeding in real titles provided by our initial users.

Generating fake data with LLMs is a perfect short-leash use case. We can run it only occasionally, caching a large library of generated data as we go. It has zero exposure to our end users and can be generated with any LLM we have access to, since quality isn’t terribly important here.

Auto Localization

If your app finds traction, you’ll start thinking about localization (translating your app for users in other markets) at some point. According to the Rails docs, “The process of ‘internationalization’ usually means to abstract all strings and other locale specific bits (such as date or currency formats) out of your application. The process of ‘localization’ means to provide translations and localized formats for these bits.”

This means all the words and text in your app’s interface (the page titles, navigation elements, button text, etc.) get extracted into a single file for each language. Each language file is organized with the same keys (like, “welcome_back_message”) which map to user-facing strings (like “Welcome back!” or “Bienvenido de nuevo!”).

Again, this is a perfect fit for an LLM, which excel at translation. Here’s a script which:

  • Reads in an ‘en.yml’ file containing all our English strings and their mappings.
  • Accepts an array of languages we want to translate our file in this instance we provide: [“es”, “fr”, “de”, “it”, “ru”, “ja”]
  • We then prompt gpt-3.5-turbo with, “Translate the following string into the following languages: #{desired_languages}. Return only a JSON dictionary with each respective language abbreviation as a key and the corresponding translation of the following text as the value. Here is the text to translate into each language:\n#{text}”.
  • We run this prompt for every one of our English strings, building up a dictionary of their respective translations.
  • Finally, we walk through the original ‘en.yml’ file mappings and output each language’s equivalent file.

Is it perfect? Almost certainly not! Can you provide it with dozens of languages at once and receive dozens of localized files in under a minute? Hell. Yes.

This perfectly fits our short leash guide. It’s a job that is run rarely, never directly exposed to our users, and can be reviewed by us or contracted experts for correctness later. This is a powerful, low-risk example of how we can use LLMs to dramatically expand the usability of our app with the tiniest effort.

Practical AI Use Cases For Apps: Content Manipulation

Content Categorization

In StepList, we prompt users to tag their lists with categories to help others discover lists. Here an author has helpfully tagged this list with “Cycling” and “Exercise”:

A list which has been tagged with 'Cycling' and 'Exercise'.

Most authors don’t tag their lists and almost all authors don’t comprehensively tag their lists. And we shouldn’t expect them to! It’s not their job to add “biking”, “bikes”, “century”, “training”, and more to a list like this one.

But it can be OpenAI’s job!

In this script we feed gpt-3.5-turbo text we extract from a URL along with the prompt, “Analyze the following article text and provide #{number_of_desired_categories} 1 to 3 word topics to help a user understand the subject being discussed. Return these categories as individual strings contained within a JSON array.” In this script, we ask for 5 categories.

In our app (or your app), we can write this task as a job that is created when a list is published. We don’t have to parse a URL, we can simply send the list’s title and steps straight from our database. Finally, we can save the returned tags as ‘recommended_tags’ are treated differently than user-entered tags in our app. For example, they could be presented to the list’s author or our site admin for approval. Or, prior to approval or rejection, they could be treated with a different color while still aiding discovery.

In our first example that exposes AI output to users, we’re still keeping the LLM on a short leash. Users can’t directly send open-ended text to OpenAI and we can put constraints on how we parse the tags returned by OpenAI (for example, rejecting any category longer than 3 words). At worst, users are presented with bad tags. But we keep humans in the loop and allow them to accept or reject bad advice. Finally, we’re constraining our prompt to a job that fires only with list publishing, preventing usage from running amok or dragging down our interface. With little risk we can greatly improve content discovery with low effort from our users or our team.

Content Summaries

In app frameworks, we frequently use view components: encapsulated building blocks for assembling our UI independent of our content. Often, these components are responsive, meaning they adjust themselves appropriately for smaller and larger screens.

If we have a list of StepLists for a user to browse, we want to be selective about how much detail we show given the context. In a small context – a cellphone, watch, or even a dropdown menu – we should show just the title of a list. Maybe a tag or two. But on a desktop browser, we can show the title and a brief description of the list.

A user could write this description. But what if they don’t? Or, what if we want several descriptions of different lengths, each summarizing the list using a different amount of words for different view contexts? That way, if a user is browsing a long list on a desktop they might see a one-sentence description under each title and if they click onto a list page they are presented with a more detailed paragraph.

Asking our users to write a single description is one thing. Asking them to write several for different view contexts is ridiculous. But it’s reasonable to ask an LLM!

Once again, we’ll provide gpt-3.5-turbo the title and contents of the list. Initial requests to summarize our list yielded pretty bland results that reiterated steps compactly. After a few revisions to our prompt, I arrived at the following: “Write a description for the following list both as a brief sentence and a paragraph with 2-4 sentences. Don’t just summarize the steps, also express why following this list might be important. Return the brief sentence and paragraph as two strings in a JSON array. Here is the list…” Using this prompt with our, “Get Ready For a Long Bike Ride” list yielded the following:

  1. “Preparing adequately before a long bike ride can make the difference between a fun and safe ride and one that is frustrating or even dangerous.”
  2. “This routine provides a comprehensive checklist to ensure that you have all the necessary equipment, tools, and supplies to handle any issues that may arise on the ride, as well as properly fueling and hydrating yourself. Taking these steps can increase your enjoyment of the ride and help you stay safe and prepared in case of unexpected situations.”

Like our previous use case, we’ll only run this task as a background job when someone publishes a list. We can save the parsed response as ‘short_description’ and ‘long_description’ in a ‘ListDescription’ object. And, to ensure we’re capturing our interactions with the AI for future training or migration, we’ll note when we generated the descriptions and have fields for author approval and flagging.

This use case is perhaps the best example of expanding our programmatic domain, beyond the design and function of the app, into our app’s content. Without an LLM, responsively adjusting our content might involve us truncating our labels (just chopping them off after a fixed number of characters) or not showing them at all. With LLMs we can selectively adjust them while preserving contextual intent, rewriting content to fit different view contexts.

And all while keeping the AI on a short leash.

To Recap

AIs produce inconsistent output for an unpredictable price while adding unknown risks.

But they allow app makers to expand our programmatic domain, beyond the design and function of an app, into an app’s content.

LLMs can dramatically improve our productivity and the usability of our apps. But we need to manage the risks by establishing guardrails.

Keep AI on a short leash. Don’t give them enough rope to hang ourselves.

Keep AI On a Short Leash

  • Treat AI work like you would expensive database jobs: Use them sparingly and in the background.
  • When possible, keep a human in the loop: Allow users or admin to accept or reject generative content.
  • Never give a user access to an open text box connected to an AI: Prompt injection is an unavoidable threat.
  • Don't count on any particular LLM: Don't give them mission-critical work, save your interactions, and be ready to migrate.

Have thoughts? Send me a note