Manipulating AI: Prompt Injection, Trap Tokens, & Post-Training Suggestion

A sweaty Frank Sinatra uses the Queen of Diamonds exploit.

The near-infinite attack vectors of AI

Yesterday we talked about the potential of AI Bombing, manipulating LLMs through coordinated, malicious usage:

Organically or for hire, people will coordinate their usage of an LLM, prompting the model with specific feedback. They might provide false information — associating a context with an untrue fact — or true, private information — seeding everything in a model needed to dox an individual when asked.

But this tactic is not constrained to the user interfaces of these models. Manipulative instructions can be (and likely are) planted in external data sources, which are ingested during training data compilation or referenced by plug-in style connectors.

Here’s a fascinating proof-of-concept by wunderwuzzi where they speak prompt injection commands in a YouTube video and then demonstrate their ability to hijack a ChatGPT prompt via a YouTube transcription plug-in. It’s bonkers.

A screenshot of wunderwuzzi's spoken prompt injection YouTube video

What’s scary about this demonstration is that the attack vector is so damn large. If LLM-powered applications reference the live internet, manipulative instructions can be planted anywhere. They can be written, spoken, or sung. Similar to how you should never click a hyperlink in an email, perhaps you should never prompt an AI to reference a bit of content you don’t understand or control.

So how might this be used?

I wouldn’t be surprised if we see prompt injection in song lyrics. The battleground over training data rights is only beginning, and it looks like the music industry will drive the most significant early pushback. But in the interim, could two bars in a verse poison any AI that attempts to parse a hit song?

Beyond prompt injection, will we see intentionally unique phrases in lyrics so owners can prove their work was used as training data? Such a tactic has precedence in Trap Streets, intentionally false details inserted into street maps to catch plagiarists. “Trap Tokens” in content could be designed: unique phrases that have a low likelihood of occurring elsewhere in training data and are sufficiently novel to not occur as hallucinations. Being able to elicit Trap Tokens could establish proof a model references your data.

If we combine the ideas of Trap Tokens and prompt injection we arrive at a manipulation technique akin to post-hypnotic suggestion: intentionally planting input in training datasets for purposes of later, post-training manipulation. We could plant unique phrases or intentional contextual associations in freely available datasets. Or we could craft honeypot websites with carefully designed content, waiting to be scraped. Let’s call this “Post-Training Suggestion”.

Such a tactic would be inconsistent, imprecise, and delayed. You wouldn’t be able to verify success until new models ship and are made available. However, Post-Training Suggestion is a much more useful tool for those involved in training the models. For example, an individual worker could plant a backdoor via a Post-Training Suggestion. Given the scale of training datasets, finding these latent threats is nearly impossible.

Prompt injection, trap tokens, and post-training suggestions are curiosities today, given how and where LLMs are deployed. But as such apps proliferate – either tied into central services like OpenAI or sharing common foundations like LLaMA – it’s easy to come up with potential issues.

I’m reminded of the scene in The Manchurian Candidate where Frank Sinatra breaks Laurence Harvey’s hypnotic trance by hijacking his post-hypnotic suggestion trigger, the Queen of Diamonds. Plant the right cues in a training dataset today and you might unlock tomorrow’s AI-powered cameras with a Queen of Diamonds.