We Need AI Safewords

Standard commands that confirm an AI’s presence are a common-sense starting point for AI controls

The effectiveness of Large Language Models has spurred an AI renaissance. We’re awash in models capable of generating images from mere phrases and chatbots capable of holding confident, emotional conversations. The noise of crypto subsided just as this generation of deep learning models refigured technology conversations seemingly overnight, posing BIG questions that seemed years away only months ago.

People who hadn’t been following AI were whiplashed by the performance of ChatGPT, Stable Diffusion, and others. They asked questions about copyright law and the exploitation of training data, and wondered about labor regulations and worker rights if AI renders entire jobs obsolete. Others, who’ve been swimming in AI for years, set their sights far beyond these concerns and began to prepare for true artificial general intelligence, or AGI.

The wide range of applications these models seem to address has boggled our minds and diluted our conversations. Only now do we seem to be stepping back and gathering our senses. We’re seeing these models have limits and flaws. We’re starting to see that they’re not gods or monsters. Just organized bundles of statistics beyond the scale of human comprehension.

In this moment I’d like to ignore the big-picture questions and argue for something relatively mundane: safewords.

I believe fraudulent misuse of models poses an immediate risk. The ability of interactive models to generate voices, videos, and text capable of impersonating humans will become a powerful tool for social engineering at scale. The most common scams already target those unfamiliar with technology (especially seniors), manipulating them into giving up access to bank and e-commerce accounts. These tactics are so effective that large market ecosystems have emerged to execute them at scale, powered by human trafficking. This slavery both enables these scams and limits them, as the scam itself depends on canny humans manipulating marks one by one. New AI threatens to remove this limitation, scaling the scams beyond what even slavery allows and potentially enabling a diverse array of smaller-money tactics that previously weren’t worth the time.

The threat of AGI is a far-off dream. LLM-powered fraud is kicking off now.

A relatively quick regulatory action we can take is requiring government-approved safewords to be built into AI models. These safewords, when input into an AI-powered interface, would confirm to users they are interacting with an AI and provide basic metadata detailing the given model.

I haven’t thought through all the details, but it’s worth sketching this out in hopes of kickstarting the conversation.
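To make the idea concrete, here is a minimal Python sketch of how an AI-powered interface might honor a safeword. Everything in it is an assumption for illustration: the `/ai-disclose` safeword, the `ModelDisclosure` fields, and the `with_safeword` wrapper are hypothetical names, not part of any existing standard or library. A real safeword and metadata format would be defined by whatever body approves them.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable

# Hypothetical safeword; a real one would be set by a regulator or standards body.
SAFEWORD = "/ai-disclose"


@dataclass(frozen=True)
class ModelDisclosure:
    """Hypothetical metadata returned when the safeword is entered."""
    is_ai: bool
    model_name: str
    developer: str
    version: str


def with_safeword(
    generate: Callable[[str], str],
    disclosure: ModelDisclosure,
) -> Callable[[str], str]:
    """Wrap a text-generation function so the safeword is always honored."""

    def respond(user_input: str) -> str:
        # Check for the safeword before the model sees the input, so no
        # prompt or fine-tuning can talk the system out of disclosing.
        if user_input.strip().lower() == SAFEWORD:
            return json.dumps(asdict(disclosure), sort_keys=True)
        return generate(user_input)

    return respond


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"You said: {prompt}"


chat = with_safeword(
    echo_model,
    ModelDisclosure(
        is_ai=True,
        model_name="example-model",
        developer="Example Labs",
        version="0.1",
    ),
)

print(chat("hello"))
print(chat("/ai-disclose"))
```

The one design choice worth noting is that the check runs outside the model itself: the wrapper intercepts the safeword before any generated text is produced. Whether the rule lives in the base model, the serving layer, or both is one of the details that would need working out.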

Adding safewords to base models – the largest, foundational models which are tuned into custom applications (GPT is one such base model) – is an effective mechanism for adding regulation without hindering the ecosystem at large. Large base models require significant resources to build, limiting the number of parties able to develop them. Safewords added as custom rules to these models will not hinder their effectiveness for any non-fraudulent use case.

There is precedent here: AI safewords are akin to WHOIS and other ICANN mechanisms for website accountability and transparency. Optional standards, like robots.txt, also define ways forward and models for implementation.

Requiring AI-powered interfaces and models to respond appropriately to approved safewords will not eliminate fraudulent behaviors by bad actors. But it lays the groundwork for enforcement mechanisms that governments can use to police bad actors and reduce the ease with which actors can leverage the largest, most effective models for illicit actions. Further, such requirements and enforcement should not hinder AI innovation. So long as a model complies with safeword requirements, leeway is granted.

Finally, AI safewords provide a tangible escape hatch for users when AI models start to cross emotional lines. We humans are hardwired to engage in anthropomorphism, seeing the mark of intelligence where only hyper-scale cut-and-paste exists. By building in standard mechanisms for confirming a model’s inhumanity, we allow users to ground themselves on occasion.