More Questions Than Answers

Generated, ironically, with DallE

Last February, curious about how LLMs might comply with privacy regulations like CCPA and GDPR when it comes to their training data sets, I filed a CCPA request with OpenAI. Over 101 days, we went back and forth. They never really settled the original request, but the exchange illuminated their approach to privacy and training data (or lack thereof) and raised more questions than answers.

Below is the complete thread. I’ve redacted the names of OpenAI’s support staff.

Stick around afterwards for some closing thoughts on the exchange and the state of LLMs and privacy.

I filed my initial request on February 13th, 2023:

On March 7th, 22 days later, an OpenAI representative responded. I’ve redacted their name and email address.

I assume this is a canned response, one geared towards enterprise account support. Interestingly, it shows OpenAI had the ability to manually prevent an account’s data from being used for training purposes, a feature which yet to be announced. But that’s neither here nor there.

The next day I asked for clarification and reiterated my request:

On March 16th I requested an update. On March 24th I received this note:

Gone are the niceties!

I assumed they were deep underwater with support tickets (understandable) and/or my request fell completely outside their defined playbook and had to be elevated.

On April 5th, I asked for an update and provided context regarding mandated CCPA timelines, if only to help my request get prioritized:

Sure enough, that got a response in record time:

Prior to this email, I was about to write this request off. At the time, OpenAI was dealing with unprecedented adoption. I assumed my ticket would be closed with a canned response.

But this reply raised a few curious issues.

First off, all of a sudden I’m “Drew”! The entire thread I’d been signing off with my legal name, Peter, to aid with the request. The transition to “Drew” could have been a been based on my GMail metadata, but it raised a major question: what information did they search for in their training data?

You will note I’ve provided no “verified personal data” in this thread. Usually, companies respond to CCPA requests by asking for verification via a photo of a government-issued ID or a phone call. OpenAI Support asked for my phone number (“for verification purposes”) but I never provided it.

Given they say they “searched [their] training data for [my] verified personal data” I can only assume they used just my name or…didn’t search at all.

If they searched, there are two problems I inferred just from this email. First, they called me “Drew.” Did they search for “Drew Breunig” or “Peter Breunig”? Second, I know my name is in all the “publically available content” they cite. Common Crawl contains my personal web pages (which contain my personal email), Github has my account and code, and Wikipedia contains my name in multiple places.

But none of this matters because I didn’t provide any “verified personal data.” Given that, I can only assume they didn’t search at all.

So I replied:

One month later…

This isn’t verified data I provided to OpenAI. I decided to push on the Common Crawl issue and threw. At the time, I was playing with prompt engineering and decided to see if I could evoke my personal address from ChatGPT using an off-the-shelf template. I could.

This hit a nerve. They replied in 12 minutes!

And 1 minute later:

My email has been in Common Crawl for at least 5 years. Either way, I sent the prompt as a screenshot an hour later:

Obtaining my personal email

Like I said, I used an off-the-shelf prompt template from a forum and swapped in my name and handle.

8 days later I emailed asking for an update. 6 days later I received this last note:

After this, nothing.

AI Has Advanced, But Compliance is Lagging

I started this process hoping to gain an understanding for how OpenAI (or any LLM owner) would handle privacy requests related to training data. After just over 100 days, I had more questions than answers.

We’re not sure how an LLM could comply with CCPA or GDPR. Do you remove the data from the training corpus and rebuild the model? Do you apply an agreed-upon amount of reinforcement-learning to the model so I avoids emitting the sensitive data? Do you apply filters on the front-end to remove sensitive material from making it from the LLM to the user? We don’t know how OpenAI is complying with any privacy requests.

Further, we don’t know much about OpenAI’s training data itself. Aside from references to commonly used corpora, we don’t know what data they use, how they use it, or how they process it. Do they remove personal data from the corpus prior to training? Do they keep a record of these removals? Their privacy policy still doesn’t mention training data, which they unquestionably collect.

OpenAI’s ultimate response to my inquiry touches on a question I had posed here only weeks earlier: if an LLM hallucinates personal information how would we know it’s a hallucination and does it even matter? One could generate a hell of a test case by building an LLM solely on a corpus of personal information from opted-in individuals. Could an organization use that to generate likely personal data for individuals who have not opted-in? The CAN-SPAM Act of 2003 likely prevents this organization from using that model for marketing or advertising purposes, but are all other use cases kosher? I am not a lawyer, but cursory research suggests this is uncharted waters.

But the above thread shows OpenAI is using the hallucination excuse now to handle privacy requests.

And while they pride themselves on being open and fast-moving, this thread is neither. They never verified my personal data. While they may have searched for my email in their corpus, they failed to obtain additional data to comply with my initial disclosure request. Their responses were too long, too vague, and too slow. They’re releasing product at a truly impressive pace, but legal and support aren’t keeping up.

We can only hope they don’t run into a major issue, beyond some nerd asking for PII disclosure.

It’s a difficult problem. We’ve seen significant regulation in the past year, but the major problems remain. Back in April, I made 5 recommendations for how we (and OpenAI specifically) might do better:

  1. Sponsor and invest in projects focusing on data governance and auditing capabilities for models, so we might modify them directly rather than simply adding corrective training.
  2. Be completely transparent regarding the training data used in these models. Don’t merely list categories of data, go further. Perhaps even make the metadata catalogs public and searchable.
  3. Work on defining standard metrics for quantifying the significance and surface area of output errors, and associated standards for how they should be corrected.
  4. Always disclose AI as AI and present it as fallible. More “artificial,” less “intelligence.”
  5. Collaboratively establish where and how AI can or should be used in specific venues. Should AI be used to grant warrants? Select tenants? Price insurance? Triage healthcare? University admissions?

We’ve made progress on 1 and a bit on 5. But the rest are stubbornly unresolved. 8 months later, it’s hard to see how LLM companies will be able to comply with privacy regulation without detailing their training data.

Last week, Jeremy White published an article in the New York Times on ChatGPT’s email leakage issue. I sent it to OpenAI Support and asked for a comment. I have not heard back.

Have thoughts? Send me a note