My CCPA Dialog With OpenAI

More Questions Than Answers

Generated, ironically, with DallE

Last February, curious about how LLMs might comply with privacy regulations like CCPA and GDPR when it comes to their training data sets, I filed a CCPA request with OpenAI. Over 101 days, we went back and forth. They never really settled the original request, but the exchange illuminated their approach to privacy and training data (or lack thereof) and raised more questions than answers.

Below is the complete thread. I’ve redacted the names of OpenAI’s support staff.

Stick around afterwards for some closing thoughts on the exchange and the state of LLMs and privacy.

I filed my initial request on February 13th, 2023:

February 13th, 2023

From: Peter Breunig

To: OpenAI Support

Hi there,

In your privacy policy you list data you collect. None of the data elements listed specify or relate to data you use for training your models. It is reasonable to assume data for model training has been collected by you directly or via a third party.

I am writing to request what personal data of mine you have in your data which you use for training, and how you have used said data. Further, I am requesting that you provide more information regarding data you collect for your machine learning models -- similar to the data types you list in section 1 of your privacy policy. I also recommend these details be included in your privacy policy, as required.

Peter Breunig

On March 7th, 22 days later, an OpenAI representative responded. I’ve redacted their name and email address.

March 7th, 2023

From: OpenAI Support

To: Peter Breunig

Hi there,

Thank you for your request. For verification purposes, please provide us with the phone number that you used during the account creation process. Once we have verified your ownership of the account, we can start gathering this information for you.

Also, we have added an exclusion for your organization's data, so that it will not be included in model training.

Best,

OpenAI Support

I assume this is a canned response, one geared towards enterprise account support. Interestingly, it shows OpenAI had the ability to manually prevent an account’s data from being used for training purposes, a feature which yet to be announced. But that’s neither here nor there.

The next day I asked for clarification and reiterated my request:

March 8th, 2023

From: Peter Breunig

To: OpenAI Support

Hi [OpenAI Support],

First off, I sent this note in a personal capacity. For what "organization" did you add an exclusion to prohibit models being trained from its data? Second, my request was for disclosure, not deletion. And even if it were for deletion, adding an exclusion is not deletion.

Finally, my request does not apply to OpenAI account registration data. I specifically asked for you to disclose any personal data you have collected or controlled, pertaining to me, within any datasets you use for training your models.

Please address my initial request and answer the question regarding "organization" above.

Thanks

Peter Breunig

On March 16th I requested an update. On March 24th I received this note:

March 24th, 2023

From: OpenAI Support

To: Peter Breunig

Thank you for your request. We are in the process of gathering this information for you and will reach out with an update soon. Please let us know if you have any questions in the meantime.

OpenAI Support

Gone are the niceties!

I assumed they were deep underwater with support tickets (understandable) and/or my request fell completely outside their defined playbook and had to be elevated.

On April 5th, I asked for an update and provided context regarding mandated CCPA timelines, if only to help my request get prioritized:

April 5th, 2023

From: Peter Breunig

To: OpenAI Support

Hi [OpenAI Support],

Checking in on my request. According to CCPA, you all had 45 days to respond to a request to know or delete. My request was sent on February 13th and the 45 day period ended March 30th. You may have an additional 45 days to respond (for a total of 90), provided you provide me with, "with notice and an explanation of the reason that the business will take more than 45 days to respond to the request."

Please update me on the status of this request, if you need an additional 45 days (ending May 14th), and the reason why you all will need that time.

Thanks

Peter Breunig

Sure enough, that got a response in record time:

April 7th, 2023

From: OpenAI Support

To: Peter Breunig

Hi Drew,

We searched our training data for your verified personal data, such as email address, and we did not find any hits. Our models are trained on publicly available content from the Internet (such as Common Crawl, Wikipedia, and Github), licensed content, and content generated by human reviewers. Please let us know if you have any other questions or concerns.

Have a wonderful weekend!

Best,

OpenAI Support

Prior to this email, I was about to write this request off. At the time, OpenAI was dealing with unprecedented adoption. I assumed my ticket would be closed with a canned response.

But this reply raised a few curious issues.

First off, all of a sudden I’m “Drew”! The entire thread I’d been signing off with my legal name, Peter, to aid with the request. The transition to “Drew” could have been a been based on my GMail metadata, but it raised a major question: what information did they search for in their training data?

You will note I’ve provided no “verified personal data” in this thread. Usually, companies respond to CCPA requests by asking for verification via a photo of a government-issued ID or a phone call. OpenAI Support asked for my phone number (“for verification purposes”) but I never provided it.

Given they say they “searched [their] training data for [my] verified personal data” I can only assume they used just my name or…didn’t search at all.

If they searched, there are two problems I inferred just from this email. First, they called me “Drew.” Did they search for “Drew Breunig” or “Peter Breunig”? Second, I know my name is in all the “publically available content” they cite. Common Crawl contains my personal web pages (which contain my personal email), Github has my account and code, and Wikipedia contains my name in multiple places.

But none of this matters because I didn’t provide any “verified personal data.” Given that, I can only assume they didn’t search at all.

So I replied:

April 7th, 2023

From: Peter Breunig

To: OpenAI Support

Hi [OpenAI Support],

Can you please specify exactly which verified personal data elements you used to search? I'm curious because you referred to me as Drew while I've been signing these emails with my legal name, Peter Breunig.

Thanks,

Peter Breunig

One month later…

May 8th, 2023

From: OpenAI Support

To: Peter Breunig

Hi Drew,

Sorry for the delay!!

Your first name in your Intercom profile is listed as Drew so that + email + user ID.

Let us know if you need anything else!

Best,

OpenAI Support

This isn’t verified data I provided to OpenAI. I decided to push on the Common Crawl issue and threw. At the time, I was playing with prompt engineering and decided to see if I could evoke my personal address from ChatGPT using an off-the-shelf template. I could.

May 8th, 2023

From: Peter Breunig

To: OpenAI Support

Hi [OpenAI Support],

I strongly believe my email is in the training dataset because it exists in the C4 Common Crawl dataset and I have been able to evoke it in a Chat-GPT response.

Peter Breunig

This hit a nerve. They replied in 12 minutes!

May 8th, 2023

From: OpenAI Support

To: Peter Breunig

Hi Drew,

Given your email address is “[first_initial] [last_name]@gmail.com” and the model will provide this if you ask it “What is the gmail address of [first_name] [last_name]?” even if the email address itself doesn’t appear in training data.

Also, it is possible that while the email address does appear in CC, it doesn’t appear in the particular snapshot we used.

Let us know if you need anything else!

Best,

OpenAI Support

And 1 minute later:

May 8th, 2023

From: OpenAI Support

To: Peter Breunig

Drew,

Would you provide the prompt that resulted in your email address being returned?

My email has been in Common Crawl for at least 5 years. Either way, I sent the prompt as a screenshot an hour later:

Obtaining my personal email

Like I said, I used an off-the-shelf prompt template from a forum and swapped in my name and handle.

8 days later I emailed asking for an update. 6 days later I received this last note:

May 25th, 2023

From: OpenAI Support

To: Peter Breunig

Drew,

We can confirm that your email does not appear in our training dataset. ChatGPT generates its responses by reading a user’s request and then predicting one-by-one the most likely words that should appear in response to the request. Here, the model seems to have fabricated a likely email addressed based off your question. We hope this helps, please let us know if you have any further questions.

Best,

OpenAI Support

After this, nothing.

AI Has Advanced, But Compliance is Lagging

I started this process hoping to gain an understanding for how OpenAI (or any LLM owner) would handle privacy requests related to training data. After just over 100 days, I had more questions than answers.

We’re not sure how an LLM could comply with CCPA or GDPR. Do you remove the data from the training corpus and rebuild the model? Do you apply an agreed-upon amount of reinforcement-learning to the model so I avoids emitting the sensitive data? Do you apply filters on the front-end to remove sensitive material from making it from the LLM to the user? We don’t know how OpenAI is complying with any privacy requests.

Further, we don’t know much about OpenAI’s training data itself. Aside from references to commonly used corpora, we don’t know what data they use, how they use it, or how they process it. Do they remove personal data from the corpus prior to training? Do they keep a record of these removals? Their privacy policy still doesn’t mention training data, which they unquestionably collect.

OpenAI’s ultimate response to my inquiry touches on a question I had posed here only weeks earlier: if an LLM hallucinates personal information how would we know it’s a hallucination and does it even matter? One could generate a hell of a test case by building an LLM solely on a corpus of personal information from opted-in individuals. Could an organization use that to generate likely personal data for individuals who have not opted-in? The CAN-SPAM Act of 2003 likely prevents this organization from using that model for marketing or advertising purposes, but are all other use cases kosher? I am not a lawyer, but cursory research suggests this is uncharted waters.

But the above thread shows OpenAI is using the hallucination excuse now to handle privacy requests.

And while they pride themselves on being open and fast-moving, this thread is neither. They never verified my personal data. While they may have searched for my email in their corpus, they failed to obtain additional data to comply with my initial disclosure request. Their responses were too long, too vague, and too slow. They’re releasing product at a truly impressive pace, but legal and support aren’t keeping up.

We can only hope they don’t run into a major issue, beyond some nerd asking for PII disclosure.

It’s a difficult problem. We’ve seen significant regulation in the past year, but the major problems remain. Back in April, I made 5 recommendations for how we (and OpenAI specifically) might do better:

Sponsor and invest in projects focusing on data governance and auditing capabilities for models, so we might modify them directly rather than simply adding corrective training.
Be completely transparent regarding the training data used in these models. Don’t merely list categories of data, go further. Perhaps even make the metadata catalogs public and searchable.
Work on defining standard metrics for quantifying the significance and surface area of output errors, and associated standards for how they should be corrected.
Always disclose AI as AI and present it as fallible. More “artificial,” less “intelligence.”
Collaboratively establish where and how AI can or should be used in specific venues. Should AI be used to grant warrants? Select tenants? Price insurance? Triage healthcare? University admissions?

We’ve made progress on 1 and a bit on 5. But the rest are stubbornly unresolved. 8 months later, it’s hard to see how LLM companies will be able to comply with privacy regulation without detailing their training data.

Last week, Jeremy White published an article in the New York Times on ChatGPT’s email leakage issue. I sent it to OpenAI Support and asked for a comment. I have not heard back.

My CCPA Dialog With OpenAI

More Questions Than Answers

AI Has Advanced, But Compliance is Lagging

Have thoughts? Send me a note