Closed model vendors won’t tell you what’s in their training data

Simon Willison:

One of the most common concerns I see about large language models regards their training data. People are worried that anything they say to ChatGPT could be memorized by it and spat out to other users. People are concerned that anything they store in a private repository on GitHub might be used as training data for future versions of Copilot.

When someone asked Google Bard how it was trained back in March, it told them its training data included Gmail! This turned out to be a complete fabrication—a hallucination by the model itself—and Google issued firm denials, but it’s easy to see why that freaked people out.

I’ve been wanting to write something reassuring about this issue for a while now. The problem is… I can’t do it. I don’t have the information I need to credibly declare these concerns unfounded, and the more I look into this the murkier it seems to get.

The fundamental issue here is one of transparency. The builders of the big closed models—GPT-3, GPT-4, Google’s PaLM and PaLM 2, Anthropic’s Claude—refuse to tell us what’s in their training data.

Given this lack of transparency, there’s no way to confidently state that private data that is passed to them isn’t being used to further train future versions of these models.

I’ve spent a lot of time digging around in openly available training sets. I built an early tool for searching the training set for Stable Diffusion. I can tell you exactly what has gone into the RedPajama training set that’s being used for an increasing number of recent openly licensed language models.
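That kind of inspection is straightforward precisely because the data is published. As a minimal sketch, assuming the `togethercomputer/RedPajama-Data-1T` dataset on Hugging Face and the `datasets` library (the `arxiv` slice and the `text`/`meta` field names here follow the dataset card; treat them as illustrative):

```python
# A rough sketch of poking at an openly published training set by streaming
# a few records from RedPajama on Hugging Face, rather than downloading
# the full corpus to disk.
from datasets import load_dataset

# Recent versions of the `datasets` library require trust_remote_code=True
# for script-based datasets like this one.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",                # one of the published slices; name is illustrative
    split="train",
    streaming=True,
    trust_remote_code=True,
)

# Preview the first few documents and their source metadata.
for i, record in enumerate(ds):
    print(record["meta"])
    print(record["text"][:200])
    if i == 4:
        break
```

No equivalent exists for the closed models: there is nothing to stream, search, or audit.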

But for those closed models? Barring loose, high-level details that are revealed piecemeal in blog posts and papers, I have no idea what’s in them.