Tip Sheets

Misalignment between Reddit, users over data use could be ‘catastrophic’

Media Contact

Becka Bowyer

Reddit has reportedly struck a deal to provide its content for the purpose of training Google’s artificial intelligence models. Charging for data access was a flash point for protests last year.


Sarah Gilbert

Research associate and research director of the Citizens and Technology Lab

Sarah Gilbert, research associate at Cornell University and research director of the Citizens and Technology Lab, is an expert on content moderation and data ethics. She studies Reddit as well as the labor of community moderators and how Reddit users think about re-use of their content.

Gilbert says:

“Reddit content makes good training data. For one, there’s a lot of it. Unlike platforms such as X (formerly Twitter), Reddit’s character limits are high. Its threading system supports in-depth and lengthy conversations between users. Reddit data also comes pre-organized. The site is structured into different topic-based communities, which makes it easier for companies to use its data to train AI that supports specific purposes. However, Reddit’s primary data asset is its moderation. It’s the work of volunteers that makes Reddit data trustworthy and reduces the amount of spam, hate and other harmful content rife in training data from other platforms.

“It makes sense that Google would want to buy Reddit data, particularly given the symbiotic relationship between the two companies. However, what is less clear is how users will feel about their data being used to train generative AI. Prior research by my colleagues and me has found that Reddit users highly value privacy, which makes sense, given that most people contribute pseudonymously. We also found that how they feel about data use varies by context. For example, we found that users would be uncomfortable with use of private data, such as direct messages.

“However, the terms of the deal with Google are unclear, and Reddit has yet to publish a public-facing data use policy that outlines what data is being sold and how it can be used. A misalignment between the expectations of users and how Reddit allows their data to be used could be catastrophic for Reddit. It could impact willingness to contribute to the site or even prompt users to engage in vandalism as a form of protest. To avoid repeating the mishaps of Facebook and to maintain the value of its data, Reddit should develop a clear policy in conjunction with its users and be open about what data is being used and how.” 

Cornell University has television, ISDN and dedicated Skype/Google+ Hangout studios available for media interviews.