Reddit has blocked the Internet Archive from indexing most of its content after discovering AI companies were circumventing Reddit’s scraping restrictions by harvesting data from archived pages instead. The move effectively eliminates a key resource for researchers and users who relied on archived Reddit content to track deleted posts and preserve community discussions.
What you should know: The Internet Archive’s Wayback Machine can now only capture snapshots of Reddit’s homepage, not individual threads, profiles, or comments.
- Previously, the Wayback Machine served as a comprehensive backup of Reddit content, documenting everything from deleted posts to user activity across various subreddits.
- Moving forward, the archive will only provide daily snapshots of popular posts and news headlines, severely limiting its utility for research and content preservation.
The big picture: Reddit is leveraging privacy concerns and AI scraping violations to justify restrictions that could drive more lucrative licensing deals with AI companies.
- Tim Rathschmidt, a Reddit spokesperson, cited “instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine” as justification for the block.
- The company also raised longstanding privacy concerns about the Wayback Machine archiving content that users have deleted.
Why this matters: The restriction removes a crucial tool for internet research and content preservation at a time when digital platforms increasingly control access to public discourse.
- Redditors have historically used the Wayback Machine to research deleted comments and preserve content during platform changes, such as the 2023 API modifications that threatened beloved subreddits.
- The move comes as Reddit expects to generate more than $200 million over three years from AI licensing deals, following agreements with OpenAI and a reported $60 million-per-year deal with Google.
What they’re saying: Reddit suggests the Internet Archive could take steps to address the AI scraping problem and potentially restore access.
- “Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors,” Rathschmidt said.
- Mark Graham, director of the Wayback Machine, confirmed that the Internet Archive has “a longstanding relationship with Reddit” and is having “ongoing discussions about this matter.”
The broader context: This is another example of how AI training data disputes are reshaping internet archiving and research capabilities.
- Other tools exist for surfacing deleted Reddit posts, and some users noted the Wayback Machine wasn’t necessarily the easiest platform for that purpose.
- The Internet Archive has not indicated whether it’s exploring technical fixes to address Reddit’s concerns and restore full archiving capabilities.
Reddit blocks Internet Archive to end sneaky AI scraping