Former Cloudflare exec launches archive of pre-AI human content in time capsule-style move

Former Cloudflare executive John Graham-Cumming has launched lowbackgroundsteel.ai, a catalog of human-generated content created before widespread AI contamination took hold in 2022. The archive takes its name from “low-background steel,” metal that scientists once salvaged from pre-nuclear-era shipwrecks to avoid radiation contamination, and draws a parallel between nuclear fallout and AI-generated content polluting the internet.

The big picture: The project treats pre-AI content as a precious commodity, recognizing that distinguishing between human and machine-generated material has become increasingly difficult since ChatGPT’s November 2022 launch.

Why this matters: AI contamination has already forced at least one major research project to shut down entirely—wordfreq, a Python library that tracked word frequency across 40+ languages, announced in September 2024 it would stop updating because “the Web at large is full of slop generated by large language models, written by no one to communicate nothing.”

What’s included: The archive points to several major repositories of verified pre-AI content that researchers and developers can trust.

  • A Wikipedia dump from August 2022, captured before ChatGPT’s release.
  • Project Gutenberg’s collection of public domain books.
  • The Library of Congress photo archive.
  • GitHub’s Arctic Code Vault, a snapshot of open source code archived in February 2020 in a decommissioned coal mine on the Norwegian Arctic archipelago of Svalbard.
  • The now-frozen wordfreq project, preserved from before AI contamination made its methodology untenable.

Model collapse concerns: Some researchers worry about AI models training on their own outputs, potentially degrading quality over time, though recent evidence suggests this fear may be overblown under certain conditions.

  • Research by Gerstgrasser et al. (2024) indicates model collapse can be avoided when synthetic data accumulates alongside real data rather than replacing it entirely.
  • Properly curated synthetic data can actually assist with training newer, more capable models when combined with real data.
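The accumulate-versus-replace distinction in that research can be illustrated with a toy simulation (a hedged sketch for intuition, not the paper’s actual method): here a “model” is just a fitted mean and standard deviation, retrained each generation either on its own latest outputs alone or on a growing pool that still contains the original real data.

```python
import random
import statistics

def fit(data):
    # "Train" a toy model: estimate the mean and standard deviation.
    return statistics.fmean(data), statistics.stdev(data)

def sample(model, n, rng):
    # Generate n synthetic points from the fitted model.
    mu, sigma = model
    return [rng.gauss(mu, sigma) for _ in range(n)]

def simulate(generations=200, n=10, accumulate=False, seed=0):
    rng = random.Random(seed)
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]  # the "human" data
    pool = list(real)
    model = fit(pool)
    for _ in range(generations):
        synthetic = sample(model, n, rng)
        if accumulate:
            pool.extend(synthetic)   # synthetic data joins real data
            model = fit(pool)
        else:
            model = fit(synthetic)   # train only on the latest outputs
    return model

_, sigma_replace = simulate(accumulate=False)
_, sigma_accum = simulate(accumulate=True)
```

In the accumulate regime the fitted spread stays anchored near the real data’s, while in the replace regime it shrinks toward zero over generations, which is the degradation the model-collapse literature describes.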

The backstory: Graham-Cumming created the website in March 2023 but only recently announced it publicly, having kept it as a quiet clearinghouse for uncontaminated online resources.

  • He’s known for creating the POPFile spam-filtering software and for his successful 2009 petition asking the UK government to apologize for its persecution of codebreaker Alan Turing.
  • The site accepts new submissions through its Tumblr page.

Looking ahead: Graham-Cumming emphasizes the project documents human creativity rather than opposing AI itself, similar to how low-background steel eventually became unnecessary as atmospheric nuclear testing ended and radiation levels normalized.

