More than 340 newspapers block the Wayback Machine. What that means for the future of the internet


The internet is forever, people warn. But leading newspaper companies could change that in a burgeoning war against artificial intelligence chatbots.

More than 340 national and local news publishers, from The New York Times to The Idaho Statesman, have blocked the Wayback Machine from archiving their stories due to concerns their proprietary content could get gobbled up by Big Tech companies to train AI tools like ChatGPT. 

That’s according to a new analysis by the Nieman Journalism Lab at Harvard University, which first reported that leading publishers across the U.S. have cut off access to the digital tool designed to preserve a historic record of the internet in the public interest. 

There’s no evidence tech companies are training AI models with archives from the Wayback Machine, which the nonprofit Internet Archive launched 30 years ago, and has since cataloged more than a trillion web pages. But the digital library is routinely used by historians, researchers and journalists — some of whom are now petitioning their employers to allow the Wayback Machine to preserve their work. 

“The Internet Archive is a national treasure,” Rachel Maddow, the MS Now host and progressive commentator, wrote in an online petition that has been signed by more than 200 journalists nationwide. “I use it daily, and have for many, many years. I cannot imagine doing the work I do without it.”

How does the Wayback Machine work — and why are publishers pushing back? 


Since its launch in 1996, the Wayback Machine has archived more than 1 trillion web pages in a bid to preserve the internet.

In 1996, the Internet Archive launched its first web crawlers to scrape content from a wide range of websites. The idea was that the content could soon change — or disappear entirely — and would be forever lost to history. 

To function, the tool deploys automated web crawlers that upload snapshots of web pages into an online repository accessible to anyone. Internet users, including journalists and historians, often upload web pages themselves that the tool’s bots hadn’t yet captured.

It turns out the internet is not, in fact, forever, even with the Wayback Machine.

In 2024, a Pew Research Center study found that 38% of web pages that appeared on the internet in 2013 were no longer online a decade later. 

Yet, national newspaper publishers have increasingly viewed the digital library as an existential threat. In January, Nieman Lab reported that newspaper publishers like The Guardian and The New York Times had taken steps to exclude their content from the Wayback Machine. 

“We believe in the value of The New York Times’s human-led journalism” and took action to ensure its website is “being accessed and used lawfully,” a spokesperson for the newspaper told Nieman Lab. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”

At the Times and many other publications, that content is still available — but only to paying subscribers. 

But by this month, the number of U.S. news websites throttling access to the Wayback Machine had grown to more than 340. Most are local outlets controlled by just five conglomerates: USA Today Co., McClatchy, Advance Local, MediaNews Group and Tribune Publishing. 

Eighty percent are local newspaper websites owned by USA Today Co.
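For readers curious what "blocking a bot" can look like in practice: one common mechanism on the web is a robots.txt file, which lists the crawlers a site asks to stay away. The article does not say which mechanism each publisher uses, so the following is only a sketch under that assumption. The sample file, the example URL and the user-agent token "archive.org_bot" are all illustrative assumptions, not details taken from any publisher's site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site (not from a real publisher).
# "archive.org_bot" is assumed here to be the archive crawler's user-agent token.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules disallow user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://news.example/2024/story.html"  # placeholder URL
    print("archive crawler blocked:", is_blocked(SAMPLE_ROBOTS_TXT, "archive.org_bot", story))
    print("other crawler blocked:", is_blocked(SAMPLE_ROBOTS_TXT, "SomeOtherBot", story))
```

A rule like this is a request rather than a lock: it only works if the crawler chooses to honor it, and it says nothing about whether the same pages sit behind a paywall.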

The Internet Archive also faced pushback in 2024, when four major publishing houses (Hachette, HarperCollins, Penguin Random House and Wiley) successfully sued under federal copyright law to block the nonprofit from continuing a project to scan, digitize and lend library books.

Journalists rally to preserve the internet’s library

In an op-ed in February, the Wayback Machine’s director, Mark Graham, wrote that publishers’ concerns were “understandable, but unfounded.”

“The Wayback Machine is not intended to be a backdoor for large-scale commercial scraping and, like others on the web today, we expend significant time and effort working to prevent such abuse,” Graham wrote on the blog Techdirt. “Whatever legitimate concerns people may have about generative AI, libraries are not the problem, and blocking access to web archives is not the solution; doing so risks serious harm to the public record.”

In turn, journalists at news outlets across the country have become the Wayback Machine’s most vocal champions. 

In the petition launched by Fight for the Future, a nonprofit digital rights group, Wayback Machine supporters said the tool is an “essential service for journalism.” 

“Without preference or bias, the Internet Archive preserves the historical record of our times,” according to the petition. “Journalists rely on the Archive as a resource in our reporting, and many digital investigations into issues like misinformation or censorship are possible only because it preserves material that would otherwise disappear.”



Ella Rae Greene, Editor In Chief
