23 major news sites are blocking the Wayback Machine

Sebastian Jade

3 months ago

Twenty-three major news organizations have updated their robots.txt files to block the Wayback Machine, the Internet Archive’s 30-year-old digital preservation project. The move means their journalism — breaking news, investigations, columns, corrections — will no longer be captured for the historical record.

The mechanism is a two-line rule: Disallow: / under the ia_archiver user-agent. Small change. Large consequences.

23 news sites are blocking the Wayback Machine

A tracking analysis of major news publishers’ robots.txt files identified 23 outlets that have explicitly blocked ia_archiver, the crawler token used by archive.org to capture pages for the Wayback Machine. The list spans US national outlets, regional papers, and international English-language publications.

The Internet Archive is a San Francisco-based non-profit founded in 1996 by Brewster Kahle. Its Wayback Machine has captured over 850 billion web pages — making it the world’s largest digital library and the primary backup for content that disappears from the live web. When a news outlet shuts down, gets acquired, or quietly deletes a story, the Wayback Machine is often the only place that story still exists.

Blocking it is not illegal. It is a choice.

How the block works — robots.txt explained

robots.txt is a plain text file hosted at the root of any website — example.com/robots.txt — that tells automated crawlers which pages they can and cannot access. It’s a voluntary standard, not enforceable law. Crawlers that respect it — including archive.org, Google, and Bing — read the file before fetching pages.

The relevant entry looks like this:

User-agent: ia_archiver
Disallow: /

That two-line block tells the Wayback Machine’s crawler: do not archive anything on this site. Google’s crawler (Googlebot) is typically unaffected — these outlets still want search traffic. The block is targeted specifically at preservation.
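For readers who want to see the rule in action, here is a minimal sketch using Python’s standard-library robots.txt parser. It assumes the two-line entry above is the entire file, and the article URL is a made-up placeholder. Python’s parser is only one interpretation of the standard, so real crawlers may behave differently in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule quoted above, treated here as the whole robots.txt file.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

# Hypothetical article URL, used only for illustration.
ARTICLE = "https://example.com/news/some-story"


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "Googlebot"):
        verdict = "allowed" if may_fetch(ROBOTS_TXT, agent, ARTICLE) else "blocked"
        print(f"{agent}: {verdict}")
```

Run as a script, this reports ia_archiver as blocked and Googlebot as allowed, which is the targeting described above: search traffic stays, preservation goes.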

Respecting robots.txt is a norm, not a constraint. The Internet Archive, operating under a philosophy of open access, honors these blocks. Unlike some AI companies that ignored crawl restrictions to harvest training data, the Archive stops when told to stop.

Why news sites are doing it

The timing connects to three converging pressures.

The AI training data fight. Starting in 2023, major news outlets — the New York Times, Associated Press, News Corp — began adding broad AI crawler blocks to their robots.txt files. Some of those blocks, written hastily or with catch-all syntax, ended up blocking non-AI crawlers including ia_archiver. In other cases, the block was deliberate.

Paywall bypass. The Wayback Machine archives full page text. A reader who hits a subscription wall can sometimes access an archived version of the same article for free. Publishers losing subscribers are increasingly sensitive to this bypass — even if the archived copy predates the paywall’s introduction.

Legal fallout from Hachette v. Internet Archive. In September 2024, a federal appeals court upheld a ruling against the Internet Archive in a copyright case brought by four major publishers. The case targeted the Archive’s Controlled Digital Lending program, a different service from the Wayback Machine, but the legal cloud over the organization made publishers more comfortable restricting its access across the board.
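The difference between a deliberate block and a catch-all that sweeps in ia_archiver can be sketched in a few lines of Python. Both robots.txt samples below are illustrative and not copied from any real publisher, and the helper names classify and named_agents are hypothetical. The check only probes the site root, so treat it as a rough first pass rather than a full audit.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: allow-list two search crawlers, then shut out
# everyone else with a wildcard group. ia_archiver is never named.
CATCH_ALL = """\
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
"""

# The deliberate version: the crawler is named outright.
EXPLICIT = """\
User-agent: ia_archiver
Disallow: /
"""


def named_agents(robots_txt: str) -> set[str]:
    """Collect every token that appears on a User-agent line, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0]  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def classify(robots_txt: str, agent: str) -> str:
    """Return 'allowed', 'explicit', or 'catch-all' for the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(agent, "https://example.com/"):
        return "allowed"
    # Blocked: was the agent named, or caught by a wildcard group?
    return "explicit" if agent.lower() in named_agents(robots_txt) else "catch-all"
```

Here classify(CATCH_ALL, "ia_archiver") comes back as "catch-all" while classify(EXPLICIT, "ia_archiver") comes back as "explicit". From the outside the two cases look identical to the crawler; only reading the file shows whether the Archive was targeted or simply not exempted.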

What gets lost when journalism isn’t archived

Link rot in online journalism isn’t a theory. It’s a documented, measurable problem.

A Harvard Law School study found that roughly 50% of URLs cited in US Supreme Court opinions no longer lead to the material originally cited. News articles are just as exposed: outlets redesign, retire old CMS platforms, quietly delete embarrassing coverage, or simply fold.

The Wayback Machine is the backstop. When the Rocky Mountain News shut down in 2009, the Archive held its coverage. When local television stations delete footage, the Archive sometimes has a transcript. Pulitzer-winning investigations that lived at one URL for five years and then vanished are recoverable — but only if the Archive captured them.

When a news outlet blocks ia_archiver, it is saying: our journalism does not need to outlast our business model. For accountability journalism in particular — stories about local government, corporate misconduct, public health — that is a meaningful loss.

The Internet Archive’s position

The Archive does not comment on individual publishers’ blocking decisions. Its public position, consistent since the 1990s, is that it archives the public web with the intent of providing universal access to all knowledge.

Internally, the organization has faced sustained legal and financial pressure. The Hachette ruling, combined with a separate lawsuit from major record labels, has forced the Archive to restructure. A fundraising campaign in late 2024 cited existential financial risk. The Wayback Machine itself was not the subject of the copyright suits, but the legal atmosphere has made publishers less willing to give the Archive the benefit of the doubt.

Kahle’s public response to publisher blocking has been consistent: respect the choice, note the loss.

Is there a workaround?

For readers trying to access blocked articles, alternatives exist — though none are as comprehensive as the Wayback Machine.

Archive.today (also reachable as archive.is and archive.ph): an independent archiving service that most publishers have not yet blocked. It captures a single page on demand rather than crawling automatically.
Perma.cc: run by Harvard Law School’s library and used by law reviews and courts to create permanent links. Not comprehensive; someone has to submit each URL.
Google Cache: Google retired its cached-page feature in 2024, so it is no longer a fallback.
Browser extensions: tools like Unpaywall and Open Access Button find legal alternative copies of academic papers; no equivalent exists at scale for news.

The gap that’s opening: automated, comprehensive preservation of the live web requires a large-scale crawler that publishers trust. Right now, that crawler is being blocked.

What this means for the open web

The 23-site count matters less than the precedent. Major outlets set industry norms. When the New York Times or the Washington Post blocks a crawler, smaller regional papers observe and often follow without analysis. If blocking ia_archiver becomes standard practice rather than an exception, the consequence is cumulative: a journalism archive with growing blind spots, decade by decade.

Here’s the irony: several of the outlets that blocked the Wayback Machine have published investigations into AI data hoarding, Big Tech overreach, and the erosion of the open web. They are doing the thing they cover, in a configuration file most readers will never see.

The robots.txt standard runs on trust. AI companies burned that trust by ignoring crawl restrictions to harvest training data. The collateral damage is landing on the Internet Archive — an organization that spent 30 years honoring the exact norms those companies ignored.

FAQ

What is the Wayback Machine?
The Wayback Machine is a digital archive of the World Wide Web run by the Internet Archive, a non-profit based in San Francisco. Founded in 1996, it has captured over 850 billion web pages and allows anyone to view historical versions of websites, including news articles that have since been deleted or moved.

What is ia_archiver?
ia_archiver is the user-agent name used by the Internet Archive’s web crawler when it captures pages for the Wayback Machine. Websites can block this crawler specifically by adding a Disallow: / entry under User-agent: ia_archiver in their robots.txt file.

Which news sites are blocking the Wayback Machine?
A tracking analysis identified 23 major news organizations blocking ia_archiver as of 2026. The specific list varies as publishers update their configurations. Checking a site’s current robots.txt file directly (append /robots.txt to the domain) will show its current crawler policy.
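That manual check can also be scripted. The sketch below is one way to do it with Python’s standard library; the function names are hypothetical, and it only tests the homepage. A caveat: Python’s parser treats a 401 or 403 response for robots.txt itself as “everything disallowed,” so a site that refuses the script’s request will show up as blocking even if its file says otherwise.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def site_root(site: str) -> str:
    """Normalize a bare domain or a full URL to 'scheme://host/'."""
    if "://" not in site:
        site = "https://" + site  # assume HTTPS when no scheme is given
    parts = urlsplit(site)
    return f"{parts.scheme}://{parts.netloc}/"


def robots_url(site: str) -> str:
    """Append robots.txt to the site root, as described above."""
    return site_root(site) + "robots.txt"


def blocks_agent(site: str, agent: str = "ia_archiver") -> bool:
    """Fetch the live robots.txt and report whether `agent` is barred
    from the homepage. Makes a network request."""
    parser = RobotFileParser()
    parser.set_url(robots_url(site))
    parser.read()
    return not parser.can_fetch(agent, site_root(site))


if __name__ == "__main__":
    # example.com is a placeholder; substitute the domain you want to check.
    print(robots_url("example.com"))
```

Calling blocks_agent with a real news domain returns True when the homepage is off limits to ia_archiver under that site’s current file. Results can change from day to day as publishers edit their configurations.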

Why are news organizations blocking the Wayback Machine?
The primary reasons are: (1) broad robots.txt blocks aimed at AI crawlers that accidentally swept in ia_archiver; (2) paywall bypass concerns — archived pages can sometimes be accessed without a subscription; (3) the legal climate following Hachette v. Internet Archive (2024), which made publishers less willing to allow any archival access.

Is blocking the Wayback Machine legal?
Yes. robots.txt is a voluntary standard and publishers can block any crawler, including the Internet Archive’s, without legal consequence. The Internet Archive respects these blocks.

What does this mean for link rot?
Link rot, the tendency for URLs to stop working over time, is already a documented problem. A Harvard Law School study found that roughly 50% of links in US Supreme Court opinions no longer lead to the material originally cited. When news outlets block the Wayback Machine, there is no automated backup when their pages disappear. Accountability journalism becomes harder to verify and cite over time.

Is the Wayback Machine being shut down?
No. The Wayback Machine continues to operate. It is not the subject of the copyright lawsuits against the Internet Archive (those targeted the Controlled Digital Lending program). However, the legal and financial pressure on the organization is ongoing.

Are there alternatives to the Wayback Machine for archived news?
Archive.today allows on-demand archiving of individual pages and has not been widely blocked. Perma.cc serves legal and academic citation needs. Google Cache, which served as an informal backup, was retired in 2024. None offers the comprehensive, automated coverage of the Wayback Machine.

This article is not sponsored by the Internet Archive or any related organization. TechDaily360 has no financial relationship with archive.org.
