Major publishers are increasingly blocking the Internet Archive’s Wayback Machine and related APIs to curb AI scraping and model training on copyrighted news. But critics argue the move won’t meaningfully stop AI—content can be gathered elsewhere—while it will degrade a vital public record used by journalists, researchers, and courts to verify what was published and when. Partial blocks (e.g., leaving homepages but filtering articles or paywalled pages) are making the web “increasingly unarchivable,” with nonprofit preservation efforts and datasets like donated crawls becoming collateral damage. The fight spotlights growing tension between copyright enforcement, AI data sourcing, and digital preservation.
The Electronic Frontier Foundation warns that major publishers, most notably The New York Times and reportedly The Guardian, are blocking the Internet Archive's crawlers and threatening the Wayback Machine's role in preserving web history. Publishers say they're preventing scraping to stop AI companies from training models on copyrighted news content and to control downstream uses; some have sued AI firms over alleged infringement. The EFF argues that nonprofit archives aren't commercial AI builders and that cutting off archival crawls erases a crucial historical and journalistic record, undermining transparency, accountability, and research. The dispute highlights tensions between copyright enforcement, AI training practices, and the public interest in preserving digital news.
The Internet Archive’s Wayback Machine—home to over one trillion archived web pages used by journalists, researchers, and courts—is losing access to major publishers as The New York Times and others block its crawlers to prevent AI companies from scraping news content. The moves, prompted by publishers’ concerns and lawsuits over AI training on copyrighted material, risk erasing the web’s historical record because archived pages often preserve original versions of articles that are later edited or removed. The piece argues that archiving and searchable indexes are legally protected as fair use, and that blocking nonprofit preservation efforts to curb AI access would sacrifice decades of public documentation for a dispute that should be settled in court. This matters because it threatens research, journalism, and legal accountability tied to persistent web archives.
Blocking Internet Archive Won't Stop AI, but Will Erase Web's Historical Record (eff.org)
The Internet Archive’s Wayback Machine—home to over a trillion archived web pages used by journalists, researchers, and courts—is losing access to major publishers after The New York Times and others began blocking its crawlers to prevent AI scraping. The move, driven by publisher concerns about AI models being trained on copyrighted news, risks erasing a decades‑long historical record of how stories originally appeared online. The article argues that archiving and searchable copying have established fair‑use precedent (citing past cases like Google Books) and that nonprofit preservation serves a transformative, public‑interest purpose distinct from commercial AI training. It warns that cutting off the Archive to control AI access would harm future research and the public record.
The Internet Archive is facing legal and access challenges as some publishers and rightsholders seek to block its Wayback Machine, arguing archived content fuels AI training and copyright infringement. The Archive’s defenders, including historians, librarians, and web preservation advocates, warn that blocking it won’t stop AI development—models will still be trained on other web copies—but will erase the historical record of web content, harming research, journalism, and cultural memory. The dispute highlights tensions between copyright enforcement, AI data sourcing, and public-interest preservation. Key players include the Internet Archive, publishers/rightsholders, and the AI industry, and the outcome could shape web archiving practices, legal precedents, and access to digital history.
The article emphasizes the critical role of the Wayback Machine in preserving web history, highlighting its utility in recovering lost website versions, verifying publication dates, and combating disinformation. The author argues against blocking web archiving, likening it to destroying historical artifacts. This perspective underscores the importance of maintaining access to digital archives, especially as companies and publications frequently change ownership or disappear. The discussion raises awareness about the potential loss of valuable online information and the implications for transparency and accountability in the digital age.
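To make the verification use case above concrete, here is a minimal sketch, assuming Python and the Wayback Machine's public availability endpoint (archive.org/wayback/available), of how one might locate the archived snapshot closest to a given date; the example URL and date are placeholders:

```python
import json
import urllib.parse
import urllib.request

def closest_snapshot(url: str, timestamp: str) -> dict | None:
    """Ask the Wayback Machine availability API for the capture closest
    to `timestamp` (YYYYMMDDhhmmss); returns None if never archived."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    with urllib.request.urlopen(
        f"https://archive.org/wayback/available?{query}"
    ) as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest")

# Placeholder example: what did example.com look like around 1 Jan 2010?
snap = closest_snapshot("example.com", "20100101")
if snap and snap.get("available"):
    print(snap["timestamp"], snap["url"])
```

The returned timestamp records when the capture was made, which is what makes verifying publication dates against archived copies possible.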
The Internet Archive’s Wayback Machine lists an archived page titled “How to fold the Blade Runner origami unicorn (1996),” but the provided content contains only capture metadata rather than the article itself. According to the Wayback entry, the URL (linkclub.or.jp/~null/index_br.html) has 47 recorded captures spanning from 04 Nov 2001 to 17 Feb 2026. The capture shown is timestamped 2001-11-04 01:59:33. The page was collected as part of “Alexa Crawls,” reflecting Alexa Internet’s donation of crawl data to the Internet Archive beginning in 1996, and is associated with the “Alexa Crawl DH” collection, which is noted as not publicly accessible. With no instructional text included, details of the origami unicorn folding guide cannot be summarized from the supplied material.
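Capture metadata of the kind quoted above (timestamps and capture counts) can be enumerated through the Wayback Machine's public CDX search endpoint. A sketch, assuming Python; with output=json the API returns a header row naming the fields, followed by one row per capture:

```python
import json
import urllib.parse
import urllib.request

def list_captures(url: str, limit: int = 10) -> list[dict]:
    """Return up to `limit` capture records for a URL from the
    Wayback Machine CDX API as a list of {field: value} dicts."""
    query = urllib.parse.urlencode({"url": url, "output": "json", "limit": limit})
    with urllib.request.urlopen(
        f"https://web.archive.org/cdx/search/cdx?{query}"
    ) as resp:
        body = resp.read()
    rows = json.loads(body) if body.strip() else []
    if not rows:
        return []
    header, *captures = rows  # first row names the fields
    return [dict(zip(header, row)) for row in captures]

# The URL from the Wayback entry discussed above.
for cap in list_captures("linkclub.or.jp/~null/index_br.html"):
    print(cap["timestamp"], cap["statuscode"], cap["original"])
```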
Ways of Working with the Wayback Machine
News publishers are tightening access to the Internet Archive amid fears that AI companies can use its repositories as a backdoor for scraping. In a Jan. 28, 2026 Nieman Journalism Lab report, The Guardian said access logs showed the Internet Archive was a frequent crawler, prompting it to exclude itself from the Archive's APIs and filter article pages from the Wayback Machine's URLs interface while leaving homepages and landing pages available. The Financial Times similarly blocks bots attempting to scrape paywalled content, including those from OpenAI, Anthropic, Perplexity, and the Internet Archive, meaning that, for the most part, only unpaywalled FT stories appear in the Wayback Machine. Researchers warn that "good" archiving projects like the Internet Archive and Common Crawl are becoming collateral damage, potentially making the web harder to preserve.
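Crawler exclusions like those described above are commonly expressed in a site's robots.txt, though robots.txt is advisory and paywall blocking is usually also enforced server-side. A minimal sketch, assuming Python's standard urllib.robotparser and a set of illustrative user-agent tokens (GPTBot, ClaudeBot, PerplexityBot, plus ia_archiver, the token historically associated with Internet Archive/Alexa crawling); real token names and per-publisher policies vary:

```python
import urllib.robotparser

# Illustrative crawler tokens; each publisher's robots.txt differs,
# and these names are assumptions, not a definitive registry.
BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "ia_archiver"]

def crawler_access(site: str, path: str = "/") -> dict[str, bool]:
    """Fetch and parse https://<site>/robots.txt, then report whether
    each crawler token is allowed to fetch the given path."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"https://{site}/robots.txt")
    rp.read()
    return {bot: rp.can_fetch(bot, f"https://{site}{path}") for bot in BOTS}

# Placeholder domain; substitute a publisher's hostname to inspect its policy.
print(crawler_access("example.com"))
```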