
Wikipedia Fights AI Scraper Surge as Costs Skyrocket | Image Source: www.theregister.com
NEW YORK, USA, April 2, 2025 – Wikipedia, one of the most beloved and reliable knowledge repositories on the Internet, now faces an unexpected existential threat – not disinformation or a shrinking volunteer base, but the very technology built on its success: artificial intelligence. As AI companies scrape Wikipedia’s vast catalog to train their data-hungry models, the non-profit Wikimedia Foundation is buckling under the operating costs.
According to the Wikimedia Foundation, bandwidth usage has risen 50% since January 2024, driven mainly by automated bots downloading multimedia content. While casual observers might assume this traffic comes from Wikipedia’s millions of daily readers, the truth is more complicated – and more troubling. These are not human researchers; they are web scrapers acting on behalf of AI developers eager to feast on openly licensed data to feed commercial machine-learning models.
Why are AI bots targeting Wikipedia?
Wikipedia’s open nature makes it an irresistible resource for training AI models. With over 60 million articles available in nearly 300 languages, it is essentially a clean, labelled, well-organized dataset covering a vast range of human knowledge. Just as importantly, Wikimedia Commons – the repository of images and multimedia linked to Wikipedia – offers openly licensed media files that are prime targets for training AI vision models.
According to Wikimedia engineers Birgit Mueller, Chris Danis and Giuseppe Lavagetto, the problem lies not only in volume but in behaviour. These bots do not mimic real user patterns. They sweep through obscure, rarely accessed pages, bypassing the regional caching data centres and hitting the core infrastructure directly. This not only increases network traffic but demands more computing power, which significantly inflates operating costs.
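Why cache bypass is so costly can be shown with a toy model (the page names and cache size below are made up for illustration): a small LRU cache absorbs concentrated, human-style traffic almost entirely, while a crawler sweeping the long tail misses on every request and passes straight through to the origin servers.

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: evicts the least recently used page when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, page):
        if page in self.store:
            self.store.move_to_end(page)        # mark as recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # evict the oldest entry
            self.store[page] = True

# Human-like traffic: 1,000 requests concentrated on 50 popular pages.
cache = LRUCache(capacity=100)
for i in range(1000):
    cache.get(f"popular-{i % 50}")
human_hit_rate = cache.hits / (cache.hits + cache.misses)

# Crawler-like traffic: one sweep over 10,000 distinct long-tail pages.
crawler = LRUCache(capacity=100)
for i in range(10000):
    crawler.get(f"obscure-{i}")
crawler_hit_rate = crawler.hits / (crawler.hits + crawler.misses)

print(f"human-like hit rate:   {human_hit_rate:.0%}")    # 95%
print(f"crawler-like hit rate: {crawler_hit_rate:.0%}")  # 0%
```

In this sketch the human-style workload is served from cache 95% of the time, while the crawl-style workload never hits the cache at all – every one of those requests lands on the backend.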
What is the real cost to Wikipedia?
The burden is not only technical; it is financial. The Wikimedia Foundation, which operates on a lean, donation-funded budget, must now absorb traffic patterns it never planned for. The Foundation notes that while bots account for only 35 percent of page views, they generate 65 percent of the bandwidth for its most expensive content to serve. That asymmetry shows how much more resource-intensive AI crawlers are than human readers.
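Those two figures imply a stark per-request gap. A back-of-the-envelope calculation, using only the percentages above, shows that an average bot request costs roughly 3.4 times the bandwidth of an average human one:

```python
# Figures from the Wikimedia Foundation: bots are 35% of page views
# but 65% of bandwidth; humans are the remaining 65% and 35%.
bot_views, bot_bandwidth = 0.35, 0.65
human_views, human_bandwidth = 0.65, 0.35

# Bandwidth consumed per unit of traffic, for each population.
bot_cost_per_request = bot_bandwidth / bot_views        # ≈ 1.86
human_cost_per_request = human_bandwidth / human_views  # ≈ 0.54

ratio = bot_cost_per_request / human_cost_per_request
print(f"a bot request is ~{ratio:.1f}x as expensive as a human one")  # ~3.4x
```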
“Our infrastructure is built to sustain sudden spikes in human traffic during high-interest events,” the Wikimedia Foundation said in a recent planning document, “but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.”
And unlike humans, these bots don’t just stick to popular content. They aggressively target long-tail pages, development infrastructure, bug trackers and even Wikimedia’s code review platforms – resources rarely visited by ordinary users, but vital to the integrity of the organization’s technical ecosystem.
How is Wikipedia responding?
To counter this growing pressure, Wikimedia has launched a multi-pronged plan under the “Responsible Use of Infrastructure” strategy set out in its 2025/2026 annual planning document. The target? Cut scraper traffic by 20 percent in request volume and 30 percent in bandwidth consumption. But there is no straightforward path to get there.
According to the Foundation, initial steps include case-by-case blocking and rate-limiting of offending bots. But this tactic is a band-aid rather than a cure. Many aggressive scrapers can evade detection simply by changing their user-agent string or impersonating legitimate services such as Googlebot. Moreover, enforcement through the robots.txt protocol has proved unreliable, because many bots either ignore it or spoof their identities to sidestep its restrictions.
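A minimal sketch shows why user-agent filtering alone is so easy to defeat: the User-Agent header is entirely client-controlled, so a blocked scraper can simply claim to be something else. The blocklist and agent strings below are illustrative, not Wikimedia’s actual rules.

```python
BLOCKED_AGENTS = {"BadBot", "GPTBot"}  # hypothetical blocklist

def is_blocked(user_agent: str) -> bool:
    """Naive filter: reject requests whose User-Agent matches the blocklist."""
    return any(name in user_agent for name in BLOCKED_AGENTS)

print(is_blocked("GPTBot/1.2"))  # True: an honestly labelled bot is caught
# The same scraper after a one-line configuration change, now posing
# as a legitimate search-engine crawler:
print(is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
```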
“Our content is free, our infrastructure is not: we must act now to restore a healthy balance,”
the Wikimedia Foundation cautioned, stressing the urgent need for a broader structural response.
Is this just a Wikipedia problem?
Wikipedia is far from alone in this fight. Other community-driven or open-content platforms such as Reddit, SourceHut and iFixit have raised red flags too. In 2023, Reddit famously locked horns with Microsoft after discovering its content was being scraped without consent. Reddit CEO Steve Huffman did not mince words, calling the issue “a real pain in the ass.” Reddit ultimately responded by charging developers for API access, triggering a massive backlash and temporary subreddit blackouts.
The same friction is playing out across the open-source landscape. Projects such as Read the Docs and Diaspora have implemented countermeasures or issued warnings about excessive bot traffic. Creators are also turning to defensive tools such as Glaze, Nightshade and ArtShield, which are designed to inject misleading data or otherwise confuse AI models during training. Network-level defences like Kudurru and AI Labyrinth are also gaining traction as shields against unauthorized data extraction.
Why not block AI crawlers entirely?
One might ask: if the problem is so serious, why doesn’t Wikipedia block AI crawlers outright? The answer lies in balancing openness against sustainability. Wikimedia projects are built on an ethic of free and open access, and shutting out AI crawlers entirely would contradict those founding principles. The dilemma is both philosophical and practical.
Moreover, many of the worst offenders simply do not identify themselves accurately. They masquerade as search-engine bots or other legitimate services, making them hard to shut out without collateral damage. As The Register has noted, Wikipedia’s robots.txt blocks some known bad actors but notably omits entries for the crawlers of Google, OpenAI and Anthropic – all deeply involved in AI development. Without reliable scraper identification, even robust technical measures fall short.
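One well-documented defence against Googlebot impersonators is forward-confirmed reverse DNS: Google states that genuine Googlebot addresses reverse-resolve to hostnames under googlebot.com or google.com, and that the hostname must resolve back to the original IP. The sketch below is a simplified illustration of that check, not Wikimedia’s actual tooling; `verify_googlebot` performs live DNS lookups in a real deployment.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname: str) -> bool:
    """Check whether a reverse-DNS hostname sits under Google's crawler domains."""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS: reverse-resolve the IP, check the
    domain, then confirm the hostname resolves back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname_is_google(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False

# The hostname check alone already rejects naive impostors:
print(hostname_is_google("crawl-66-249-66-1.googlebot.com"))  # True
print(hostname_is_google("evil-scraper.example.net"))         # False
```

Note that checking the claimed User-Agent is worthless here; only the DNS round-trip ties the request to infrastructure the impersonator cannot control.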
Can AI coexist with open knowledge?
This situation highlights an ethical question the technology industry has long dodged: should AI companies pay for the resources they consume? Today, AI firms benefit enormously from free resources like Wikipedia. But their operations carry hidden costs – costs borne by the very communities that make the Internet worth exploring.
Just as the music industry eventually struck licensing deals with streaming platforms, it may be time for AI companies to explore revenue-sharing models or access fees for large-scale data use. After all, AI is no longer in its infancy. It is a multibillion-dollar industry with deep pockets and growing influence. If it remains dependent on freely available public resources, it must also be prepared to give back.
As Birgit Mueller pointed out in the Foundation’s public post:
“We must prioritize who we serve with these resources, and we want to promote human consumption.”
That doesn’t mean closing the doors. But it does mean AI developers must stop treating open platforms like an all-you-can-eat buffet.
What’s next for Wikipedia?
In the coming months, the Wikimedia Foundation plans to gather community feedback on how best to identify and regulate AI-driven traffic. Proposals include requiring bot operators to authenticate when requesting high-volume access or API use, and deploying smarter detection algorithms to distinguish harmless bots from predatory ones.
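The Foundation has not published the mechanics of such throttling, but a common building block for rate-limiting high-volume clients is a per-client token bucket: each client accrues request tokens at a fixed rate up to a burst ceiling, and requests arriving without a token are rejected. A hedged sketch of the idea (the rates shown are arbitrary):

```python
import time

class TokenBucket:
    """Per-client limiter: `rate` tokens per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill tokens for the time elapsed since the previous request.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A client allowed 2 requests/second with a burst allowance of 5,
# firing 10 requests over one second (timestamps are simulated):
bucket = TokenBucket(rate=2.0, capacity=5.0, now=0.0)
results = [bucket.allow(now=t * 0.1) for t in range(10)]
print(results.count(True))  # 6: the 5-token burst plus one refilled token
```

A human reader never notices a ceiling like this, while a crawler hammering the long tail is quickly forced down to the sustained rate.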
The Foundation is also weighing broader transparency obligations for AI companies. If a company scrapes millions of documents, should it be required to disclose its intent? Should it compensate the platforms it draws on? These are not just operational questions; they are ethical and regulatory ones that could shape the future of digital knowledge.
Ultimately, Wikipedia’s current crisis is not just about bots and bandwidth. It is about values. The Internet was built on ideals of sharing, openness and community. But as artificial intelligence becomes increasingly commercialized, it risks eroding the very pillars that allowed it to thrive. Unless businesses, policymakers and digital communities find common ground, the friction will only grow.
For now, Wikipedia is drawing a line in the digital sand – not to keep people out, but to keep the lights on for the people it was always meant to serve.