The Wikimedia Foundation, the organization that manages Wikipedia and more than a dozen other crowdsourced knowledge projects, said this week that bandwidth consumption for multimedia downloads from Wikimedia Commons has surged 50% since January 2024. The reason, the foundation wrote in a blog post, is not growing demand from knowledge-hungry humans but automated, data-hungry crawlers scraping media to train artificial intelligence models.

"Our infrastructure is built to withstand sudden surges in traffic from humans during high-profile events, but the volume of traffic generated by bots is unprecedented and comes with increasing risks and costs," the post reads. Wikimedia Commons is a freely accessible repository of images, video and audio files that are available under open licenses or are in the public domain.

Digging deeper, the foundation says that nearly two-thirds (65%) of its most "expensive" traffic, meaning the traffic that is most resource-intensive to serve, comes from bots, even though bots account for only 35% of overall pageviews. The reason for the disparity, according to the foundation, is that frequently accessed content is stored in caches close to the user, while less frequently accessed content sits farther away in "core data centers," from which it costs more to serve. That long-tail content is exactly what bots typically go after.

"While human readers tend to focus on specific (often similar) topics, crawler bots tend to 'batch read' large numbers of pages and visit less popular pages," the foundation wrote. "This means that these types of requests are more likely to be forwarded to core data centers, making them more expensive for our resources."

The upshot is that the Wikimedia Foundation's Site Reliability Team has to spend considerable time and resources blocking crawlers to avoid disruption for regular readers, and that is before counting the cloud costs the foundation faces.

All of this is part of a rapidly growing trend that threatens the very existence of the open internet. Last month, software engineer and open source advocate Drew DeVault complained that AI crawlers ignore "robots.txt" files designed to ward off automated traffic, and "pragmatic engineer" Gergely Orosz complained last week that AI crawlers from companies like Meta were driving up bandwidth demands on his own projects.

With open source infrastructure in particular in the firing line, developers are fighting back with "ingenuity and a vengeance." Some tech companies are doing their part to address the issue as well. Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow down crawlers. But it remains a cat-and-mouse game, one that could ultimately force many publishers to retreat behind logins and paywalls, to the detriment of everyone who uses the web today.
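The cache argument is easier to see with a toy model. The sketch below is not Wikimedia's actual infrastructure: the catalogue size, cache size, popularity skew, and relative serving costs are all invented for illustration. It only demonstrates how traffic that leans on unpopular, uncached pages can make up a minority of requests yet a majority of serving cost.

```python
import random

# Toy model (assumptions, not Wikimedia's setup): a fixed-size cache holds the
# most popular pages; everything else is served from a "core data center" at a
# higher notional cost.
random.seed(0)

NUM_PAGES = 100_000            # assumed catalogue size
CACHE_SIZE = 5_000             # assumed: only the most popular 5% is cached
CACHE_COST, CORE_COST = 1, 10  # assumed relative serving costs

def human_request():
    # Humans cluster on popular topics: skew requests toward low page IDs.
    return min(int(random.expovariate(1 / 2_000)), NUM_PAGES - 1)

def bot_request():
    # Crawlers "batch read" the long tail: pick any page uniformly at random.
    return random.randrange(NUM_PAGES)

def serving_cost(requests):
    return sum(CACHE_COST if page < CACHE_SIZE else CORE_COST for page in requests)

human_cost = serving_cost(human_request() for _ in range(65_000))  # 65% of pageviews
bot_cost = serving_cost(bot_request() for _ in range(35_000))      # 35% of pageviews

total = human_cost + bot_cost
print(f"bots: 35% of requests, {bot_cost / total:.0%} of serving cost")
```

Run as written, the bots account for roughly a third of requests but around three-quarters of the simulated serving cost; the exact split depends entirely on the assumed parameters, not on anything Wikimedia has published.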