Wikimedia Foundation: AI crawlers cause Wikimedia Commons bandwidth demand to surge 50%

The Wikimedia Foundation, the umbrella organization behind Wikipedia and more than a dozen other crowdsourced knowledge projects, says that bandwidth consumed by multimedia downloads from Wikimedia Commons has surged 50% since January 2024.

The reason stems not from growing demand from knowledge-hungry humans but from automated, data-hungry crawlers gathering material to train artificial intelligence models, the foundation wrote in a blog post on Tuesday.

“Our infrastructure is built to withstand sudden surges in traffic from humans during high-profile events, but the volume of traffic generated by bots is unprecedented and comes with increasing risks and costs,” the post reads.

Wikimedia Commons is a freely accessible repository of images, video and audio files that are available under open licenses or are in the public domain.

Digging into the numbers, the foundation says that nearly two-thirds (65%) of its most "expensive" traffic (i.e., the most resource-intensive in terms of the type of content consumed) comes from bots, even though bots account for only 35% of overall page views. According to Wikimedia, the reason for this disparity is that frequently accessed content is cached in data centers closer to the user, while less frequently requested content sits farther away in its "core data centers," from which it costs more to serve. That long-tail content is exactly what bots tend to request.

"While human readers tend to focus on specific (often similar) topics, crawler bots tend to 'batch read' large numbers of pages and visit less popular pages," Wikipedia wrote. "This means that these types of requests are more likely to be forwarded to core data centers, making them more expensive for our resources."

All in all, the Wikimedia Foundation’s Site Reliability Team has to spend a lot of time and resources blocking bots to avoid disruption to regular users. And that’s before considering the cloud costs the Foundation faces.

Wikimedia's problem is part of a rapidly growing trend that threatens the very existence of the open internet. Last month, software engineer and open source advocate Drew DeVault complained that AI crawlers ignore the “robots.txt” files meant to ward off automated traffic. And Gergely Orosz, who writes The Pragmatic Engineer newsletter, complained last week that AI crawlers from companies such as Meta had driven up the bandwidth demands of his own projects.
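
For context, robots.txt is purely advisory: a polite crawler checks it before fetching anything, and nothing technically prevents a crawler from skipping that check. The sketch below shows what voluntary compliance looks like using Python's standard-library robotparser; the user-agent string is a made-up placeholder and the URLs are just examples.

```python
from urllib import robotparser

# "ExampleAIBot" is a made-up user agent; real crawlers advertise their own strings.
USER_AGENT = "ExampleAIBot"

rp = robotparser.RobotFileParser()
rp.set_url("https://commons.wikimedia.org/robots.txt")
rp.read()  # fetch and parse the site's robots.txt over the network

url = "https://commons.wikimedia.org/wiki/Special:Random"
if rp.can_fetch(USER_AGENT, url):
    # A polite crawler would only issue the request on this branch,
    # ideally also honouring any Crawl-delay the site declares.
    print(f"robots.txt permits {USER_AGENT} to fetch {url}")
    print("declared crawl delay:", rp.crawl_delay(USER_AGENT))
else:
    print(f"robots.txt disallows {USER_AGENT} from fetching {url}")
```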

Open source infrastructure is particularly exposed, but developers are fighting back with “ingenuity and a vengeance.” Some tech companies are also doing their part to address the problem: Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow down crawlers.
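
Cloudflare has not published AI Labyrinth's internals here, so the following is only a generic sketch of the underlying idea: when a request looks like an unwanted crawler, serve machine-generated filler pages whose links lead only to more filler, so the bot wastes its crawl budget. The user-agent substrings, routes and content below are invented for illustration and do not reflect Cloudflare's implementation.

```python
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative list of crawler user-agent substrings to divert into the maze.
SUSPECT_AGENTS = ("GPTBot", "CCBot", "Bytespider")

def decoy_page(depth: int) -> str:
    """Generate a throwaway page whose links only lead to more throwaway pages."""
    links = "".join(
        f'<a href="/maze/{depth + 1}/{random.randrange(10**6)}">more</a> '
        for _ in range(20)
    )
    return f"<html><body><p>Filler text, level {depth}.</p>{links}</body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if any(bot in agent for bot in SUSPECT_AGENTS) or self.path.startswith("/maze/"):
            time.sleep(1)                          # small delay to further waste crawl time
            depth = self.path.count("/")           # crude "how deep in the maze" signal
            body = decoy_page(depth).encode()
        else:
            body = b"<html><body>Real content for human visitors.</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```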

However, this remains a cat-and-mouse game, one that could ultimately force many publishers behind logins and paywalls, to the detriment of everyone who uses the web today.
