Wikimedia Foundation: AI crawlers cause Wikimedia Commons bandwidth demand to surge 50%

The Wikimedia Foundation, the umbrella organization behind Wikipedia and more than a dozen other crowdsourced knowledge projects, said on Wednesday that bandwidth consumption for multimedia downloads from Wikimedia Commons has surged 50% since January 2024.

The cause is not growing demand from knowledge-hungry humans but automated, data-hungry crawlers scraping content to train artificial intelligence models, the foundation wrote in a blog post on Tuesday.

“Our infrastructure is built to withstand sudden surges in traffic from humans during high-profile events, but the volume of traffic generated by bots is unprecedented and comes with increasing risks and costs,” the post reads.

Wikimedia Commons is a freely accessible repository of images, video and audio files that are available under open licenses or are in the public domain.

Digging deeper, Wikimedia says that nearly two-thirds (65%) of its most "expensive" traffic (that is, the most resource-intensive in terms of the type of content consumed) comes from bots, even though bots account for only 35% of overall page views. According to Wikimedia, the disparity arises because frequently accessed content is served from a cache close to the user, while less frequently accessed content is stored farther away in "core data centers," from which it costs more to serve. That less popular content is exactly what bots typically seek out.

"While human readers tend to focus on specific (often similar) topics, crawler bots tend to 'batch read' large numbers of pages and visit less popular pages," Wikimedia wrote. "This means that these types of requests are more likely to be forwarded to core data centers, making them more expensive for our resources."
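The caching effect described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a model of Wikimedia's actual infrastructure: the LRU cache, the page names, the 2:1 request mix, and the cache size are all hypothetical. Humans keep rereading a small set of popular pages, while a crawler walks through pages nobody else requests, so each crawler request misses the cache and would have to be served from the more expensive core data center.

```python
from collections import OrderedDict


class EdgeCache:
    """Tiny LRU cache standing in for a cache located close to users."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def request(self, page):
        """Return True on a cache hit, False on a miss (served from 'core')."""
        if page in self.store:
            self.store.move_to_end(page)  # mark as most recently used
            return True
        self.store[page] = True
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used page
        return False


def simulate(rounds=1000, capacity=50, popular_pages=10):
    """Interleave human and crawler requests against one shared cache.

    Assumption: each round has two human requests (cycling over a few
    popular pages) and one crawler request (always a page not seen before).
    """
    cache = EdgeCache(capacity)
    stats = {
        "human": {"requests": 0, "misses": 0},
        "bot": {"requests": 0, "misses": 0},
    }

    def record(source, page):
        stats[source]["requests"] += 1
        if not cache.request(page):
            stats[source]["misses"] += 1

    for r in range(rounds):
        record("human", f"popular-{(2 * r) % popular_pages}")
        record("human", f"popular-{(2 * r + 1) % popular_pages}")
        record("bot", f"rare-{r}")  # crawler "batch reads" unpopular pages
    return stats


if __name__ == "__main__":
    for source, s in simulate().items():
        print(source, s)
```

In this toy setup the crawler makes only a third of the requests yet causes nearly all of the cache misses, which is the same kind of imbalance Wikimedia describes, though the real proportions depend on factors this sketch ignores.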

All in all, the Wikimedia Foundation’s Site Reliability Team has to spend a lot of time and resources blocking bots to avoid disruption to regular users. And that’s before considering the cloud costs the Foundation faces.

The situation is part of a rapidly growing trend that threatens the open internet. Last month, software engineer and open source advocate Drew DeVault complained that AI crawlers were ignoring “robots.txt” files designed to ward off automated traffic. And “pragmatic engineer” Gergely Orosz complained last week that AI crawlers from companies like Meta were driving up bandwidth demands on his own projects.
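For readers unfamiliar with robots.txt, the sketch below shows how a well-behaved crawler is expected to consult it, using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleAIBot` user agent, and the example.org URLs are hypothetical. The file is purely advisory: nothing in the protocol stops a crawler from skipping this check, which is the behavior DeVault complained about.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler entirely,
# and keep every other client out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("ExampleAIBot", "https://example.org/wiki/Page"))
    print(is_allowed("SomeBrowser", "https://example.org/wiki/Page"))
    print(is_allowed("SomeBrowser", "https://example.org/private/data"))
```

A compliant crawler would call something like `is_allowed` before every fetch and skip disallowed URLs; enforcement against non-compliant crawlers has to happen elsewhere, for example through server-side blocking.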

While open source infrastructure is particularly in the firing line, developers are fighting back with “ingenuity and a vengeance.” Some tech companies are also doing their part to address the problem. Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow down crawlers.

However, this remains a cat-and-mouse game that could ultimately force many publishers to hide behind logins and paywalls, to the detriment of everyone using the web today.

From Chinese Industry Information Station
