mgr's weblog


AI web scraping running wild

July 26, 2024, Miscellaneous
Last edited on July 26, 2024

I have been running https://www.poezio.net for more than 20 years now. It is a poetry database written for the hundreds of Esperanto translations by my father; at the same time, it is one of my first Common Lisp programs. Output can be arranged dynamically in multi-column PDF documents generated by TeX, with support for many languages.

It has been running almost unchanged for many years. It was originally written in 2003, with one overhaul in 2010 that added new CSS and a logo to give it a fresh look. Occasionally I had to move it to a new host, then into a virtual server, which I had to convert again. Now it sits behind some proxy servers, but behind the scenes it is still the same, unchanged for more than a decade.

At the beginning of 2010, performance degraded as Google started to crawl the site constantly for new combinations of translations and to have them exported as PDF. Because the different language versions can be combined dynamically, the site looked huge to a crawler. So I had to throttle Google a bit and change the logic so that the UI would no longer allow the same translation to be repeated multiple times. That was it. For more than 14 years.

Until now.

In April 2024 this changed completely. The site is constantly overloaded. And it is no longer just one bot, Googlebot/2.1, but also: AhrefsBot/7.0, Amazonbot/0.1, Applebot/0.1, bingbot/2.0, Bytespider, ClaudeBot/1.0, DotBot/1.2, DuckDuckBot/1.1, PetalBot, SeekportBot, SemrushBot/7~bl, serpstatbot/2.1, YandexBot/3.0, CCBot/2.0, ChatGPT-User/1.0, MojeekBot/0.11, Mail.RU_Bot/2.0, DataForSeoBot/1.0, GPTBot/1.2.

I had to adapt the setup on 2024-05-14, and again on 2024-07-24 to add more bots. Before that, it had gone untouched for far longer: the previous modification was on 2010-01-20. And it is still not in a good state. Every day my monitoring sends me mails saying that availability is reduced.
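To give an idea of what throttling bots by name can look like, here is a minimal sketch in Python. It is not the setup running in front of poezio.net. The names `bot_token`, `TokenBucket` and `BotThrottle`, the rate and the burst size are all made up for illustration. It recognizes a crawler by a substring of its User-Agent header, using the names from the list above, and gives every crawler its own small request budget, while ordinary browsers pass through untouched.

```python
import time

# Crawler names from the list above, without version numbers.
# Matching is a case-insensitive substring test on the User-Agent header.
BOT_TOKENS = (
    "AhrefsBot", "Amazonbot", "Applebot", "bingbot", "Bytespider",
    "ClaudeBot", "DotBot", "DuckDuckBot", "Googlebot", "PetalBot",
    "SeekportBot", "SemrushBot", "serpstatbot", "YandexBot", "CCBot",
    "ChatGPT-User", "MojeekBot", "Mail.RU_Bot", "DataForSeoBot", "GPTBot",
)


def bot_token(user_agent):
    """Return the first known crawler name found in the header, or None."""
    ua = user_agent.lower()
    for token in BOT_TOKENS:
        if token.lower() in ua:
            return token
    return None


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, now=None):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class BotThrottle:
    """One bucket per crawler name; unknown user agents are never throttled."""

    def __init__(self, rate=0.2, burst=5):  # hypothetical limits
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, user_agent, now=None):
        token = bot_token(user_agent)
        if token is None:
            return True
        bucket = self.buckets.get(token)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, now)
            self.buckets[token] = bucket
        return bucket.allow(now)
```

A front end would call `BotThrottle.allow()` with the User-Agent of each request and could answer with HTTP status 429 (Too Many Requests) whenever it returns `False`. Such a scheme only helps against crawlers that identify themselves honestly, and every new bot has to be added to the list by hand.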

And this is only this one little web site. Imagine the impact worldwide. This constant scraping must be truly massive and must cause immense power consumption across servers and networks around the world.
