mgr's weblog

Archives for July 2024

AI web scraping running wild

July 26, 2024, Miscellaneous
Last edited on July 26, 2024

I have been running https://www.poezio.net for more than 20 years now. It is a poetry database written for my father's hundreds of Esperanto translations, and at the same time it is one of my first Common Lisp programs. Output can be arranged dynamically in multi-column PDF documents generated by TeX, with support for many languages.

It has been running almost unchanged for many years. It was originally written in 2003, with one overhaul in 2010 that added new CSS and a logo to give it a fresh look. Occasionally I had to move it to a new host, then into a virtual server, which I had to convert again. Now it sits behind some proxy servers, but it is still the same behind the scenes, unchanged for more than a decade.

At the beginning of 2010 performance degraded, as Google started to crawl the site constantly for new combinations of translations and to have them exported as PDF. As the different language versions can be combined dynamically, the site looked huge to a crawler. So I had to throttle Google a bit and change the logic so that the UI would no longer allow the same translation to be repeated multiple times. That was it. For more than 14 years.

Until now.

In April 2024 this changed completely. The page is constantly overloaded. And it is not just one bot, Googlebot/2.1, but a whole crowd: AhrefsBot/7.0, Amazonbot/0.1, Applebot/0.1, bingbot/2.0, Bytespider, ClaudeBot/1.0, DotBot/1.2, DuckDuckBot/1.1, Googlebot/2.1, PetalBot, SeekportBot, SemrushBot/7~bl, serpstatbot/2.1, YandexBot/3.0, CCBot/2.0, ChatGPT-User/1.0, MojeekBot/0.11, Mail.RU_Bot/2.0, DataForSeoBot/1.0, GPTBot/1.2.

I had to cater for this on 2024-05-14, and again on 2024-07-24, adding more bots. Before that, things had been quiet for much longer, with the last modification on 2010-01-20. And it is still not in a good state. Every day I get mails from my monitoring that the availability is reduced.
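The kind of per-bot throttling described here can be sketched roughly as follows. This is a minimal illustration in Python, not the configuration actually used on the site: the names `bot_name` and `BotThrottle`, the substring matching, and the ten-second interval are all assumptions. Only the bot tokens come from the list above.

```python
import time

# Bot tokens taken from the list in the post (version numbers dropped).
KNOWN_BOTS = (
    "AhrefsBot", "Amazonbot", "Applebot", "bingbot", "Bytespider",
    "ClaudeBot", "DotBot", "DuckDuckBot", "Googlebot", "PetalBot",
    "SeekportBot", "SemrushBot", "serpstatbot", "YandexBot", "CCBot",
    "ChatGPT-User", "MojeekBot", "Mail.RU_Bot", "DataForSeoBot", "GPTBot",
)

def bot_name(user_agent):
    """Return the first known bot token found in the User-Agent, else None."""
    ua = user_agent.lower()
    for token in KNOWN_BOTS:
        if token.lower() in ua:
            return token
    return None

class BotThrottle:
    """Hypothetical throttle: at most one request per bot per `min_interval` seconds."""

    def __init__(self, min_interval=10.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable so the logic can be tested
        self.last_allowed = {}      # bot token -> time of last allowed request

    def allow(self, user_agent):
        bot = bot_name(user_agent)
        if bot is None:
            return True             # regular visitors are never throttled
        now = self.clock()
        last = self.last_allowed.get(bot)
        if last is not None and now - last < self.min_interval:
            return False            # too soon; the caller might answer 429
        self.last_allowed[bot] = now
        return True
```

A front-end proxy or the application itself could call `allow()` with the request's User-Agent header and reject the request when it returns `False`. Keying on the bot token rather than on the IP address is a deliberate simplification here, since crawlers typically arrive from many addresses.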

And this is only this one little web site. Imagine the impact worldwide. This constant scraping must be truly massive and cause immense power consumption across all the servers and networks around the world.

Pushing the logic to the data – Running Dydra's revisioning algorithm within RonDB's data nodes

July 23, 2024, Lisp
Last edited on July 26, 2024

July 23, 2024. News release.

Datagraph GmbH, Berlin. – We are working on ways to distribute not just data storage, but also query processing. Dydra, our graph database, can use a RonDB NDB Cluster as storage backend for large repositories of billions of triples. We discussed with Mikael Ronström of RonDB and Hopsworks how we could improve our use of RonDB and its advanced features, and proposed to investigate how extensions to the RonDB interpreted code language could make it possible to move our revision visibility test from the core of our graph database system to an implementation that runs in RonDB's data nodes. That would distribute the processing load and reduce the transferred data.

That is, we could push the test to the data instead of having to pull all data to the test.

In Dydra, all data can be revisioned. For that, each statement can have a vector of revision ordinals associated with it to describe its visibility, and thus its full history. Now, if the revision visibility test already runs in the data nodes, queries that involve scans over the data will fetch only those statements that match the specified revision, and the visibility information will not leave the data nodes, greatly reducing the data that has to move.

In recent weeks, Mikael not only implemented the minimal set of our proposal but took it as an opportunity to overhaul the interpreted code language greatly, adding dozens of new commands, changing the instruction format to allow for even more commands, supporting both full and partial reads of data columns into memory and then copying out parts of that data into registers of the register machine, and more. This will allow for interesting new optimizations of many applications based on RonDB.

Max-Gerd Retzlaff of Dydra and Datagraph implemented a nifty little compiler on top of the interpreter that allows writing NDB interpreted code for RonDB in more high-level Lisp code rather than as raw NDB interpreted code (NDB IC) instructions for the virtual register machine that executes them. So you can write your logic with high-level conditionals such as IF, WHEN and COND (which is Lisp's IF..ELSEIF..ELSE construct) rather than having to define labels and use branch and jump instructions, which is more in the fashion of writing assembly.

At the same time, this Lisp NDB IC Compiler can not only compile to NDB interpreted code instructions but also to regular Common Lisp. This allows for testing and debugging of algorithms within the Lisp development image with all its tooling available, and for shifting over to NDB IC instructions only when new code has passed all tests.
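The two ideas, lowering a high-level conditional to labels and branches, and keeping a second backend for testing, can be illustrated with a toy example. The sketch below is written in Python rather than Lisp, and everything in it is hypothetical: the expression forms, the instruction names (`LOAD_CONST`, `MOV`, `ADD`, `BRANCH_GE`, `JUMP`, `LABEL`) and the tiny register machine are invented for illustration and are not the NDB IC instruction set or the CL-NDBAPI compiler.

```python
import itertools

def evaluate(expr, env):
    """Backend 1 (reference): evaluate the expression tree directly."""
    op = expr[0]
    if op == "const":
        return expr[1]
    if op == "var":
        return env[expr[1]]
    if op == "add":
        return evaluate(expr[1], env) + evaluate(expr[2], env)
    if op == "if-lt":  # ("if-lt", a, b, then, else): then if a < b, else otherwise
        if evaluate(expr[1], env) < evaluate(expr[2], env):
            return evaluate(expr[3], env)
        return evaluate(expr[4], env)
    raise ValueError(f"unknown form: {op}")

def compile_expr(expr, code, counter):
    """Backend 2: append instructions to `code`; return the result register."""
    op = expr[0]
    if op == "var":
        return expr[1]                    # variables live in same-named registers
    dst = f"t{next(counter)}"             # fresh temporary (assumes no variable is named tN)
    if op == "const":
        code.append(("LOAD_CONST", dst, expr[1]))
    elif op == "add":
        a = compile_expr(expr[1], code, counter)
        b = compile_expr(expr[2], code, counter)
        code.append(("ADD", dst, a, b))
    elif op == "if-lt":
        a = compile_expr(expr[1], code, counter)
        b = compile_expr(expr[2], code, counter)
        else_label = f"else{next(counter)}"
        end_label = f"end{next(counter)}"
        code.append(("BRANCH_GE", a, b, else_label))   # skip "then" when a >= b
        then_reg = compile_expr(expr[3], code, counter)
        code.append(("MOV", dst, then_reg))
        code.append(("JUMP", end_label))
        code.append(("LABEL", else_label))
        else_reg = compile_expr(expr[4], code, counter)
        code.append(("MOV", dst, else_reg))
        code.append(("LABEL", end_label))
    else:
        raise ValueError(f"unknown form: {op}")
    return dst

def compile_program(expr):
    code = []
    result = compile_expr(expr, code, itertools.count())
    return code, result

def run(code, env):
    """Toy register machine that executes the instructions emitted above."""
    regs = dict(env)
    labels = {ins[1]: i for i, ins in enumerate(code) if ins[0] == "LABEL"}
    pc = 0
    while pc < len(code):
        ins = code[pc]
        if ins[0] == "LOAD_CONST":
            regs[ins[1]] = ins[2]
        elif ins[0] == "MOV":
            regs[ins[1]] = regs[ins[2]]
        elif ins[0] == "ADD":
            regs[ins[1]] = regs[ins[2]] + regs[ins[3]]
        elif ins[0] == "BRANCH_GE" and regs[ins[1]] >= regs[ins[2]]:
            pc = labels[ins[3]]
            continue
        elif ins[0] == "JUMP":
            pc = labels[ins[1]]
            continue
        pc += 1                           # LABEL and untaken branches fall through
    return regs

# Clamp x into the range 0..10, written with nested conditionals.
CLAMP = ("if-lt", ("var", "x"), ("const", 0), ("const", 0),
         ("if-lt", ("const", 10), ("var", "x"), ("const", 10), ("var", "x")))
```

Running `compile_program(CLAMP)` yields a flat instruction list with generated labels, while `evaluate(CLAMP, {"x": 5})` computes the same result without any lowering. Comparing the two over a range of inputs is the toy equivalent of testing an algorithm in the development image before switching to the register-machine target.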

Max used the new compiler to reimplement, and test, Dydra's revision visibility algorithm, which is based on binary search with a number of corner cases, to directly work on the visibility data stored as a VARCHAR column in RonDB.
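To make the idea of a binary-search visibility test concrete, here is a minimal sketch in Python. The encoding is an assumption made purely for illustration and is not Dydra's actual storage format: the visibility vector is taken to be an ascending list of revision ordinals whose entries alternately mark the statement becoming visible and becoming invisible again. The names `visible_at` and `scan` are hypothetical, and the corner cases of the real algorithm are not covered.

```python
from bisect import bisect_right

def visible_at(ordinals, revision):
    """Return True if a statement is visible at `revision`.

    ASSUMPTION (illustration only): `ordinals` is sorted ascending and
    alternates between "inserted at" (even index) and "deleted at" (odd index).
    """
    # Binary search: count the visibility changes at or before `revision`.
    changes = bisect_right(ordinals, revision)
    # An odd number of changes means the most recent one was an insertion.
    return changes % 2 == 1

def scan(rows, revision):
    """Filter where the data lives: only matching statements are returned,
    and the visibility vectors themselves are never handed to the caller."""
    return [statement for statement, ordinals in rows
            if visible_at(ordinals, revision)]

rows = [
    ("s1", [3]),          # inserted in revision 3, never deleted
    ("s2", [3, 7]),       # inserted in 3, deleted in 7
    ("s3", [5, 7, 12]),   # inserted in 5, deleted in 7, re-inserted in 12
]
```

With this toy data, `scan(rows, 6)` returns all three statements and `scan(rows, 7)` returns only `s1`. The caller never sees the ordinals, which mirrors the point of running the test inside the data nodes.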

Data that used to be opaque to RonDB, and that had to be retrieved from the data nodes and interpreted by the Dydra query processor, is now analyzed by NDB interpreted code within the RonDB cluster itself.

More information and detailed performance testing to follow.

Availability

The Lisp NDB IC compiler is part of a development branch of CL-NDBAPI, our Open Source Common Lisp bindings to the C++ NDB API of RonDB, available at https://github.com/datagraph/cl-ndbapi.

This branch is currently based on the preview version rondb-22.10.97 of RonDB, which was made for the pull request "RONDB-671: Add a set of new instructions to interpreter making it more complete" at https://github.com/logicalclocks/rondb/pull/472. These changes are scheduled to be in the RonDB 24.10 development tree in late August.

Our work will be made available in CL-NDBAPI's repository at https://github.com/datagraph/cl-ndbapi once the work on the new RonDB branch, and in turn our development version, has stabilized.
