Bot Wars 4.0: Protecting Digital Collections from AI

To understand and generate human-like interactions, AI—particularly large language and generative models—must be trained on vast volumes of data. Fueling this training process are bots, constantly scouring the web for content to ingest—ushering in what is now the fourth wave of the ongoing Bot Wars.

Rich in structured, text-heavy data, digitized historical collections unfortunately represent an ideal target—an abundant reservoir these bots, with their insatiable appetite, are eager to exploit.

Before diving into this latest chapter—and how the current wave is being fought on the front lines—let’s briefly revisit how the Bot Wars have evolved over time.

The First Wave: Dawn of Indexing

It began with the rise of search engine indexing. As digitized collections became publicly accessible online, indexing offered a powerful way to boost visibility—furthering the goal of broader public access. However, some unintended consequences soon emerged—most notably, the unauthorized indexing of data and content that was never meant to be publicly accessible. In many cases, open indexing made substantial portions of material directly available through search results, diverting traffic away from the host platforms. This diminished opportunities for meaningful engagement, as organisations had limited control over how their content was displayed or the context in which it was consumed.

To manage early concerns around uncontrolled indexing, many organisations—including Veridian—implemented robots.txt files to define which content could be accessed or indexed. To their credit, most major search engines respected these protocols, empowering organisations to share knowledge more widely without sacrificing context or control.

Even today, robots.txt remains a valuable tool in our broader strategy—offering a simple yet effective way to manage access at the source.
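
For readers who want to see how this works in practice, the short Python sketch below shows how a well-behaved crawler consults robots.txt before fetching anything, using the standard urllib.robotparser module. The collection URL and bot name are hypothetical, chosen purely for illustration.

```python
# Illustrative only: how a compliant crawler consults robots.txt before
# fetching a page. The host and user-agent string below are hypothetical.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://collection.example.org/robots.txt")  # hypothetical host
rp.read()  # download and parse the robots.txt rules

# A compliant bot checks each URL against the rules for its user agent
# before requesting it; non-compliant bots simply skip this step.
for url in (
    "https://collection.example.org/newspapers/1905-06-12/page-1",
    "https://collection.example.org/private/export",
):
    allowed = rp.can_fetch("ExampleIndexBot", url)
    print(url, "->", "allowed" if allowed else "disallowed")
```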

The Second Wave: Exploitation and Commercialization

Following the dawn of indexing, the next phase of the Bot Wars involved unauthorized scraping for commercial exploitation. In our industry, this was led by rogue genealogy websites, where bots—disguised as regular user traffic—began ingesting issue- and page-level PDFs.

In response to this wave, Veridian introduced technical safeguards such as reCAPTCHA and restricted PDF access to specific user groups. At the same time, we strengthened licensing agreements and terms of use to clearly define permissible access and usage. Together, these measures proved effective in slowing automated scraping and restoring a level of control over how content was accessed and reused.
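
As a rough illustration of the idea rather than Veridian's actual implementation, the sketch below limits a PDF download route to signed-in users in an authorised group. The Flask framework, the group names, and the storage path are all assumptions made for this example.

```python
# A minimal sketch, assuming Flask and a session-based login: PDF downloads
# are limited to users in an authorised group; everyone else gets a 403.
from flask import Flask, abort, send_file, session

app = Flask(__name__)
app.secret_key = "replace-me"  # placeholder; a real deployment uses a proper secret

AUTHORISED_GROUPS = {"subscribers", "staff"}  # hypothetical group names

@app.route("/download/pdf/<issue_id>")
def download_pdf(issue_id):
    # Anonymous traffic, including bots disguised as regular users, is refused.
    if session.get("user_group") not in AUTHORISED_GROUPS:
        abort(403)
    return send_file(f"/data/pdfs/{issue_id}.pdf")  # hypothetical storage path
```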

The Third Wave: Image-Focused Exploitation and Commercialization

With safeguards in place to restrict unauthorized access to PDFs, the third wave of the Bot Wars brought a shift in tactics—this time targeting images. A new generation of bots began harvesting the individual image tiles used to render digitized content in on-screen viewers. These tiles, part of the digital display layer, became attractive targets due to their accessibility and high-resolution quality.

Bots bypassed standard protections by mimicking legitimate user behaviour and systematically downloading tile sequences to reconstruct full-page images offline. In response, our team developed custom mechanisms such as server-side request validation, throttling, and tailored logic that verifies the legitimacy of requests to our collections.
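
To give a sense of what throttling can look like, here is a much-simplified sliding-window rate limiter for tile requests. The window length, request limit, and client key are illustrative assumptions, not the rules we actually run in production.

```python
# A simplified sketch of per-client throttling for image-tile requests.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60          # look-back window for counting tile requests
MAX_TILES_PER_WINDOW = 300   # roughly a few pages' worth of tiles for a human reader

_request_history = defaultdict(deque)  # client key -> recent request timestamps

def allow_tile_request(client_key):
    """Return True if this client may fetch another image tile right now."""
    now = time.time()
    history = _request_history[client_key]

    # Discard timestamps that have fallen outside the sliding window.
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()

    if len(history) >= MAX_TILES_PER_WINDOW:
        return False  # looks like bulk tile harvesting: throttle or challenge

    history.append(now)
    return True
```

In practice, a client key would combine several signals (for example IP address, session, and user agent), and a rejection would typically trigger a challenge rather than a silent block.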

The New Frontline: AI and Large Language Models

Now we face an unprecedented fourth wave. The rise of AI—particularly large language and generative models—has reignited aggressive bot activity, this time with new intent and greater sophistication. Bots designed to harvest content for AI training have emerged in force. While some mimic the behaviour of legitimate indexing bots and adhere to robots.txt directives, others deliberately obscure their identities, blending in with regular user traffic in an attempt to gain unauthorized access.

To manage these risks, many organisations actively monitor access patterns, looking for anomalies that may signal scraping activity. Yet as these operations become increasingly distributed—sometimes leveraging thousands of IP addresses, each requesting small fragments of content to avoid detection—traditional defences like IP blocking are less effective.
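
One complementary tactic is to aggregate traffic above the level of individual IP addresses. The sketch below, which assumes a simple (ip, path) log format and an arbitrary threshold, counts content-heavy requests per /24 subnet so that many low-volume addresses in the same range still add up to a visible scraping footprint.

```python
# An illustrative sketch of spotting distributed scraping by aggregating
# access-log entries per /24 subnet rather than per individual IP address.
from collections import Counter
from ipaddress import ip_network

def subnet_of(ip):
    """Collapse an IPv4 address into its /24 subnet, e.g. '203.0.113.0/24'."""
    return str(ip_network(f"{ip}/24", strict=False))

def flag_suspicious_subnets(log_entries, threshold=1000):
    """log_entries: iterable of (ip, path) tuples parsed from the access log."""
    per_subnet = Counter()
    for ip, path in log_entries:
        # Count only content-heavy endpoints such as image tiles or PDFs.
        if "/tiles/" in path or path.endswith(".pdf"):
            per_subnet[subnet_of(ip)] += 1
    # Many low-volume IPs in one subnet can still sum to an obvious footprint.
    return [subnet for subnet, count in per_subnet.items() if count >= threshold]
```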

The impact of mass scraping goes well beyond copyright concerns—we’ve seen firsthand the significant strain these bots can place on servers. This not only degrades the performance of digital collections but also disrupts the experience for legitimate users. The increased server load often leads to higher infrastructure costs and added support effort to diagnose, mitigate, and recover from these attacks.

Fortunately, infrastructure providers such as Cloudflare, Blackwall (previously BotGuard), Amazon, and Google offer advanced bot detection and mitigation solutions. These tools apply behavioural analytics at scale, and some are even turning AI against itself—deploying decoy content or dynamically generated page environments designed to trap and delay bots, consuming their resources without exposing genuine content.
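
To make the decoy idea concrete, here is a deliberately toy sketch: clients flagged by a bot-detection layer receive a delayed, dynamically generated filler page instead of real collection content. The flagging signal, delay range, and filler text are all assumptions; commercial tools are far more sophisticated.

```python
# A toy illustration of decoy content: flagged clients get slow, generated
# filler instead of real pages, wasting the scraper's time and resources.
import random
import time

def serve_page(client_is_flagged, real_page):
    """Return real content for genuine users, a slow decoy page for flagged clients."""
    if not client_is_flagged:
        return real_page  # genuine users always see genuine content

    # Flagged clients are delayed and fed meaningless generated text,
    # consuming the scraper's time without exposing the collection itself.
    time.sleep(random.uniform(2.0, 5.0))
    filler = ["archive", "record", "folio", "column", "gazette", "notice"]
    return " ".join(random.choices(filler, k=500))
```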

At Veridian, we’re working closely with leading infrastructure partners to safeguard our customers. While the landscape is changing fast, we’re continuously adapting our protections to ensure your collections remain secure—without compromising the Veridian features you’ve come to know and love.