In March 2025, a U.S. district court ruled that The New York Times' copyright infringement claims against OpenAI (the maker of ChatGPT) could proceed. This decision could signal a major shift in how copyright law is applied in the age of artificial intelligence, setting important precedents for how copyrighted material may be used to develop generative AI and large language models.
This raises a timely question: could other types of AI models—specifically Retrieval-Augmented Generation (RAG) models—also pose similar copyright concerns?
Related reading: A New Way to Search Digital Collections: What is RAG?
A quick media recap…
In December 2023, the New York Times filed a lawsuit against OpenAI (and, by association, Microsoft), alleging that millions of its articles were used to train large language models (LLMs) without permission. The core of the complaint is that these generative models absorbed and reproduced NYT’s original journalism—including news articles, how-to guides, investigative reporting, and opinion pieces—directly undermining the service provided by the newspaper.
As a result, attention has turned to how other AI models, including RAG-based systems, handle source material—especially as they gain traction among libraries, archives, and cultural institutions looking to improve discovery across digitised collections, including historical newspapers.
Should we be concerned about RAG models?
In short: We don’t believe so—and here’s why. Unlike traditional generative models or LLMs, RAG models do not train on or ingest external documents into the model itself. Instead, they retrieve relevant content in real time from an internal vector database using semantic search. This retrieved content is then passed to a language model, which generates a response based solely on that material. The original source documents remain untouched and separate from the model’s training data.
Here’s how it works:
- A user submits a query, e.g. “What was the impact of the 1918 influenza pandemic on New Zealand’s economy?”
- The model converts the query into a vector (a list of numbers) that represents the semantic meaning of the query.
- The system searches a vector database constructed from the collection’s documents to find those whose vectors are most similar to the query vector.
- The documents from the collection that best match the query are retrieved and presented to the user.
- An LLM generates a human-like response to the original question, based only on the documents retrieved in the previous step; a simplified code sketch of this flow follows below.
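To make this concrete, here is a minimal Python sketch of that retrieval flow. It is an illustration only: the document texts and IDs are invented, the embedding function is a toy bag-of-words stand-in for a real embedding model, and the final LLM call is left as a placeholder.

```python
# A minimal, self-contained sketch of the RAG flow described above.
# The embedding and generation steps are toy placeholders: a real system
# would use a proper embedding model and a language model.

import math
from collections import Counter

# A tiny in-memory "collection" standing in for digitised documents (invented examples).
COLLECTION = {
    "doc-001": "The 1918 influenza pandemic disrupted trade, shipping and labour across New Zealand.",
    "doc-002": "Wool and dairy prices in New Zealand recovered slowly after wartime export controls.",
    "doc-003": "Public health measures in 1918 closed schools, shops and public gatherings for weeks.",
}

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words vector. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The "vector database": pre-computed vectors for every document in the collection.
VECTOR_DB = {doc_id: embed(text) for doc_id, text in COLLECTION.items()}

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Embed the query and return the IDs of the most semantically similar documents."""
    query_vec = embed(query)
    ranked = sorted(VECTOR_DB, key=lambda doc_id: cosine_similarity(query_vec, VECTOR_DB[doc_id]), reverse=True)
    return ranked[:top_k]

def answer(query: str) -> str:
    """Build a prompt containing ONLY the retrieved documents; an LLM would generate from this."""
    doc_ids = retrieve(query)
    context = "\n\n".join(f"[{doc_id}] {COLLECTION[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer the question using only the sources below, citing them by ID.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )  # in practice: pass this prompt to a language model

print(answer("What was the impact of the 1918 influenza pandemic on New Zealand’s economy?"))
```

The property that matters for copyright is visible in the last function: the generation prompt contains nothing but material retrieved from the collection itself.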
How does RAG mitigate copyright issues?
- LLMs typically generate responses based on the vast amounts of data they were trained on, which can lead to the reproduction of copyrighted material. In contrast, when an LLM is used in the generation step of a RAG pipeline, it creates responses dynamically using only the content retrieved from your collection. It doesn’t rely on or access external sources, copyrighted or otherwise, which helps mitigate copyright risk while keeping responses grounded in your own collection.
- Since RAG retrieves specific information from a vector database in response to the user’s query, the model can cite the original source content used to generate each response. This not only enhances transparency but also supports the ethical use of copyrighted material.
- Building on the points above, one of the key advantages of RAG is the visibility it gives into the sources behind generated responses. Developers within organisations using these models can track and manage where information originates, helping to avoid potential copyright infringement altogether.
- RAG models can be configured to access only approved or licensed databases and sources, adding an extra layer of control and protection against copyright infringement. They can also support source-level access controls, such as content filtering, access logs, and query auditing, enabling organisations to track how often specific source material is accessed, and by whom. This helps support compliance with contractual obligations, especially for organisations with tiered access settings (e.g. some collections restricted to internal researchers only). A sketch of how these controls might look in practice follows this list.
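To illustrate the last two points, the sketch below (building on the earlier example) adds source-level access controls around retrieval: each document carries an access tier, retrieval is filtered to what the user is entitled to see, every access is logged for auditing, and the response prompt cites the retrieved sources by ID. The tier names, metadata fields, and log format are illustrative assumptions rather than a description of any particular system.

```python
# A sketch of source-level access control and auditing around RAG retrieval.
# Tier names, metadata fields and the log format are illustrative assumptions.

import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
audit_log = logging.getLogger("rag.audit")

@dataclass
class SourceDocument:
    doc_id: str
    text: str
    access_tier: str  # e.g. "public" or "internal"

COLLECTION = [
    SourceDocument("doc-001", "Out-of-copyright newspaper article from 1918 ...", "public"),
    SourceDocument("doc-002", "Licensed article restricted to internal researchers ...", "internal"),
]

def allowed_documents(user_tier: str) -> list[SourceDocument]:
    """Content filtering: only expose sources the user's tier is entitled to see."""
    if user_tier == "internal":
        return list(COLLECTION)  # internal researchers can see everything
    return [doc for doc in COLLECTION if doc.access_tier == "public"]

def retrieve_with_audit(query: str, user_id: str, user_tier: str) -> list[SourceDocument]:
    """Retrieve only from the permitted subset, and record who accessed which sources."""
    # In a real system the vector search from the earlier sketch would run over
    # `candidates`; here we simply return the permitted subset for brevity.
    candidates = allowed_documents(user_tier)
    for doc in candidates:
        audit_log.info("user=%s tier=%s query=%r source=%s", user_id, user_tier, query, doc.doc_id)
    return candidates

def cited_answer(query: str, user_id: str, user_tier: str) -> str:
    """Build a response prompt that cites the retrieved sources by ID."""
    docs = retrieve_with_audit(query, user_id, user_tier)
    citations = ", ".join(doc.doc_id for doc in docs)
    context = "\n".join(f"[{doc.doc_id}] {doc.text}" for doc in docs)
    return f"Answer using only these sources ({citations}):\n{context}\n\nQuestion: {query}"

print(cited_answer("1918 pandemic and the economy", user_id="researcher-42", user_tier="public"))
```

Because the filtering happens before retrieval, restricted material never reaches the prompt, and the audit log can provide the evidence needed for contractual reporting.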
In a landscape where AI and copyright are increasingly intertwined, RAG represents a promising and responsible pathway for AI-powered discovery. By enabling natural-language queries and semantic search, RAG enhances access to archival material—while preserving the integrity and respecting the copyright of original sources.
That’s why at Veridian, we’re focused on developing AI-integrated features that run locally within our collections and remain entirely under our control.