How to Summarize EU Regulations and Automate PDF Reports with GenAI and KNIME

Professionals in legal, compliance, and regulatory fields often deal with lengthy and complex EU regulations. Reviewing these documents manually is time-consuming and difficult to scale. In this article, we show how to build KNIME workflow that uses large language models (LLMs) to summarize EU regulations and automatically distribute the insights in a well-formatted PDF report via email.

KNIME Analytics Platform is an open-source data science tool that allows you to build solutions using visual workflows, eliminating the need for coding expertise.

This blog series on Summarize with GenAI showcases a collection of KNIME workflows designed to access data from various sources (e.g., Box, Zendesk, Jira, Google Drive, etc.) and deliver concise, actionable summaries.

Here is a 1-minute video that gives you a quick overview of the workflow. You can download the example workflow here, to follow along as we go through the tutorial.

Let’s get started.

Automate EU regulation summarization and distribution

Our goal is to simplify the process of analyzing EU regulations by automating content extraction, summarization, and reporting. Thereby, enabling professionals to quickly access structured summaries of complex regulations without manually sifting through lengthy texts.

We can do this in three steps:

Access and parse EU regulations as PDF files
Check texts and summarize them using an LLM
Compile summaries into a PDF report and send it via email

This provides professionals with an efficient way to stay informed about relevant regulatory changes, without manually reviewing long documents.

This workflow automates the summarization of EU regulations using KNIME

Step 1. Access data: Import and chunk lengthy regulation texts

In the first section, the workflow accesses EU regulations, parses the content and splits the text into smaller chunks.

We begin by accessing and parsing two EU regulations (2015/2120 and 2022/612) on roaming on public mobile communication networks stored as PDF files. The Tika Parser node reads in and converts the PDF files into machine-readable text, extracting both the document content and metadata.

The “Extract Title” metanode isolates regulation names from the file paths and converts the text into a Document-type object. In this way, we can inspect the full text of each regulation using the Document Viewer node and also easily search for specific terms (e.g., the word “Article”).

Viewing the EU Regulation 2015/2120 on roaming on public mobile communication networks and searching a specific term with the Document Viewer node.

Since the documents are long, we split them into smaller chunks to avoid exhausting the LLM’s context window.

The Text Chunker node divides the full text into overlapping segments, for example, 3,000 characters per chunk with 300 characters of overlap. Overlapping is important for maintaining context across boundaries and generating coherent summaries. Without overlaps, key details at chunk boundaries could be missed, leading to fragmented or misleading summaries.

Next, in the “Prepare Chunks” metanode, we format text chunks into numbered inputs for each document. Each chunk is labeled (e.g., “Chunk 1: 26.11.2015 EN Official Journal of the European Union L 310/1 I (Legislative acts) REGULATIONS…”.), ensuring each chunk is a self-contained message, ready for summarization in the next step.

Step 2. Prompt LLM: Summarize regulations with OpenAI’s GPT-3.5-turbo-16k

This section of the workflow prompts OpenAI’s GPT-3.5-turbo-16k in two stages to summarize EU regulations.

To summarize the regulations, we use the KNIME AI extension and select the most suitable LLM for the task, balancing both cost and performance. For example, OpenAI’s GPT-3.5-turbo-16k is a good option for its large context window that allows us to process longer texts and at low cost.

To set up the connection, follow these steps:

Input the OpenAI API key in the Credentials Configuration node
Authenticate to the service with the OpenAI Authenticator node
Use the OpenAI Chat Model Connector node to connect to the GPT-3.5-turbo-16k model

Despite the model choice, prompting an LLM with lengthy texts requires a smart workflow design to avoid exhausting the model’s context window. In particular, we use a two-stage summarization approach:

Stage 1: It generates summaries of the individual text chunks
Stage 2: It uses these summaries as context to generate a final, comprehensive regulation summary.

This method uses tokens efficiently and preserves context across the full document.

Stage 1: Summarize individual text chunks

We create a parameterized prompt containing the regulation name and the overlapping text chunks, and use the LLM Prompter node to query the model. Here’s the prompt we used:

join("\n\n", "You're a legal expert in EU law. Summarize the following chunks for each regulation.
There is some overlap between the chunks to not lose the context:", $["Title"],$["Chunks"])

By summarizing at chunk level first, we ensure that each output is considerably shorter and can be more easily processed in Stage 2.

Stage 2: Generate a comprehensive summary of each regulation

The second query uses the chunk-level summaries to create a final summary for each regulation. That is, we need to ensure that each chunk is not processed independently from the others; instead all chunk summaries should be fed to the model at once, preserving the context.

To do that, we use the Chat Model Prompter node, not to simulate a real chat, but to feed prior chunk summaries as “conversation history.” Practically, this means:

1. In the “Conversation settings”, select chunk summaries in the Message column.

2. In the “Configuration settings”, specify the roles in the interaction (human or AI). To do that, drag & drop the Constant Value Column node and add a column named “Role” with the value "AI", identifying AI as the source of the chucks

3. Add your final summarization prompts in the “System Message” and “New message” fields.

The configuration window of the Chat Model Prompter node.

New message: it provides direct instructions for the final summarization:

I am giving you all the summaries from two different EU regulations on roaming public mobile communication networks. Take them and produce an exhaustive summary for each. Next, for each regulation, extract the main five topics as bullet points and reference article numbers, if pertinent.

System Message: it is similar to a prompt but it’s typically the first message given to the model describing how it should behave, assigning it a persona and clarifying the task (see best practices for prompt engineering):

You are a senior legal expert on European Law. You receive chunks of different EU regulations on roaming on public mobile communication networks as input and are expected to summarize the regulations.

The Chat Model Prompter outputs a new row with the summary and “AI” as the source of it.

The output of the Chat Model Prompter is then post-processed to retain only the AI-generated summaries of the regulations.

Step 3. Deploy results: Create a PDF report and distribute via e-mail

This part of the workflow formats the generated summary into a professional PDF report and distributes it automatically per e-mail.

The final stage involves compiling the summaries into a report and distributing it automatically per e-mail.

To create a static and properly formatted PDF report, we use the KNIME Reporting Extension:

Use the Report Template Creator node to define page size and orientation
In the “Regulation Summaries” component, display summaries using Table View and add hyperlinks to the full regulations for deeper reading
Include headings and design a friendly layout to improve readability.

To distribute the report automatically via email:

Use the Email Connector and Email Sender nodes
Enable the blue port of the Email Sender node to attach the generated PDF report and send it to the desired recipients.

The result: Key insights from EU regulations delivered in your inbox

The final output is a professional and clearly structured PDF report containing concise summaries of the EU regulations, as well as numbered lists of the main topics.

This process ensures fast document overviews, automates regulatory analysis and reduces time spent on manual review.

A preview of the PDF report with the summaries of the EU regulations.

GenAI for Summarization in KNIME

In this article from the Summarize with GenAI series, we explored how KNIME and Generative AI can automate the summarization and distribution of complex and lengthy EU regulations. With KNIME, you can integrate LLMs into visual workflows, without coding and streamline processes, reduce manual review and improve efficiency across industries.

You learned how to:

Access and parse EU regulations PDFs
Summarize them using an LLM
Deploy results into a static report and distribute it via email

Download KNIME Analytics Platform and try out the workflow yourself to see how KNIME can streamline your document summarization tasks.