
Using Workflows to Reliably Work with Data and AI

December 5, 2024
ML 201 & AI, Data strategy, Data literacy

Just like in many other areas, GenAI has great potential to accelerate and improve how we work with data. However, it also comes with new challenges. In addition to the typical governance and security issues, not everything AI creates is reliable and not every insight generated by AI can be trusted.

Working with AI is like adding a new colleague to the team, a colleague who seems to know it all, is super productive, but one whose work we still need to supervise and occasionally correct. This is very similar to our development team, by the way. When they write code with Copilot support, we count on them to validate everything that was written by an AI. But how does our data science team validate what their new AI colleague contributed?

In data science, collaboration has been an issue for many years, even without AI, but it has largely been accepted as a necessary evil. The existing toolsets date back to the historical origins of data science in computer science and math; they are still mostly incompatible with one another and make collaboration very difficult.

How can we make sure everybody who works with data can collaborate with their colleagues and now also benefit from the massive opportunities AI presents? At the same time, how do we ensure what is created is reliable, trustworthy, and adheres to corporate governance requirements? It is time to admit that data workers need their own environment instead of borrowing one from other disciplines.

Workflows: A common language

Since data and AI experts come from diverse backgrounds, they often bring along their favorite tool – and that tool is often a programming language: SQL for data engineers, Python for data scientists, R for statisticians – to name just the three most prominent examples. And let us not forget all those Excel Macro wizards out there.

All of these languages have tried to embrace the other disciplines, but for a data science practice that truly covers the entire spectrum of “data stuff”, people must master all of them and then some. Now, with the rise of GenAI, we are also throwing in-house and cloud AI solutions into that toolbox. We need to start using tools that are actually designed for this type of task – tools that support collaborative work with data and AI.

It may help to briefly pause and contemplate what a data worker actually needs to know and control to do their work. Do they really need to understand in which programming language one of the tools they use is implemented? Do they need to dive into the actual implementation? In most cases, they don’t. Using coding as a tool to perform data science often boils down to lining up a series of library calls – but is this truly necessary? Why do we limit them to libraries that are available in only one language? And why would we force them to code in a programming language that was designed for different purposes?

The answer is: we don’t. 

Working with data and AI is still a programming activity – but it should not need to be a coding activity. There’s a more suitable way to define a chain of library calls or calls to a GenAI API: allow people to directly manipulate the flow of the data through those tools, abstracting away the actual implementation (and the underlying language or AI service) under the hood.

Take a very simple example: Let’s say I want to connect to a database, read a table, combine it with data from an Excel file, and then build a regression model. Do I really need to worry about how the SQL code for the database is structured and which library is used to read that Excel file? The Excel reader library is likely written in a different language than the regression learner algorithm, so I would also need to interface the database output with the data representations of those two languages. But I don’t care about those details. What I do care about is what a regression analysis actually does – otherwise I won’t be able to interpret the coefficients or use the resulting regression model properly.
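To make that concrete, here is a minimal sketch of the same task written as a chain of library calls in Python. The connection string, table, file, and column names are made up for illustration; the point is how many implementation details the data worker has to manage before ever getting to the regression they actually care about.

import pandas as pd
import sqlalchemy
from sklearn.linear_model import LinearRegression

# Connect to the database and read a table
# (driver, credentials, SQL dialect all leak into my analysis)
engine = sqlalchemy.create_engine("postgresql://user:password@host/sales_db")
orders = pd.read_sql_table("orders", engine)

# Read the Excel file (a separate reader library with its own quirks)
regions = pd.read_excel("regions.xlsx")

# Combine the two sources (join keys and types must line up across formats)
data = orders.merge(regions, on="region_id", how="inner")

# Finally, the part I actually care about: the regression
X = data[["ad_spend", "store_count"]]
y = data["revenue"]
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)

In a workflow, each of these steps becomes a configurable building block, and none of the connection, format, or data-representation details needs to leak into my view of the analysis.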

Or, to put it briefly: Data workers need to know what their tools are doing – not how they are doing it. And, quite frankly, that’s complicated enough. There are tons of different ways to process and analyze data out there. New interfaces for AI services pop up on a daily basis. Forcing our data teams to worry about implementation and interface details will bias them towards the tools that are available in their favorite language rather than the best tool for the data task at hand.

If we abstract all those different languages and their libraries up to the level of the dataflow, we are doing what computer scientists have done for decades: We abstract away the details that don’t matter and expose only the pieces that need to be controlled to the programmer. Because, in the end, those workflows are also programs. A good programming environment allows programmers to focus on what their job is and abstracts away the – for their job unnecessary – details. This way, workflows turn into the common language for me and my team to collaborate with AI and, together, build complex data workflows.
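As an illustration only (this is not any particular tool’s format, and the node names and configuration keys are invented), the same analysis expressed at the dataflow level reads like a small declarative program: a handful of named steps and the connections between them, with the languages, drivers, and data representations hidden inside each step.

# Hypothetical dataflow description of the same analysis;
# node names and configuration keys are illustrative, not taken from a real tool.
workflow = {
    "nodes": {
        "db_reader":    {"type": "Database Reader",    "config": {"table": "orders"}},
        "excel_reader": {"type": "Excel Reader",       "config": {"file": "regions.xlsx"}},
        "joiner":       {"type": "Joiner",             "config": {"key": "region_id"}},
        "regression":   {"type": "Regression Learner", "config": {"target": "revenue"}},
    },
    "connections": [
        ("db_reader", "joiner"),
        ("excel_reader", "joiner"),
        ("joiner", "regression"),
    ],
}

What the data worker edits and reads is this structure – which sources feed into which steps – not the code that implements each step.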

Workflows: A common standard

Working at the appropriate level of abstraction by using workflows has other advantages, too. We can use our workflows to collaborate easily with our different experts, and we can go back to a workflow later to explain to others what’s happening to the data and how we came up with those results.

With AI, this aspect has become even more important. If I can delegate parts of my work to an AI, I, just like any other programmer out there, am well advised to carefully check what the AI built. Does it really do what I asked it to do, or did it find a “creative” but totally wrong solution? If AI builds a part of a workflow, it’s much easier for me to understand and validate what was suggested and quickly identify flaws. 

And as a nice side effect, workflows also serve as a great basis for building a repository of knowledge that others can use as blueprints for their own work and don’t need to start from scratch. Once AI learns from this workflow repository, it will actively leverage community wisdom to suggest appropriate solutions, and maybe here and there also point to atypical parts in my workflows that look “odd”. Maybe I was just truly creative there, but more likely, I was doing something wrong.

The transparency of the workflow representation has other benefits as well. Even before the arrival of GenAI, workflows were often used for documentation and even auditing purposes. How exactly was this financial report created? How did the decision to reject this loan application come about? Why was the production process altered four months ago, and what can we do to avoid repeating that mistake? These and other auditing questions can be addressed well using workflows as the transparent documentation of what was done to the data.

This validation aspect of workflows becomes even more important when we let AI produce the output entirely.

If we have an AI system that generates insights from data directly, in many instances we will need to be able to understand how those insights were actually created. Trusting a black box AI system that’s prone to hallucination is not a good idea when we ask it to produce quarterly tax statements, or to forecast how much revenue a business needs to stay afloat.

Using workflows, AI can actually show its reasoning process, using the workflow to explain how it came to those conclusions and allowing us to validate them. This way, workflows turn into a standard for creating, documenting, and validating what was done to the data.

Workflows: Working reliably with data and AI

Working with data collaboratively, with colleagues and AI, requires the right environment so that everybody speaks the same language. Workflows provide the appropriate level of abstraction to allow us to focus on the complexity of working with data without being distracted by technical details that are not relevant for the actual data work. Workflows provide a transparent mechanism for collaboration and a trustworthy basis for documentation and auditing, making working with data and AI reliable.