How OpenLineage helps you shift left in your data governance

Abhishek Mishra
3 min read · Mar 15, 2024

Okay, that’s a mouthful for a title 😶
I hope the post does it justice. If you already know about OpenLineage and have implemented it, the title alone should be enough, I believe 😅
That’s like a tl;dr of a tl;dr, and that’s it.

But of course, you are welcome to stay on with me and explore what OpenLineage can do for you.

OpenLineage is… at its core, and in its entirety, a standard. It’s an API standard for shipping metadata so that you can build a good lineage graph.

How does it work?

It pretty much gives you a pattern to ship 🛳️ your lineage metadata from the source systems (especially the ones that perform transformations). Let’s look at some boxes and arrows.

High-Level Architecture Diagram of OpenLineage

As we can see, systems like Airflow, Spark, and dbt can ship metadata to a repository that can process and store OpenLineage API calls. Going just a bit deeper: each time there is a “run” 🏃🏼‍♂️ you can ship an event to the repository that essentially captures the inputs and the outputs (the before and after state of the transformation). Something like this,

Taken directly from the OpenAPI spec for OpenLineage
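For flavour, here’s a hand-written sketch of such a run event. The field names follow the OpenLineage spec, but the namespaces, job name, and dataset names below are made-up placeholders:

```python
import json
from datetime import datetime, timezone

# A minimal OpenLineage run event, sketched by hand.
# Field names follow the OpenLineage spec; the namespace, job,
# and dataset names are illustrative placeholders.
event = {
    "eventType": "COMPLETE",  # one of START / RUNNING / COMPLETE / ABORT / FAIL / OTHER
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "my-namespace", "name": "daily_orders_transform"},
    "inputs": [{"namespace": "postgres://prod", "name": "public.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "analytics.daily_orders"}],
    "producer": "https://example.com/my-pipeline",
}

print(json.dumps(event, indent=2))
```

The key idea: every run reports its inputs and outputs, and the lineage graph falls out of stitching those edges together across runs.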

You can store the shipped metadata in a repository (Marquez being the reference implementation) and later query it via API endpoints to build lineage graphs, answer questions about lineage, and so on.
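As a rough sketch of what querying looks like: Marquez exposes a lineage endpoint that takes a dataset node ID. The URL shape below is my reading of the Marquez HTTP API, and the host, port, and dataset names are assumptions; here we only build the URL rather than calling a live server:

```python
from urllib.parse import urlencode

# Assumed: a Marquez instance running locally on its default port.
MARQUEZ_URL = "http://localhost:5000"

def lineage_url(namespace: str, dataset: str, depth: int = 2) -> str:
    """Build the lineage query URL for a dataset node.

    Marquez identifies graph nodes with IDs like
    'dataset:<namespace>:<name>'; depth controls how far
    upstream/downstream the returned graph extends.
    """
    node_id = f"dataset:{namespace}:{dataset}"
    query = urlencode({"nodeId": node_id, "depth": depth})
    return f"{MARQUEZ_URL}/api/v1/lineage?{query}"

print(lineage_url("warehouse", "analytics.daily_orders"))
```

You would then GET that URL (with `requests`, `urllib.request`, or anything else) and walk the returned graph of job and dataset nodes.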

So why is this cool or why does it matter?

The obvious

Of course, OpenLineage helps you ship lineage metadata in a standardised fashion, which means you can run parsing, analytics, and graph generation much more easily.
Without a common standard, Airflow, Spark, and dbt would each send you metadata in potentially different ways, adding the extra burden of writing translators and adapters.
Add more source systems to the mix and the problem compounds further 🤯

The above is OpenLineage’s claim to fame. And that should be a bit obvious.

And now 🥁🥁

The not so obvious

The part that isn’t extremely obvious, yet sits right in front of us, is that OpenLineage enables you to ship lineage metadata from the source, near the time of the actual transformation. This means a couple of things,

1- The metadata is super fresh, always
You have the opportunity to build the lineage graph in near real time (if you use Kafka to ingest events, your backend can truly consume them in near real time, and OpenLineage supports this pattern)

2- The metadata is captured at the source ✅ and NOT from some downstream analytics system.
If there is anything that data mesh taught us, it is that shifting the burden of analytics closer to the domains is better, because context is not lost (which can happen if you reverse-engineer lineage AFTER the analytics has arrived). This is truly how you shift lineage building left, in the parlance.
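The Kafka pattern from point 1 can be sketched like this. The topic name and keying strategy are my own assumptions, and for brevity this only prepares the message bytes; a real setup would hand them to a producer such as kafka-python’s `KafkaProducer`:

```python
import json

# Assumed topic name for OpenLineage events.
TOPIC = "openlineage-events"

def to_kafka_message(event: dict) -> tuple[bytes, bytes]:
    """Serialize an OpenLineage event into a Kafka (key, value) pair.

    Keying by job identity keeps all runs of a job on the same
    partition, so a consumer sees each job's events in order.
    """
    key = f'{event["job"]["namespace"]}.{event["job"]["name"]}'.encode()
    value = json.dumps(event).encode()
    return key, value

key, value = to_kafka_message(
    {"eventType": "START",
     "job": {"namespace": "my-namespace", "name": "daily_orders_transform"}}
)
print(key)  # b'my-namespace.daily_orders_transform'
```

A backend consuming this topic can update the lineage graph as events arrive, rather than batch-rebuilding it later.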

There is of course more to OpenLineage than this, but I believe this crucial benefit will come to light as our data transformation journeys mature.

Abhishek Mishra

Product manager- building a home for data teams @ Atlan. Data & agile enthusiast. Ex- Thoughtworker. Wrote a book. Behind 9 products gone live!