When Data Governance Meets Data Engineering: Optimizing Microsoft Purview with SparkLin for Automated Lineage

Tom Van Breuseghem

21 Jan 25

5min read

Data Engineering

Data Governance

In the fast-evolving world of data, organizations are striving to unlock the full potential of their digital assets. A recent project with a leading healthcare organization exemplifies this journey.

Unlocking Data Potential with Microsoft Purview

Our client, a leader in its field, operated within a fragmented and complex data landscape shaped by years of custom application development. As part of their transition to the cloud, they chose Microsoft Purview as their go-to Data Governance tool. Its robust capabilities in building data catalogs, maintaining business glossaries, and assigning clear roles like data stewards and owners provided a strong foundation for systematic governance. This approach fostered collaboration, created a shared understanding of data assets, and ensured compliance in a highly regulated industry.

The Problem: Navigating Data Lineage Limitations

Despite these advantages, the client faced a critical limitation: Purview's inability to automate data lineage for Spark code running in Synapse Analytics notebooks. As Purview only supports lineage for Synapse's 'Copy data' and 'Data Flow' capabilities, the lack of automated lineage left significant portions of the data lifecycle undocumented.

Data lineage is more than a technical feature; it is a cornerstone for understanding how data flows, transforms and integrates across systems. For organizations managing data products in the cloud, lineage is crucial for troubleshooting, maintaining data quality and adapting to evolving business needs. In an era where AI and advanced analytics drive innovation, the ability to trace and trust data pipelines has become indispensable. For this client, closing the gap in automated lineage was not optional; it was a necessity.

Bridging the Gap with SparkLin - Datashift's approach

Recognizing the importance of comprehensive data lineage, we sought a solution to complement Purview's capabilities. Enter SparkLin, an open-source solution specifically designed to capture and visualize end-to-end data lineage in Spark environments like Azure Synapse.

By integrating with Microsoft Purview, SparkLin delivers detailed lineage insights that enhance governance, transparency and audit readiness. This functionality is critical for impact analysis, debugging, and diagnostics, ensuring robust, compliant data operations. SparkLin fills Purview's functionality gap by ensuring that all data transformations are accurately tracked and documented.

This project exemplifies Datashift's collaborative approach, where data governance experts and data engineers work hand in hand to deliver the best outcomes. By refining and updating SparkLin's codebase, we ensured seamless integration into the client's data governance framework, resulting in a robust and efficient solution.

How SparkLin Works: Tapping into Azure's Ecosystem

SparkLin leverages several Azure components to extract and process data lineage from Synapse Spark notebooks. The process begins with the OpenLineage Spark listener, which is initialized on the Synapse Spark cluster and captures lineage data at runtime. The listener emits this data as JSON and sends it to the HTTP endpoint of the first Function App. That Function App analyses the JSON payload and stores it in a blob container if it contains relevant lineage information.
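To make the filtering step concrete, here is a minimal sketch of the decision the first Function App has to make. It is not SparkLin's actual code: the relevance rule (a completed run with at least one input and one output), the blob naming scheme and the function names are assumptions for illustration, and a plain dictionary stands in for the blob container. The field names `eventType`, `run.runId`, `job.name`, `inputs` and `outputs` come from the OpenLineage event format.

```python
import json
from typing import Optional


def is_relevant(event: dict) -> bool:
    """Assumed rule: keep completed runs that name at least one input and one output."""
    return (
        event.get("eventType") == "COMPLETE"
        and bool(event.get("inputs"))
        and bool(event.get("outputs"))
    )


def blob_name_for(event: dict) -> str:
    """Hypothetical naming scheme: one folder per job, one file per run."""
    job = event["job"]["name"].replace("/", "_")
    return f"{job}/{event['run']['runId']}.json"


def handle_event(body: str, store: dict) -> Optional[str]:
    """Parse a posted event and keep it only if it is relevant.

    `store` is a stand-in for the blob container; returns the blob name,
    or None when the event is dropped.
    """
    event = json.loads(body)
    if not is_relevant(event):
        return None
    name = blob_name_for(event)
    store[name] = body
    return name
```

In a real deployment, `handle_event` would sit behind the HTTP trigger and write to Blob Storage instead of a dictionary. Filtering this early keeps start events and runs without datasets out of the container, so the downstream parser only sees payloads it can use.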

Once the payload is stored, a second Function App is triggered; it runs the processing workflow whenever new blobs are detected. At this stage, SparkLin's core parser converts the lineage information into a format compatible with Microsoft Purview. Finally, the data is pushed to both Azure Table Storage, where it can be queried for deeper analysis, and Purview, where it can be visualized directly.
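The conversion step can be pictured as a mapping from an OpenLineage event to an Apache Atlas style entity, the model Purview's catalog is built on. The sketch below is an illustration under assumptions, not SparkLin's parser: the `sparklin://` qualified-name prefix, the generic `Process` and `DataSet` type names and the helper names are hypothetical, and a real integration would use whichever types and naming rules match the assets already registered in Purview.

```python
def dataset_ref(dataset: dict, type_name: str = "DataSet") -> dict:
    """Reference an existing catalog asset by its qualified name (assumed format)."""
    qualified_name = f"{dataset['namespace'].rstrip('/')}/{dataset['name'].lstrip('/')}"
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def to_process_entity(event: dict) -> dict:
    """Map one OpenLineage event to a process entity linking inputs to outputs."""
    job = event["job"]
    return {
        "typeName": "Process",  # assumption: generic Atlas process type
        "attributes": {
            # Hypothetical naming convention for the notebook run's process.
            "qualifiedName": f"sparklin://{job['namespace']}/{job['name']}",
            "name": job["name"],
            "inputs": [dataset_ref(d) for d in event.get("inputs", [])],
            "outputs": [dataset_ref(d) for d in event.get("outputs", [])],
        },
    }
```

Because inputs and outputs are referenced by qualified name rather than recreated, the lineage edge attaches to assets Purview already knows about, provided the names match exactly.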

SparkLin meets Purview: A Unified Data Lineage Solution

By integrating SparkLin with Purview, we created a comprehensive data governance framework for our client. SparkLin's automated lineage bridged the critical gap left by Purview's missing support for Spark notebooks. This integration delivers clear, visual insights into data pipelines, including transformations, derived columns and join conditions.
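As an illustration of what column-level insight looks like in the raw data, the sketch below reads the `columnLineage` facet that OpenLineage defines on output datasets and flattens it into source-to-target column pairs. Whether SparkLin relies on this facet or derives column mappings another way is not stated in this article, so treat the approach and the function name `column_edges` as assumptions.

```python
from typing import List, Tuple


def column_edges(event: dict) -> List[Tuple[str, str]]:
    """Return (source_column, target_column) pairs found in an OpenLineage event."""
    edges = []
    for output in event.get("outputs", []):
        fields = (
            output.get("facets", {})
            .get("columnLineage", {})
            .get("fields", {})
        )
        for target_column, spec in fields.items():
            for source in spec.get("inputFields", []):
                edges.append(
                    (
                        f"{source['name']}.{source['field']}",
                        f"{output['name']}.{target_column}",
                    )
                )
    return edges
```

A derived column such as a full name built from two source columns would show up here as two pairs pointing at the same target column.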

The enhanced transparency allows teams to debug, analyze, and manage their data pipelines efficiently. Moreover, the client can now confidently roll out new data products, navigating their previously fragmented data landscape with greater clarity. This integration has accelerated their journey towards data maturity, enabling them to innovate with trust and efficiency.

Why Datashift?

At Datashift, we bring together the best of both worlds: deep expertise in data governance and hands-on experience in data engineering. This unique combination allows us to tackle complex challenges, whether it's optimizing tools like Microsoft Purview or building custom solutions like SparkLin. For organizations looking to unlock the true potential of their data, we are the partner who delivers both strategy and execution.

Interested in how Datashift can help transform your data landscape? Visit datashift.eu for more insights and success stories.

SparkLin reference: https://techcommunity.microsoft.com/blog/microsoftsecurityandcompliance/end-to-end-data-lineage-from-spark-big-data-environment/3721358
