When Data Governance Meets Data Engineering: Optimizing Microsoft Purview with SparkLin for Automated Lineage
22 January 2025
In the fast-evolving world of data, organizations are striving to unlock the full potential of their digital assets. A recent project with a leading healthcare organization exemplifies this journey.
Unlocking Data Potential with Microsoft Purview
Our client, a leader in its field, operated within a fragmented and complex data landscape shaped by years of custom application development. As part of their transition to the cloud, they chose Microsoft Purview as their go-to Data Governance tool. Its robust capabilities in building data catalogs, maintaining business glossaries, and assigning clear roles like data stewards and owners provided a strong foundation for systematic governance. This approach fostered collaboration, created a shared understanding of data assets, and ensured compliance in a highly regulated industry.
The Problem: Navigating Data Lineage Limitations
Despite these advantages, the client faced a critical limitation: Purview cannot automatically capture data lineage for Spark code running in Synapse Analytics notebooks. Because Purview supports automated lineage only for Synapse’s ‘Copy data’ and ‘Data Flow’ capabilities, significant portions of the data lifecycle remained undocumented.
Data lineage is more than a technical feature; it is a cornerstone for understanding how data flows, transforms and integrates across systems. For organizations managing data products in the cloud, lineage is crucial for troubleshooting, maintaining data quality and adapting to evolving business needs. In an era where AI and advanced analytics drive innovation, the ability to trace and trust data pipelines has become indispensable. For this client, closing the gap in automated lineage was not optional; it was a necessity.
Bridging the Gap with SparkLin – Datashift’s approach
Recognizing the importance of comprehensive data lineage, we sought a solution to complement Purview’s capabilities. Enter SparkLin, an open-source solution specifically designed to capture and visualize end-to-end data lineage in Spark environments like Azure Synapse.
By integrating with Microsoft Purview, SparkLin delivers detailed lineage insights that enhance governance, transparency and audit readiness. This functionality is critical for impact analysis, debugging, and diagnostics, ensuring robust, compliant data operations. SparkLin fills Purview’s functionality gap by ensuring that all data transformations are accurately tracked and documented.
This project exemplifies Datashift’s collaborative approach, where data governance experts and data engineers work hand in hand to deliver the best outcomes. By refining and updating SparkLin’s codebase, we ensured seamless integration into the client’s data governance framework, resulting in a robust and efficient solution.
How SparkLin Works: Tapping into Azure’s Ecosystem
SparkLin uses several Azure components to extract and process data lineage from Synapse Spark notebooks. The process begins with the OpenLineage Spark listener, which is registered on the Synapse Spark pool and captures lineage events at runtime. Each event is emitted as JSON and sent to the HTTP endpoint of a first Function App. This Function App inspects the JSON payload and stores it in a blob container only if it contains relevant lineage information.
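The filtering step of the first Function App can be sketched as follows. This is a minimal illustration, not SparkLin's actual code: the relevance rule (at least one input and one output dataset), the blob naming scheme, and the helper names `is_relevant`, `blob_name_for` and `handle_request` are all assumptions, and a plain dictionary stands in for the blob container. The field names `eventType`, `run.runId`, `inputs` and `outputs` follow the OpenLineage event format.

```python
import json


def is_relevant(event: dict) -> bool:
    """Assumed rule: keep only events naming at least one input and one output dataset."""
    return bool(event.get("inputs")) and bool(event.get("outputs"))


def blob_name_for(event: dict) -> str:
    """Hypothetical naming scheme: one blob per run and event type."""
    run_id = event.get("run", {}).get("runId", "unknown-run")
    event_type = event.get("eventType", "UNKNOWN")
    return f"{run_id}/{event_type}.json"


def handle_request(body: bytes, store: dict) -> int:
    """Process one HTTP request body and return an HTTP-style status code.

    `store` is a stand-in for the blob container (blob name -> raw payload).
    """
    try:
        event = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400
    if not isinstance(event, dict):
        return 400
    if not is_relevant(event):
        return 204  # accepted, but nothing worth storing
    store[blob_name_for(event)] = body
    return 201
```

In a real deployment the dictionary would be replaced by a write to Azure Blob Storage, and the relevance rule would follow whatever SparkLin's function actually checks.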
Once a payload is stored, the new blob triggers a second Function App, which runs the processing workflow. At this stage, SparkLin’s core parser converts the lineage information into a format compatible with Microsoft Purview. Finally, the result is pushed to both Azure Table Storage and Purview, so it can be queried for deeper analysis or visualized directly in Purview.
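To make the conversion step concrete, the sketch below maps one OpenLineage event to an Apache Atlas style process entity, the entity model that Purview's catalog API is built on. It is an illustration under stated assumptions, not SparkLin's parser: the function names `dataset_ref` and `to_atlas_process` are hypothetical, the generic `Process` and `DataSet` type names stand in for whatever types the real parser registers, and the qualified-name scheme is invented for the example.

```python
def dataset_ref(dataset: dict) -> dict:
    """Reference a dataset by qualified name (assumed scheme: namespace + '/' + name)."""
    namespace = dataset["namespace"].rstrip("/")
    name = dataset["name"].lstrip("/")
    return {
        "typeName": "DataSet",  # assumption: generic Atlas type
        "uniqueAttributes": {"qualifiedName": f"{namespace}/{name}"},
    }


def to_atlas_process(event: dict) -> dict:
    """Build an Atlas-style process entity linking the event's inputs to its outputs."""
    job = event["job"]
    return {
        "typeName": "Process",  # assumption: generic Atlas type
        "attributes": {
            "qualifiedName": f"{job['namespace']}/{job['name']}",
            "name": job["name"],
            "inputs": [dataset_ref(d) for d in event.get("inputs", [])],
            "outputs": [dataset_ref(d) for d in event.get("outputs", [])],
        },
    }
```

Purview draws lineage from a process entity's `inputs` and `outputs`, which is why the sketch keeps those two lists explicit. Column-level details such as derived columns and join conditions would need additional attributes that this example does not model.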
SparkLin meets Purview: A Unified Data Lineage Solution
By integrating SparkLin with Purview, we created a comprehensive data governance framework for our client. SparkLin’s automated lineage capture closed a critical gap. The integration delivers clear, visual insights into data pipelines, including transformations, derived columns and join conditions.
The enhanced transparency allows teams to debug, analyze, and manage their data pipelines efficiently. Moreover, the client can now confidently roll out new data products, navigating their previously fragmented data landscape with greater clarity. This integration has accelerated their journey towards data maturity, enabling them to innovate with trust and efficiency.
Why Datashift?
At Datashift, we bring together the best of both worlds: deep expertise in data governance and hands-on experience in data engineering. This unique combination allows us to tackle complex challenges, whether it’s optimizing tools like Microsoft Purview or building custom solutions like SparkLin. For organizations looking to unlock the true potential of their data, we are the partner who delivers both strategy and execution.
Interested in how Datashift can help transform your data landscape? Visit datashift.eu for more insights and success stories.
SparkLin reference: https://techcommunity.microsoft.com/blog/microsoftsecurityandcompliance/end-to-end-data-lineage-from-spark-big-data-environment/3721358