Enhancing data validation processes with Microsoft Fabric
12 December 2023
Most businesses we encounter today are very much aware of the potential of their data. But for any organization looking to extract value from its data, one condition cannot be overlooked: without timely and well-scoped data validation, leveraging data to gain insights and support decisions or innovations is virtually impossible. That’s precisely where event-driven data processing and Microsoft Fabric come in.
Understanding data validation
To begin with, it is essential to understand that data validation has many facets.
We need data to be accurate, ensuring it’s error-free and adheres to the expected format. It's not just a matter of having correct data but, more importantly, of knowing that the data is reliable and can be used confidently for analytics and decision-making.
We need data to be complete, verifying that all necessary data is included in the dataset. Since incomplete data can lead to false insights and wrong decisions, it is essential that no significant information is missing.
We need data to be consistent, maintaining data uniformity over time. Data should conform to standard formats and definitions to remain reliable and comparable across time periods and segments.
And while there are several other aspects to data validation, these 3 concepts are among the most important.
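To make these concepts tangible, here is a minimal sketch of what such checks could look like in PySpark. The column names, timestamp format, and allowed values are purely illustrative and would need to be adapted to your own dataset.

```python
# Minimal sketch of accuracy, completeness, and consistency checks in PySpark.
# Column names and rules are illustrative; adapt them to your own dataset.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.createDataFrame(
    [
        ("e1", "u1", "page_view", "2023-12-01T10:00:00"),
        ("e2", None, "page_view", "2023-12-01T10:01:00"),  # missing user_id
        ("e3", "u2", "pageview",  "2023-12-01T10:02:00"),  # inconsistent event_type
    ],
    ["event_id", "user_id", "event_type", "event_timestamp"],
)

# Accuracy: timestamps must parse into the expected ISO-8601 format.
accuracy_issues = events.filter(
    F.to_timestamp("event_timestamp", "yyyy-MM-dd'T'HH:mm:ss").isNull()
)

# Completeness: key attributes must not be missing.
completeness_issues = events.filter(
    F.col("event_id").isNull() | F.col("user_id").isNull()
)

# Consistency: event types must come from an agreed-upon list.
allowed_event_types = ["page_view", "click", "purchase"]
consistency_issues = events.filter(~F.col("event_type").isin(allowed_event_types))

print(accuracy_issues.count(), completeness_issues.count(), consistency_issues.count())
```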
Understanding event-driven data processing
Like data validation, event-driven data processing is a broad topic that can similarly be boiled down to some key concepts.
Event notification is the concept whereby systems inform other systems of events without expecting a response. It's an effective way to ensure that different systems or components within an architecture are updated about relevant events in real-time.
Event-driven data processing builds on event-carried state transfer, meaning that events carry the complete state change, so consumers never need to request additional information. Event-carried state transfer reduces the overhead on systems and ensures that each system has everything it needs to process the event.
Event sourcing, using events as the primary source of truth for data, enables data reconstruction and comprehensive audit trails. Therefore, it’s a powerful method for understanding the sequence of events that led to a particular state in a system.
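To make the distinction concrete, here is a small, purely illustrative Python sketch: each event carries the complete state change (event-carried state transfer), and replaying the event log reconstructs the current state while preserving the audit trail (event sourcing). The event fields are hypothetical.

```python
# Hypothetical sketch: events carry the complete state change, and the
# current state can be rebuilt by replaying the event log in order.
from dataclasses import dataclass

@dataclass
class ProfileUpdated:
    user_id: str
    # Event-carried state transfer: the full new state travels with the event,
    # so consumers never have to call back for additional details.
    email: str
    consent_marketing: bool
    occurred_at: str  # ISO-8601 timestamp

event_log = [
    ProfileUpdated("u1", "a@example.com", False, "2023-12-01T10:00:00"),
    ProfileUpdated("u1", "a@example.com", True,  "2023-12-02T09:30:00"),
]

# Event sourcing: the log is the source of truth; replaying it yields the
# latest state per user plus a full audit trail of how that state came to be.
current_state = {}
for event in sorted(event_log, key=lambda e: e.occurred_at):
    current_state[event.user_id] = event

print(current_state["u1"].consent_marketing)  # True
```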
Deploying data validation processes with Microsoft Fabric
A while ago, we came across the need to apply those concepts at one of our clients who offers services through web and app platforms. Due to data quality issues, our client was confronted with missing data in downstream data models and inaccurate personalization across platforms. Moreover, solving these problems retroactively proved to be quite a challenge (especially since product development and tracking data collection were outsourced).
To build an environment to address those challenges, we identified three primary needs:
[1] Minimize the time between a data quality issue and its detection,
[2] Have a visual history of data quality issues that occurred over a given period,
[3] Send out alerts whenever a potential data quality issue is detected.
We then mapped those needs to Microsoft Fabric components:
[1] We used Azure Event Hubs for sourcing events and a Spark job in Microsoft Fabric to stream the data into OneLake in Delta format. Leveraging Apache Spark Structured Streaming, we achieved near-real-time data movement into OneLake, keeping the time between the occurrence of an issue and its detection as short as possible (a minimal ingestion sketch follows after this list).
[2] We used a semantic model in Microsoft Fabric as the source for a visual report of data issues over time in Power BI. Direct Lake mode, a groundbreaking semantic model capability, combined with automatic page refresh, enabled near-real-time observability of data and data issues in Power BI (see the aggregation sketch after this list).
[3] Reflex in Microsoft Fabric, which integrates seamlessly with Microsoft Teams, allowed us to send timely alerts whenever a potential data quality issue was detected.
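For [1], a minimal sketch of such a streaming ingestion job could look like the following. It reads from the Kafka-compatible endpoint of Azure Event Hubs and appends to a Delta table in OneLake; the namespace, event hub name, connection string, event schema, table name, and checkpoint path are placeholders, not our client's actual configuration.

```python
# Sketch of a Structured Streaming job that reads events from Azure Event Hubs
# (via its Kafka-compatible endpoint) and appends them to a Delta table in
# OneLake. All names, secrets, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

EH_NAMESPACE = "<your-namespace>.servicebus.windows.net:9093"
EH_NAME = "<your-event-hub>"
EH_CONNECTION_STRING = "<your-connection-string>"  # e.g. retrieved from a Key Vault

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_timestamp", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", EH_NAMESPACE)
    .option("subscribe", EH_NAME)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        f'username="$ConnectionString" password="{EH_CONNECTION_STRING}";',
    )
    .load()
)

# Parse the JSON payload into typed columns.
events = raw.select(
    from_json(col("value").cast("string"), event_schema).alias("payload")
).select("payload.*")

# Append to a Delta table in the lakehouse; the checkpoint makes the stream
# restartable without duplicating data.
query = (
    events.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "Files/checkpoints/raw_events")
    .toTable("raw_events")
)
query.awaitTermination()
```

Using the Kafka endpoint keeps the job on standard Structured Streaming options; a dedicated Event Hubs connector would work just as well.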
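For [2], the report needs a table of data quality issues over time. The sketch below shows one way such a table could be derived from the raw events and written to Delta so that a Direct Lake semantic model (and thus the Power BI report) can pick it up; again, the table names, rules, and time window are hypothetical.

```python
# Sketch of a step that turns rule violations into an "issues over time"
# Delta table for the Direct Lake semantic model. Names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.table("raw_events")

issues = (
    events
    .withColumn(
        "issue_type",
        F.when(F.col("user_id").isNull(), F.lit("missing_user_id"))
         .when(~F.col("event_type").isin("page_view", "click", "purchase"),
               F.lit("unknown_event_type")),
    )
    .filter(F.col("issue_type").isNotNull())
    .groupBy(F.window("event_timestamp", "5 minutes"), "issue_type")
    .agg(F.count("*").alias("issue_count"))
    .select(
        F.col("window.start").alias("window_start"),
        "issue_type",
        "issue_count",
    )
)

# Overwrite keeps the reporting table small and idempotent; the Power BI report
# on top of the Direct Lake semantic model reads the refreshed data directly.
issues.write.format("delta").mode("overwrite").saveAsTable("data_quality_issues")
```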
How event-driven data processing can help you minimize the impact of data quality issues
Let's look at how things would work for you in the type of environment we propose here.
Once notified of a potential data quality issue, you can immediately navigate to the Power BI report to evaluate its impact. Using the Power BI visualizations, you can assess whether the problem is short-lived, related to a specific device class, has a broader impact, etc.
Now that you understand the impact of the data quality issue, you can inform the responsible product team with a clear visual narrative, including concrete examples or scenarios that the team can test themselves. If necessary, you can also reach out proactively to business end-users and outline the impact of the underlying issue and the following steps to be taken.
It is then up to the product team to provide feedback and a timeline for resolving the data quality issue. Once that has been done, you can easily verify whether the issue has been fixed (again documenting the current situation with a Power BI visualization).
What experience has taught us
But how does this actually work in the real world? We have been running a similar solution at a client for about 6 months, so we are well-positioned to share a few things we have learned.
Align all stakeholders
For all the cool technology, achieving impact remains as challenging as ever. Everything hinges on your capacity to effectively communicate and manage change. So, if you want to achieve impact, just like we did at our client, map out the entire data validation process from issue detection to integration and align it with all stakeholders. Detection of data quality issues is useless if there is a delayed or missing response.
Account for the needs of all teams
That being said, data quality challenges are real and cannot always be proactively avoided upstream. Even if more and more companies adhere to the principles of Data Mesh, complete vertical integration of (purely) domain-oriented data teams is not always achievable in practice. Different teams may have very different requirements or even altogether different concepts of data quality, and you should take that into account from the beginning.
Create speaking Power BI visualizations
Setting up near-real-time data observability in Power BI is now easier than ever by combining Spark, OneLake, and Direct Lake in Microsoft Fabric. The added value lies in being able to take prompt action when an issue is detected and to provide a visual outline of the potential issue scenario, provided you design your Power BI visualizations so that the impact of a potential issue can be assessed quickly (focusing on critical attributes, the required level of granularity, and so forth).
Let’s get in touch
At our client, we’ve seen that well-implemented data validation processes nurture trust between all involved parties. Whereas previously it was the business end-users who noticed missing data in the BI dashboards, the roles have shifted: the data team now takes proactive steps. As a result, the time between detection and resolution of data quality issues is significantly reduced, and less data is lost.
Are you eager to learn more about how we can help you achieve impact by implementing data validation processes? Get in touch.