Data Mesh - Beyond the buzz
29 September 2022
Chances are you have heard a lot about data mesh lately. The main idea behind the data mesh notion is to enable a decentralized approach to sharing, accessing, and managing analytical data.
The concept of a data mesh
The concept of a data mesh is based on four key principles:
- domain-oriented ownership, requiring business domain teams to take responsibility for their data,
- data as a product, adopting a product philosophy that considers the needs of data consumers in the broadest possible sense,
- self-serve data platform, embracing a platform approach to building data infrastructure,
- federated computational governance, enabling interoperability of all data products.
That's quite a mouthful, isn't it? So, let’s make sure this data mesh concept doesn’t give you a headache.
Making sense of what a data mesh stands for
Forget about all the buzzwords, and let's dive into a practical example to help you make sense of what a data mesh stands for.
At Datashift, we love music and sports. Our data mesh example, therefore, involves both. As it happens, we want to enable a music platform that can generate playlists for different sports activities (such as running, cycling, …). The problem is that such a music platform needs data about the kind of music sportspersons liked and played during their previous sports activities.
To get things going, we appointed two dedicated teams: a playlist team (responsible for delivering the playlists) and a partnerships team (focused on getting the most out of the sports app data and working closely with the app developer). So how would those teams handle the task ahead of them in cooperation with the central data & data governance teams?
Enabling a music platform in the traditional way
- The playlist team asks the central data team to build dedicated sports playlists.
- Once the data team has cleared its backlog, a data scientist contacts the data governance team and requests access rights to check the central data lake for relevant sports app data. The partnerships team must then get involved to help the data scientist understand the business context of this data.
- In the next step, the data scientist contacts data engineers to get the externally available sports app data into the data lake.
- The data engineers then collaborate with the partnerships team, spending time to understand the sports app data structures and APIs and building the pipelines between the sports app and the data lake. While doing this, they need to ensure that the sports app music IDs map onto the internal music IDs used in the data lake, among other application nuances.
- Finally, the data scientist can start training Machine Learning models to generate sports playlists for the playlist team.
In short:
- a lot of back-and-forth communication between the individual domain teams on the one hand and the central data team on the other,
- a process that pivots around the data team and its centralized monolithic architecture (which is more than likely to become a bottleneck).
Enabling a music platform using a data mesh approach
- The playlist team checks the data mesh self-serve data platform’s discovery portal for any data products related to sports activities. As they find some sports-related data products owned by the partnerships team, they check the documentation and request access to those data products (which they get instantly).
- Because the available information proves insufficient, the playlist team gets in touch with the partnerships data product owner to request the sports app data they need for generating dedicated sports playlists.
- In turn, the partnerships team contacts the sports app developer, checks the sports app API documentation, and retrieves all data of interest.
- The partnerships team then uses the self-serve data platform's capabilities to transform the source data into a new data product that the playlist team can use. Finally, they create the documentation and the code to access the data, making sure to unify the music IDs with the standard used within the music platform, and share this on the self-serve data platform.
- The playlist team can now use the sports app data product in combination with other data products to deliver a sports playlist data product and share that data product again with the entire data mesh on the self-serve data platform.
In short:
- all communications happen directly between the individual domains, who take responsibility for their data and have the necessary skills and tools to build their own data products,
- all data products are clearly documented, ready to use, and available to all teams on the self-serve data platform, which is governed by a limited group of domain representatives, platform experts, and data governance experts.
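To make the music ID step from the data mesh walkthrough a bit more tangible, here is a minimal Python sketch of how the partnerships team might unify sports app music IDs with the platform standard while building their data product. Everything in it is an assumption made for illustration: the lookup table, the record fields (`activity`, `music_id`), and the names `ListeningEvent` and `to_data_product` are hypothetical and do not come from a real sports app or platform.

```python
from dataclasses import dataclass

# Hypothetical lookup from sports app music IDs to the platform's standard IDs.
SPORTS_APP_TO_PLATFORM_ID = {"sa-101": "trk-9001", "sa-102": "trk-9002"}


@dataclass(frozen=True)
class ListeningEvent:
    """One row of the (hypothetical) sports app data product."""
    activity: str
    track_id: str  # always expressed in the platform's standard ID


def to_data_product(raw_events):
    """Translate raw sports app records to platform IDs.

    Returns the translated events plus the source IDs that could not be
    mapped, so the owning team can follow up instead of dropping them silently.
    """
    events, unmapped = [], []
    for record in raw_events:
        platform_id = SPORTS_APP_TO_PLATFORM_ID.get(record["music_id"])
        if platform_id is None:
            unmapped.append(record["music_id"])
        else:
            events.append(ListeningEvent(record["activity"], platform_id))
    return events, unmapped


# Made-up sample records, standing in for what the sports app API might return.
raw = [
    {"activity": "running", "music_id": "sa-101"},
    {"activity": "cycling", "music_id": "sa-102"},
    {"activity": "running", "music_id": "sa-999"},
]
events, unmapped = to_data_product(raw)
```

The point of the sketch is that the mapping lives with the team that owns the source data, so consumers such as the playlist team only ever see IDs in the shared standard.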
Can I get started with a data mesh today?
Data mesh principles such as domain-oriented ownership, data as a product, and federated governance are universal and can already be applied today in any organization.
Getting data specialists closer to the business and having business teams take ownership of the data they produce is an excellent way to push responsibility upstream and increase data literacy. In addition, offering data as a product to other internal data consumers increases the quality and interoperability of your data. And finally, the application of federated governance principles empowers domain teams to manage their own data models and quality while ensuring all teams stay aligned on several global policies that make the data products secure, compliant and interoperable.
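As an illustration of what "aligned on several global policies" could look like in practice, the sketch below checks a data product descriptor against a handful of global rules before it is published. It is only a sketch under stated assumptions: the required fields, the classification levels, and the function name `policy_violations` are hypothetical examples, not an established standard.

```python
# Hypothetical global policies agreed on by the federated governance group.
REQUIRED_FIELDS = ("owner", "description", "id_standard", "classification")
ALLOWED_CLASSIFICATIONS = {"public", "internal", "confidential"}


def policy_violations(descriptor):
    """Return a list of human-readable violations; an empty list means compliant."""
    violations = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS
        if not descriptor.get(field)
    ]
    classification = descriptor.get("classification")
    if classification and classification not in ALLOWED_CLASSIFICATIONS:
        violations.append(f"unknown classification: {classification}")
    return violations


# A made-up descriptor for the sports app data product.
sports_app_product = {
    "owner": "partnerships",
    "description": "Music played during sports activities",
    "id_standard": "platform-track-id",
    "classification": "internal",
}
compliant = policy_violations(sports_app_product)
incomplete = policy_violations({"description": "No owner yet", "classification": "secret"})
```

Domain teams remain free to model their data as they see fit; the check only covers the few rules that every data product has to share.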
Where does data mesh stand in the industry today?
Although we already see such principles applied at some of our clients, no large organization has fully embraced the data mesh approach yet. That is partly due to a lack of tools (for example, no complete self-serve data platform can be purchased right now). It is also because transforming traditional departments such as HR or marketing into autonomous domains with cross-functional teams that take full ownership of their data is quite challenging.
Likewise, connecting data products from legacy applications (mainframes) to a self-serve data platform might be a struggle. Hence, the data mesh approach currently seems to be more applicable to younger, digital organizations that don't carry too many legacy systems. For larger organizations, implementing a modern data platform is indispensable to enabling a data mesh approach.
A modern data platform is vital to make data mesh work efficiently
A potential risk of any decentralized domain-oriented process is that domain owners start setting up their own tools and infrastructure independently, effectively creating domain silos. What you need, instead, to make data mesh work efficiently is a unified vision shared by all business domains such that all domain infrastructure is set up according to well-defined standards. In addition to facilitating data exchange between different business domains, that also lowers infrastructure support and maintenance costs.
It is interesting to look at how we approach this at one of our clients. When a domain team wants to start processing data, a domain environment automatically spins up in the cloud through an infrastructure-as-code process: a data lake is set up, Azure Synapse is prepared, templates and pipelines are made ready, access security is configured, … and all of this is done automatically based on a limited number of parameters.
This kind of unified setup, where the entire environment (including development, staging, and production instances) is defined as code, would be difficult and very costly to achieve in an on-premises environment. Building this kind of modern self-serve data platform requires nothing less than a cloud-first strategy.
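The idea of deriving a whole domain environment from a limited number of parameters can be sketched in a few lines of Python. This is not the actual setup used at the client and it does not call any cloud API: the parameter names, the naming convention, and the template names are assumptions chosen purely for illustration. A real implementation would hand a plan like this to an infrastructure-as-code tool.

```python
from dataclasses import dataclass

# The three instances mentioned in the text; the short names are an assumption.
STAGES = ("dev", "staging", "prod")


@dataclass(frozen=True)
class DomainParams:
    """The few inputs a domain team would provide (hypothetical)."""
    domain: str
    region: str
    admin_group: str


def environment_plan(params):
    """Expand the parameters into one standardized resource plan per stage."""
    plan = []
    for stage in STAGES:
        prefix = f"{params.domain}-{stage}"
        plan.append({
            "stage": stage,
            "region": params.region,
            "data_lake": f"{prefix}-lake",
            "synapse_workspace": f"{prefix}-synapse",
            "pipeline_templates": ["ingest", "transform"],  # hypothetical templates
            "admin_group": params.admin_group,
        })
    return plan


# Example values only; none of them come from a real environment.
plan = environment_plan(
    DomainParams(domain="partnerships", region="westeurope", admin_group="partnerships-admins")
)
```

Because every domain goes through the same function, all environments follow the same naming and layout, which is what keeps decentralized teams from drifting into separate silos.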
Eager to know more about our data mesh experience?
Get in touch. We'll be more than happy to discuss how data mesh principles can help you create more impact with your data.