The Data Mesh concept, first introduced by Zhamak Dehghani, is gaining significant popularity. Data Mesh proposes a new approach to enterprise data management based on distributed architecture, governance, and ownership. We have seen some of our customers adopt the Data Mesh concept and create their own customized versions. As implementation partners, we are often asked the ‘double-click’ questions:
- What is a Data Product, and how is it different from what we are doing today?
- How do I prepare my platform to support Data Mesh architecture?
- How do I cultivate a data-driven decision-making culture by leveraging Data Products and an Enterprise Data Marketplace?
In this three-part blog series, we will try to answer these questions and introduce Persistent Data Foundry, a platform that supports Data Mesh architecture.
Let’s first talk about the characteristics of Data Products. Let’s also understand how to convert the existing inventory of Data Pipelines (ETL/ELT) and Data Sets to Data Products.
Characteristics of Data Products
According to Zhamak Dehghani’s original positioning, a Data Product should have the following characteristics:
- It should be owned by the Domain Team.
- It should be usable and valuable to other domains.
- It should be feasible to implement.
- It should be discoverable and self-describing, both semantically and syntactically.
- It should be trustworthy, i.e., the owner should be able to measure its completeness, lag, timeliness, and statistical shape, and stand behind it.
- It should be interoperable, or in other words, governed by global standards.
- It should be secure. The Data Product Owner should be able to define the access control policy.
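To make these characteristics concrete, here is a minimal Python sketch of how a Data Product could carry its own description, quality measurements, and access policy. The `DataProduct` class, its field names, and the `orders_daily` example are hypothetical illustrations, not part of any standard or of Persistent Data Foundry.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor; the field names are illustrative, not a standard."""

    name: str
    owner_domain: str          # owned by the domain team
    description: str           # semantic self-description
    schema: dict[str, str]     # syntactic self-description: column -> type
    quality: dict[str, float] = field(default_factory=dict)   # e.g. completeness, lag
    standards: list[str] = field(default_factory=list)        # global standards it follows
    allowed_roles: set[str] = field(default_factory=set)      # policy set by the owner

    def can_read(self, role: str) -> bool:
        """Apply the owner-defined access policy to a consumer role."""
        return role in self.allowed_roles


# Example values below are made up for illustration.
orders = DataProduct(
    name="orders_daily",
    owner_domain="sales",
    description="One row per confirmed order, refreshed daily.",
    schema={"order_id": "string", "amount": "decimal"},
    quality={"completeness": 0.99, "lag_hours": 6.0},
    standards=["ISO 8601 timestamps"],
    allowed_roles={"finance_analyst"},
)
```

Because the description, schema, and quality figures travel with the product, a consumer in another domain does not need to ask the owning team what the data means or how far it can be trusted.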
But we have Data Pipelines and Datasets in our Data Lake, and not Data Products
- Data Pipelines and Datasets exist, but ownership lies with the central Data and Analytics (D&A) team.
- The Datasets are usable and valuable to the domain owning them. They are probably useful to others as well.
- The Datasets are clearly feasible to implement; in fact, they are already implemented.
- Datasets may be discoverable if a catalog is already implemented and maintained.
- Datasets are not self-describing. They are at the mercy of the cataloging tool.
- Datasets might be trustworthy, but only the pipeline team knows how quality is measured.
- Datasets are intentionally NOT governed by global standards.
- Datasets are protected by default, and access is granted on a case-by-case basis.
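The comparison above can be turned into a simple gap check per dataset. The sketch below assumes each dataset is described by a plain dictionary exported from a catalog; the keys (`owner`, `in_catalog`, `dq_rules`, and so on) and the `crm_contacts` entry are hypothetical and would need to be mapped to whatever your catalog actually stores.

```python
def product_gaps(dataset: dict) -> list[str]:
    """List the Data Product characteristics a catalog entry does not yet meet."""
    checks = [
        # Ownership by the central team does not count as domain ownership.
        ("domain ownership", dataset.get("owner") not in (None, "", "central D&A")),
        ("discoverable", bool(dataset.get("in_catalog"))),
        ("self-describing", bool(dataset.get("description")) and bool(dataset.get("schema"))),
        ("trustworthy", bool(dataset.get("dq_rules"))),
        ("interoperable", bool(dataset.get("standards"))),
        ("secure", bool(dataset.get("access_policy"))),
    ]
    return [name for name, ok in checks if not ok]


# A made-up legacy dataset as it might appear in a central data lake.
legacy = {
    "name": "crm_contacts",
    "owner": "central D&A",
    "in_catalog": True,
    "schema": {"contact_id": "string"},
    "access_policy": "case-by-case",
}

print(product_gaps(legacy))
```

Running a check like this across the lake gives a rough first inventory of how far each dataset is from being a Data Product.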
Datasets to Data Products – Our Migration Methodology
No enterprise wants to throw away the investment it has already made in implementing a Data Lake. So the question is: how do we convert the existing Data Pipelines and Datasets into Data Products? Here is how we approach it:
| Phase | Task | Outcome |
| --- | --- | --- |
| Global Discovery | Go over the datasets across the lake/Data Warehouse (DWH) and identify the domain owners. Mark first-party and third-party ownership and second-party data usage. The task is simpler if a data catalog is maintained; otherwise, we may need to go through a technical catalog such as Hive Metastore. | |
| Local Discovery | The domain owners conduct local discovery of the datasets identified for them. They identify which data entities they create (first party), which they purchase (third party), and which they only consume. Once the tentative definition of the Data Products is created, define the Data Quality (DQ) rules that establish trustworthiness, along with the global access policy. Create a roadmap to convert each dataset into a Data Product. | |
| Ownership Handover | The organization needs to decide whether it wants a big-bang transformation to domain-based ownership. The alternative is to create a domain-based technical team under D&A that owns a specific set of domains. | |
| Migration Implementation | Start by identifying the Data Product gaps; for example, DQ rules may not have been implemented. Then create a project plan and fill in the gaps to convert the datasets into Data Products. | |
| Run | Monitor the Data Product against DQ, freshness, and other agreed-upon SLAs. Catch schema drift using Data Product versioning. Measure and improve adoption. | |
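As an illustration of the Run phase, the following Python sketch shows one possible way to measure completeness and freshness and to detect schema drift between the published and the observed schema. All function names are hypothetical, and the versioning rule (a new major version when columns are removed or retyped, a minor version when columns are only added) is an assumption made for this example, not a rule stated by the Data Mesh literature.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def completeness(rows: list[dict], column: str) -> float:
    """Share of rows in which `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def is_fresh(last_loaded: datetime, max_lag: timedelta, now: datetime) -> bool:
    """True if the last load falls within the agreed freshness SLA."""
    return now - last_loaded <= max_lag


def schema_drift(published: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Compare column -> type mappings and report what changed."""
    common = set(published) & set(observed)
    return {
        "added": sorted(set(observed) - set(published)),
        "removed": sorted(set(published) - set(observed)),
        "retyped": sorted(c for c in common if published[c] != observed[c]),
    }


def next_version(current: str, drift: dict[str, list[str]]) -> str:
    """Assumed rule: breaking drift bumps major, additive drift bumps minor."""
    major, minor = (int(part) for part in current.split("."))
    if drift["removed"] or drift["retyped"]:
        return f"{major + 1}.0"
    if drift["added"]:
        return f"{major}.{minor + 1}"
    return current


# Made-up sample data for illustration.
rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
drift = schema_drift(
    {"id": "int", "email": "string"},
    {"id": "int", "email": "string", "phone": "string"},
)
print(completeness(rows, "email"), drift, next_version("1.2", drift))
```

In practice these checks would run on a schedule against each Data Product, with the results published alongside the product so that consumers can see whether the agreed SLAs are being met.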
Summary
The Data Mesh pattern introduces two concepts: distributed domain-driven ownership and Data Products. Domain-driven ownership requires deep organizational change, which is not simple to adopt. The Data Product concept is easier to adopt and is very useful in making an organization data-driven. Migrating from Datasets to Data Products is a sound first step toward the Data Mesh paradigm.