Sep 19, 2023
As the data landscape continues to transform rapidly, the concept of data products has emerged as pivotal to managing and leveraging data effectively, and the conversation around the data mesh architecture has further popularized it. Speaking of the data mesh, I’ve seen two kinds of people in the data community: those who believe it is an upcoming trend, and those who argue that we have been practicing data mesh all along.
Regardless of which group you belong to, we can all agree that data should be managed as a product. Aligning with this belief, we at DataHub have taken a community-guided approach to defining, developing, and building the Data Product within DataHub.
In this article, I share an overview of DataHub’s vision and current model for Data Products, as well as our plans and commitments for the future.
Put simply, Data Products in DataHub are an extension of:
Data Products represent collections of assets that are important to you, which you can group together to manage and maintain as a unit. Data Products have owners, tags, glossary terms, and documentation.
Data Products are a way to organize and manage your Data Assets, such as Tables, Topics, Views, Pipelines, Charts, Dashboards, etc., within DataHub. They belong to a specific Domain and can be easily accessed by various teams or stakeholders within your organization.
Here are three reasons that have guided our thinking about Data Products and why the community needs them:
At DataHub, bringing business users and technical users together by enabling Shift-Left practices in metadata – for instance, by enabling data products to be
(More on this in the next section)
We wanted the community to guide our understanding and practical implementation of Data Products. We set up a dedicated Data Products channel, Design Data Product Entity, to bring the community together to guide us on
The community delivered, and how – with ideas, visualizations, use cases, academic papers, and more.
Recognizing the complexity of data products, we adopted a minimum viable product (MVP) strategy for modeling. Here’s how this played out:
Data Products within DataHub help you “define a boundary around these assets to create and define your own Data Product”
For example, here’s a Revenue Data Product that includes the pipelines your team works on, the tables they produce, and the dashboard that gets exported.
And with that, you have your own Data Product with ownership, tags, glossary terms, and documentation that you manage and can share with other teams.
In the next iteration, we’re working on capabilities that help you to mark certain elements within the Data Product (say, as being internal or shareable, as in the example below).
This way, downstream teams can depend on outputs from your Data Product in the definition of their Data Product.
With Data Products, we were clear about one thing: Data Product definitions should be managed just like we manage everything else, as code.
As I mentioned earlier, this meant enabling both business users (usually personas who are not familiar with git) and technical users (who are very familiar with git-based practices) to participate in Shift-Left approaches to managing the Data Product definition life cycle.
With DataHub’s approach to Data Products, you can bring the business user and the developer together through the ability to
You can create a Data Product in DataHub using the UI or via a YAML file.
DataHub also comes with a YAML-based Data Product spec for defining and managing Data Products as code. Here is an example of a Data Product named "Pet of the Week" which belongs to the Marketing domain and contains three data assets. The Spec tab describes the JSON Schema spec for a DataHub data product file.
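Since the example file itself is not reproduced here, the following is a minimal sketch of what such a YAML definition might look like. The field names reflect my understanding of the DataHub Data Product file format and may differ between versions. The domain URN, the three asset URNs, and the owner are hypothetical placeholders, so treat the Spec tab as the authoritative schema.

```yaml
# Sketch of a Data Product definition file (illustrative values only).
id: pet_of_the_week
domain: urn:li:domain:marketing   # hypothetical URN for the Marketing domain
display_name: Pet of the Week
description: |-
  Assets that power the Pet of the Week campaign.

# The three data assets that belong to this Data Product (hypothetical URNs).
assets:
  - urn:li:dataset:(urn:li:dataPlatform:snowflake,marketing.pets.pet_profiles,PROD)
  - urn:li:dataset:(urn:li:dataPlatform:snowflake,marketing.pets.pet_votes,PROD)
  - urn:li:dashboard:(looker,pet_of_the_week_overview)

owners:
  - id: urn:li:corpuser:jdoe   # hypothetical owner
    type: BUSINESS_OWNER
```

Keeping a file like this in git means changes to the Data Product's assets or ownership can be reviewed and versioned like any other code change.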
You can assign an asset to a Data Product using the Data Product page or the Asset's page as shown below.
(Ways to add Data Assets to a Data Product in DataHub)
Here’s a DataHub Townhall video recorded in May 2023, showing you exactly how Data Products in DataHub work.
The best part is you can also create views in DataHub that only include Data Products – for when you want your users to stay firmly in the Data Product space – without having to view tables, dashboards, and other more technical aspects.
We believe that Data Products can revolutionize the way enterprises manage and share their data. They are fundamental to effective data management, and only by embracing the concept of data as a product can organizations unlock the full potential of their data assets.
As we move forward, we are committed to keeping the community involved in shaping the future of Data Products. We’d love to hear from you. Take Data Products out for a spin and let us know what you think.
Our Data Products Feature Page has everything you need to get started.