
Data Products in DataHub: Everything You Need to Know

Shirshanka Das

Sep 19, 2023

As the data landscape continues to transform rapidly, the concept of data products has emerged as pivotal to managing and leveraging data effectively, and the conversation around the data mesh architecture has popularized it further. Speaking of the data mesh, I’ve seen two kinds of people in the data community: those who believe it is an upcoming trend, and those who argue that we have been practicing data mesh all along.

Regardless of which group you belong to, we can all agree that data should be managed as a product. Aligning with this belief, we at DataHub have taken a community-guided approach to defining, developing, and building the Data Product within DataHub.

In this article, I share an overview of DataHub’s vision and current model for Data Products, as well as our commitments for the future.

What is a Data Product?

Put simply, Data Products in DataHub are an extension of:

  • treating “data as a product” and
  • applying product thinking to data assets.

Data Products represent collections of assets that are important to you, which you can group together to manage and maintain. Data Products have owners, tags, glossary terms, and documentation.

Data Products are a way to organize and manage your Data Assets, such as Tables, Topics, Views, Pipelines, Charts, Dashboards, etc., within DataHub. They belong to a specific Domain and can be easily accessed by various teams or stakeholders within your organization.

Data Products: Why We Need to Look at Data as a Product

Here are three reasons that have guided our thinking about Data Products and why the community needs them:

  • Equipping business users: Data Products enable a move towards discoverable and documented data assets that non-technical users can work with directly, without having to rely on analysts or developers.
  • Interconnectivity and collaboration: A key function of Data Products is to enhance sharing between teams and foster collaboration across different data products. When a team defines a Data Product, or marks specific components within it as internal or shareable, downstream teams can depend on its outputs.
    This interconnectedness lets organizations build on existing data assets instead of recreating them.
  • Integration with existing workflows: To bridge the gap between business users and developers, we need to incorporate familiar development practices into data product definitions. This approach ensures that changes made to data product definitions are seamlessly incorporated, thereby streamlining workflows and facilitating collaboration.

At DataHub, we are bringing business users and technical users together by enabling shift-left practices in metadata, for instance by enabling data products to be

  • defined using YAML and
  • managed as code.

(More on this below, in the section on modeling Data Products.)

Data Products: Inputs from the DataHub Community

We wanted the community to guide our understanding and practical implementation of Data Products. We set up a dedicated Data Products channel, Design Data Product Entity, to bring the community together to guide us on

  • What Data Products look like to them
  • How Data Products should be represented within DataHub

The community delivered, and how – with ideas, visualizations, use cases, academic papers, and more.

Recognizing the complexity of data products, we adopted a minimum viable product (MVP) strategy for modeling. Here’s how this played out:

  • What: Our focus was to define boundaries around relevant data elements, such as pipelines, tables, and dashboards, to create distinct data products.
  • Why: This approach enables attaching owners, tags, glossary terms, and comprehensive documentation to each data product. Of course, further iterations will be necessary to refine and expand the data product entity.
  • How: Start from the simple, raw technical metadata graph we’re all familiar with: streams, tables, processing pipelines, more tables, and dashboards.

Data Products within DataHub help you define a boundary around these assets to create and define your own Data Product.

For example, consider a Revenue Data Product made up of the pipelines that your team works on, the tables they produce, and the dashboard that gets exported.

And with that, you have your own Data Product with owners, tags, glossary terms, and documentation, which you manage and can share with other teams.

In the next iteration, we’re working on capabilities that let you mark certain elements within the Data Product, for example as internal or shareable.

This way, downstream teams can depend on outputs from your Data Product in the definition of their Data Product.

The DataHub Approach to Modeling Data Products

With Data Products, we were clear about one thing: managing Data Product definitions just like we manage everything else – like code.

As I mentioned earlier, this meant enabling business users (usually personas that are not familiar with Git) and technical users (who are very familiar with Git-based practices) to participate in shift-left approaches to managing the Data Product definition life cycle.

With DataHub’s approach to Data Products, you can bring the business user and the developer together through the ability to

  • define data products as YAML,
  • check them into Git (and have them synced with DataHub),
  • let the business user collaborate on and refine the definition of the data product, and
  • flow changes back to where they started, so that these two worlds no longer have to live so far from each other.


Creating a Data Product in DataHub

You can create a Data Product in DataHub using the UI or via a YAML file.

DataHub also comes with a YAML-based Data Product spec for defining and managing Data Products as code. For example, a Data Product named "Pet of the Week" can be defined to belong to the Marketing domain and contain three data assets. The format of a DataHub data product file is described by a JSON Schema spec.
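A minimal sketch of what such a file could look like is shown below. The field names follow my understanding of the Data Product spec, so treat the JSON Schema as the authoritative reference. The asset URNs, the owner, and the tag are hypothetical placeholders, not real assets.

```yaml
# Sketch of a Data Product definition file (e.g. pet_of_the_week.yaml).
# Field names are assumed from the Data Product spec; verify against the schema.
id: pet_of_the_week
domain: Marketing
display_name: Pet of the Week
description: |-
  Data assets that power the Pet of the Week campaign.

# The three assets that fall inside this Data Product's boundary.
# These URNs are illustrative placeholders.
assets:
  - urn:li:dataset:(urn:li:dataPlatform:snowflake,marketing.pets.profiles,PROD)
  - urn:li:dataset:(urn:li:dataPlatform:snowflake,marketing.pets.adoptions,PROD)
  - urn:li:dataset:(urn:li:dataPlatform:kafka,pet_of_the_week_events,PROD)

# Ownership (placeholder user).
owners:
  - id: urn:li:corpuser:jdoe
    type: BUSINESS_OWNER

# Tags attached to the Data Product (placeholder tag).
tags:
  - urn:li:tag:adoption
```

Because the definition is plain YAML, it can be reviewed in a pull request like any other code change. As far as I know, a file like this is synced with the DataHub CLI's dataproduct command (for example `datahub dataproduct upsert -f pet_of_the_week.yaml`); check the CLI help for the exact options in your version.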

Assigning Assets to a Data Product

You can assign an asset to a Data Product using the Data Product page or the Asset's page as shown below.

(Ways to add Data Assets to a Data Product in DataHub)

Here’s a DataHub Townhall video recorded in May 2023, showing you exactly how Data Products in DataHub work.

Link: https://www.youtube.com/watch?v=vpz62mpvUVs

The best part is you can also create views in DataHub that only include Data Products – for when you want your users to stay firmly in the Data Product space – without having to view tables, dashboards, and other more technical aspects.

The Way Ahead for Data Products

We believe that Data Products can revolutionize the way enterprises manage and share their data. They are fundamental to effective data management, and it’s only by embracing the concept of data as a product that organizations can unlock the full potential of their data assets.

As we move forward, we are committed to keeping the community involved in shaping the future of Data Products. We’d love to hear from you. Take Data Products out for a spin and let us know what you think.

Our Data Products Feature Page has everything you need to get started.
