Data Governance
Data Engineering
Open Source
DataHub
Community
DataHub Community
Feb 3, 2023
This post, part of our DataHub Blog Community Contributor Program, is written by Venkata Krishnan, a DataHub Community Member.
Data Governance used to be a fancy topic roughly a decade ago (circa 2011/12), and when I first heard about it, my first question was:
Why do we need Data Governance at all?
This resulted from my mind’s voice asking: Are we being very strategic by funding a research project here?
I found my answers in these examples:
As is clear in the given scenarios, the simple starting point of a Data Governance journey is knowing where customer (master) data is, how many systems consume it now, and how many can potentially consume it in the future.
But an equally important question needs to be answered: How do we provide an authoring mechanism with a straightforward workflow (with a feedback loop) to ensure that the customer record is accurate? This also requires business knowledge of different customer touchpoints and how communications and interactions could be optimized.
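The starting point described above (knowing where customer master data lives and how many systems consume it) can be sketched as a tiny inventory. This is only an illustration: the system and dataset names below are hypothetical, and a real catalog would harvest this information automatically rather than from a hand-written list.

```python
from collections import defaultdict

# Hypothetical inventory: each pair says which system reads which dataset.
# All names are made up for illustration.
READS = [
    ("crm", "customer_master"),
    ("billing", "customer_master"),
    ("marketing_email", "customer_master"),
    ("billing", "invoices"),
]


def consumers_by_dataset(reads):
    """Group consuming systems under the dataset they read."""
    index = defaultdict(set)
    for system, dataset in reads:
        index[dataset].add(system)
    return index


index = consumers_by_dataset(READS)
for dataset in sorted(index):
    # Print each dataset, how many systems consume it, and which ones.
    print(dataset, len(index[dataset]), sorted(index[dataset]))
```

Even a list this small answers the first governance question: which systems would be affected if the customer record changed.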
Why has the importance of Data Governance exploded?
The earliest mention of Data Governance on Wikipedia dates back to 2006 (if you’re curious, compare the 2006 version of the article with the 2022 one to put into perspective how far we have come).
In my view, the top reasons for this explosion are cloud computing and the SaaS business model, in addition to the operational and monetary reasons that always existed. While cloud computing has commoditized compute and storage, SaaS solutions have changed how we think about business. Additionally, Artificial Intelligence and Data Science have matured since cloud computing came into the picture. These developments have led to an explosion in the amount of data businesses generate.
And yet, even today, modern data stacks/platforms/systems work with silos of information, often relying on humans with business and operations knowledge and good data stewardship to arrive at optimizations.
Can we leave Data Governance to individuals? Can they decide how to use data most optimally?
It is not humanly possible (or even advisable) for one person to master 10 lines of business (LoBs) in depth. Too much could be lost in translation and in the transition from the past to the future. It requires data and business literacy, and a huge dose of teamwork, especially in large businesses. So, a democratic approach to arriving at “data and business literacy” with a solid framework becomes critical, especially for rapidly growing companies that often merge with or acquire other companies.
Now that the “why” of Data Governance is clear, let us get into the “what” and the “how”.
Google “Data Governance use-cases” in 2022, and the following top 5 use-cases come up:
P.S: I have deliberately changed the order to bring ‘Data Discovery’ to the top. Sorry, Google Search!
When one of my project partners asked me, “What would be a good first step to start a Data Governance initiative?”, I was initially puzzled, though I have been a data professional for most of my career. But soon enough, a simple common-sense question brought the answer to me:
“Without knowing what we are going to govern, what can we govern?”
The simple answer: Assets in the context of Data and Business (in line with “data and business literacy”).
Active metadata catalogs and discovery tools are essential in a modern data stack to manage all the Data Assets in near-real time.
As business and the underlying data structures keep changing dynamically and rapidly, metadata changes at source must be accommodated, without adversely affecting the data pipelines, analytical systems, and other BI systems downstream.
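As a minimal sketch of what accommodating metadata changes at source can mean in practice, the snippet below compares two versions of a table schema and flags changes that are likely to affect downstream consumers. The schemas are hypothetical, and the rule that added columns are safe while removed or retyped columns are breaking is an assumption made for illustration, not something a specific tool prescribes.

```python
def diff_schema(old, new):
    """Compare two {column: type} mappings and classify the changes."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {
        "added": added,
        "removed": removed,
        "retyped": retyped,
        # Assumption: only removed or retyped columns can break consumers.
        "breaking": bool(removed or retyped),
    }


# Hypothetical before/after schemas for a customer table.
old_schema = {"customer_id": "int", "email": "string", "phone": "string"}
new_schema = {"customer_id": "int", "email": "string", "signup_date": "date"}

change = diff_schema(old_schema, new_schema)
print(change)
```

An active metadata catalog could run a check of this kind whenever source metadata changes and notify the owners of affected pipelines and dashboards before anything breaks.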
Do check out the O’Reilly articles linked in the References section for validation of this approach.
Data Governance is a set of policies and processes. The following are the key ingredients to Data Governance success over time:
It all begins with identifying the right data catalog/discovery tools.
This is rare, but it is the ideal case. If most of the data a company deals with are generated internally, we can be extremely paranoid about how we instrument the application(s) to collect the data we need to govern. Instead of creating a huge haystack of data from which to search needle-sized data, we could arrive at the metrics we need in advance and work backward from there.
Data governance is not about generating and managing huge volumes of data without a purpose. In practice, however, this level of maturity is rare when a company starts up, as its founders and engineering leaders may be business or product experts, not data experts. Also, arriving at a good Data Governance strategy and the corresponding solution requires a considerable investment of time and money, something businesses may not be able to afford in the early days.
To make things harder, for external sources of data, it is not always possible to plan, given that business needs keep changing and evolving.
Guidelines for an ideal solution:
Key recipients:
Multiple user personas in a company could benefit from such a solution, typically (though not restricted to):
Or in effect, anyone looking to get value out of data.
A granular, scientific approach to identifying all the features outlined above in a catalog, discovery, and observability tool is one part of solving the problem. However, finding one tool that solves all your unique problems is almost impossible!
Trust me: A tool just enables you to do the job, but a Data Governance journey is a tricky balance of people, policies, and processes — and realizing this is critical to any such journey:
My recommendation: Classify the use cases as must-haves (critical), ideal candidates (high), good to have (medium), or anything else (low), weighted by their usefulness for a successful Data Governance strategy. It is critical to have a clear requirements document that describes all the use cases at a level of detail that clearly shows what everyone eventually wants the solution to be.
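The weighting exercise above can be sketched in a few lines. Everything in this example is an assumption for illustration: the criteria, the weights, the 1-to-5 ratings, the tier thresholds, and the use-case names would all come from your own requirements document.

```python
# Hypothetical criteria and weights (they sum to 1.0).
WEIGHTS = {"business_value": 0.5, "risk_reduction": 0.3, "ease": 0.2}


def weighted_score(ratings):
    """Combine 1-to-5 ratings into a single weighted score."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


def tier(score):
    """Map a score to a priority tier; the thresholds are assumptions."""
    if score >= 4.0:
        return "critical"
    if score >= 3.0:
        return "high"
    if score >= 2.0:
        return "medium"
    return "low"


# Hypothetical use cases with made-up ratings.
use_cases = {
    "data_discovery": {"business_value": 5, "risk_reduction": 4, "ease": 4},
    "data_lineage": {"business_value": 4, "risk_reduction": 3, "ease": 2},
    "business_glossary": {"business_value": 3, "risk_reduction": 2, "ease": 3},
    "ui_theming": {"business_value": 1, "risk_reduction": 1, "ease": 3},
}

tiers = {name: tier(weighted_score(r)) for name, r in use_cases.items()}
for name, level in tiers.items():
    print(f"{name}: {level}")
```

The value of writing the weights down is less the arithmetic than the conversation it forces: stakeholders have to agree on what matters before any tool is evaluated.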
Be it a Data Platform or Data Governance initiative or Data Products or Data Mesh implementation, everything hangs on business priorities and cost. Where do you want to spend time and money in the given context of business and operational needs? Is the proposed solution even good in the given business context?
A clear, low-risk approach of dividing the problem into chunks and solving one at a time is crucial to a good Data Governance initiative in the long run.
If you have read this far, I’m sure you have heard terms like Data Mesh, Data Contracts, Data Fabric, Data Virtualization, etc.
What do all of these have to do with Data Governance?
Thankfully, Data Governance cuts across all businesses and domains and does not change with the underlying Data Architecture or Data Platforms. It is not a commodity service or product, but a discipline (policies and processes) that the organization carries in the long term.
While it can be a challenge to carve out the time, money, people, and resources for data governance, the sooner you do it, the better it is for the business and everyone involved.
Whatever the design, avoiding a single point of failure is crucial, and the members of such a team can be:
There are several open-source tools for Data Cataloging and Data Discovery; the most popular ones are DataHub, Amundsen, and Atlas, in that order (in terms of rich features and GitHub stars and forks). I felt DataHub has great community support, but that is, of course, my personal opinion at this point.
If you are interested in learning more, please reach out to me at venkat@resolv360.com
https://www.oreilly.com/library/view/data-governance-the/9781492063483/ch01.html
https://www.oreilly.com/radar/governance-and-discovery/