

Starting a Data Governance Journey

DataHub Community

Feb 3, 2023

This post, part of our DataHub Blog Community Contributor Program, is written by Venkata Krishnan, a DataHub Community Member.

Data Governance was a fancy topic roughly a decade ago (circa 2011/12), and when I first heard about it, my first question was:

Why do we need Data Governance at all?

This came from an inner voice asking: are we really being strategic by funding what looks like a research project here?

I found my answers in these examples:

  1. In a retail business, a bad address or incorrect demographic information can cause parcels to be returned or rerouted, and resending them adds a few more transits. This is a typical Master Data Management/Data Quality issue that can result in a sizeable monetary loss.
  2. Imagine that you need to communicate with a customer about two different products or lines of business (LoB). If we maintain the customer's demographic details in two different places/systems, we could end up sending two separate mailers that could, and should, have been consolidated into one. If the cost of sending a mailer is $1 and there are 100,000 customers in the system, we could be spending $200,000 instead of $100,000 (see the sketch below).
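
To make the second scenario concrete, here is a tiny back-of-the-envelope sketch in Python (hypothetical record layout and numbers, not tied to any real system) that counts the mailers sent when the same customers live in two systems versus after the records are merged on a common key:

    # Hypothetical illustration of the duplicate-mailer cost in scenario 2.
    COST_PER_MAILER = 1.00  # dollars

    # The same 100,000 customers, held separately by two lines of business.
    lob_a_customers = [f"customer-{i}@example.com" for i in range(100_000)]
    lob_b_customers = [f"customer-{i}@example.com" for i in range(100_000)]

    # Without a shared (master) customer record, each LoB mails independently.
    mailers_without_mdm = len(lob_a_customers) + len(lob_b_customers)

    # With master data management, records are first merged on a common key.
    mailers_with_mdm = len(set(lob_a_customers) | set(lob_b_customers))

    print(f"Cost without MDM: ${mailers_without_mdm * COST_PER_MAILER:,.0f}")  # $200,000
    print(f"Cost with MDM:    ${mailers_with_mdm * COST_PER_MAILER:,.0f}")     # $100,000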

As is clear in the given scenarios, the simple starting point of a Data Governance journey is knowing where customer (master) data is, how many systems consume it now, and how many can potentially consume it in the future.

But an equally important question needs to be answered: How do we provide an authoring mechanism with a straightforward workflow (with a feedback loop) to ensure that the customer record is accurate? This also requires business knowledge of different customer touchpoints and how communications and interactions could be optimized.

The WHY of Data Governance

Why has the importance of Data Governance exploded in growth?

The earliest mention of Data Governance on Wikipedia dates back to 2006 (if you're curious, compare that version of the article with the 2022 one to put into perspective how far we have come).

In my view, the top reasons for this explosion are cloud computing and the SaaS business model, in addition to the operational and monetary reasons that have always existed. While cloud computing has commoditized compute and storage, SaaS solutions have changed how we think about business. Additionally, Artificial Intelligence and Data Science have matured ever since cloud computing came into the picture. These developments have led to an explosion in the amount of data businesses generate.

And yet, even today, modern data stacks/platforms/systems work with silos of information, often relying on humans with business and operations knowledge and good data stewardship to arrive at optimizations.

Can we leave Data Governance to individuals? Can they decide how to use data most optimally?

It is not humanly possible (or even advisable) for one person to master 10 LoBs in depth. Too much could be lost in translation, and in the transit from the past to the future. It requires data and business literacy, and a huge dose of teamwork, especially in large businesses. So a democratic approach to arriving at "data and business literacy", backed by a solid framework, becomes critical, especially when you think about rapidly growing companies that often merge with or acquire other companies.

Now that the “why” of Data Governance is clear, let us get into the “what” and the “how”.


What does a Good Data Governance Journey Look Like?

Google "Data Governance use-cases" in 2022, and the following top five use cases come up:

  • Data discovery and data literacy provisions.
  • Collaborative analytics or building new data products.
  • Data privacy compliance.
  • A centralized repository of all standardized business terms.
  • Centralized data access management.

P.S: I have deliberately changed the order to bring ‘Data Discovery’ to the top. Sorry, Google Search!

When one of my project partners asked me, "What would be a good step to start a good Data Governance initiative?", I was initially puzzled, though I have been a data professional for most of my career. But soon enough, a simple, common-sense question brought the answer to me:

“Without knowing what we are going to govern, what can we govern?”

The simple answer: Assets in the context of Data and Business (in line with “data and business literacy”).

Active Metadata Catalogs/ Discovery tools are essential in a modern data stack to manage all the Data Assets in near-real time.

As business and the underlying data structures keep changing dynamically and rapidly, metadata changes at source must be accommodated, without adversely affecting the data pipelines, analytical systems, and other BI systems downstream.
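
As one concrete illustration, a catalog such as DataHub lets you run metadata ingestion as code, so a scheduler can re-sync source metadata as often as needed and downstream consumers see schema changes quickly. The sketch below uses DataHub's documented programmatic ingestion pipeline; the connection details are placeholders, and the exact configuration options depend on the source and the DataHub version you run.

    # Minimal sketch: pull table/schema metadata from a Postgres source into DataHub.
    # Requires the acryl-datahub package with the postgres plugin installed.
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "host_port": "prod-db.internal:5432",  # placeholder
                    "database": "sales",                    # placeholder
                    "username": "datahub_reader",           # placeholder
                    "password": os.environ["POSTGRES_PASSWORD"],
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},  # DataHub GMS endpoint
            },
        }
    )

    pipeline.run()
    pipeline.raise_from_status()  # fail loudly if the sync did not complete

Run something like this from a scheduler (cron, Airflow, etc.) to keep the catalog in near-real-time sync with the source.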

Do check out the O'Reilly articles linked in the References section for validation of this approach.

What Should be Part of a Good Data Governance Solution?

Data Governance is a set of policies and processes. The following are the key ingredients to Data Governance success over time:

  • An Active data catalog
    - Inventory of data assets
    - Data lineage, ideally with support for column-level lineage
  • Good data lifecycle management
    - Data collection/ ingestion frameworks
    - Data storage
    - Data retention policies
  • Data pipeline orchestration
  • Data security
    - PII Data Management
    - Data Access Controls (RBAC)
  • Data democracy for public data
  • Data quality
    - MDM
    - Quality Metrics
  • Data monitoring and alerting
    - Availability of data at the right SLAs
    - Health
    - Anomaly detection
  • Using Data Science as a tool for better governance
  • Business intelligence

It all begins with identifying the right data catalog/ discovery tools.
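
To make a couple of those ingredients tangible (PII management and clear ownership), here is a small sketch using DataHub's documented Python REST emitter to attach a PII tag and an owner to a dataset; the server address, dataset name, tag, and user are illustrative assumptions.

    # Sketch: record ownership and a PII tag for a dataset in the catalog,
    # using DataHub's Python REST emitter (acryl-datahub package).
    from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn, make_user_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        GlobalTagsClass,
        OwnerClass,
        OwnershipClass,
        OwnershipTypeClass,
        TagAssociationClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # DataHub GMS endpoint
    dataset_urn = make_dataset_urn(platform="postgres", name="sales.public.customers", env="PROD")

    # Mark the dataset as containing PII so access policies can key off the tag.
    tags_aspect = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("PII"))])

    # Record a data owner, so there is a clear point of accountability.
    ownership_aspect = OwnershipClass(
        owners=[OwnerClass(owner=make_user_urn("data.steward"), type=OwnershipTypeClass.DATAOWNER)]
    )

    for aspect in (tags_aspect, ownership_aspect):
        emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=aspect))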

How do we Crack this Data Governance Puzzle?

The Ideal Proactive Approach

This is rare, but it is the ideal case. If most of the data a company deals with is generated internally, we can be extremely paranoid about how we instrument the application(s) to collect the data we need to govern. Instead of creating a huge haystack of data and then searching it for needles, we could decide on the metrics we need in advance and work backward from there.

Data Governance is not about generating and managing huge volumes of data without a purpose. In practice, though, this level of maturity is rare when a company starts up, as the founders and engineering leaders may be business or product experts rather than data experts. Also, arriving at a good Data Governance strategy and the corresponding solution requires a considerable investment of time and money, something businesses may not be able to afford in the early days.

To make things harder, it is not always possible to plan for external sources of data, given that business needs keep changing and evolving.

A More Pragmatic Approach

Guidelines for an ideal solution: identify a good Data Catalog that integrates with your existing and future/planned data systems, and make sure that it:

  1. Is data platform agnostic
  2. Is on-prem/cloud agnostic
  3. Scales with your data estate
  4. Is cost-effective and offers democratic access to thousands of users
  5. Enables a data culture across the company
  6. Supports expansion of the business, M&A, etc.
  7. Supports all or most of the popular open-source data tools available
  8. Has a buzzing and helpful community
  9. Offers regional context and support (for global companies)

Key recipients:

There are multiple user personas in a company who could benefit from such a solution. These typically include (though are not restricted to):

  • CXOs
  • Product Managers
  • Data Analysts
  • Data Scientists
  • Data Engineers/ Engineers
  • Data Architects
  • Data Stewards

Or in effect, anyone looking to get value out of data.

Finding the Right Solution

A granular, scientific approach to identifying all the features outlined above in a catalog, discovery, and observability tool is one part of solving the problem. However, finding one tool that solves all your unique problems is almost impossible!

Trust me: a tool just enables you to do the job, but a Data Governance journey is a tricky balance of people, policies, and processes, and realizing this is critical to any such journey.

My recommendation: identify the use cases that are must-haves (critical), ideal candidates (high), good to have (medium), and anything else (low), weighted by their usefulness to a successful Data Governance strategy. It is critical to have a clear requirements document that describes all the use cases in enough detail to show what everyone eventually wants the solution to be.
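
As an illustration of that recommendation, the sketch below scores hypothetical candidate tools against weighted use cases; the use cases, weights, and coverage scores are made up purely to show the mechanics of the prioritization.

    # Hypothetical weighted scoring of candidate catalog tools against use cases.
    WEIGHTS = {"critical": 4, "high": 3, "medium": 2, "low": 1}

    # Priority assigned to each use case (illustrative only).
    use_case_priority = {
        "data discovery": "critical",
        "column-level lineage": "high",
        "PII tagging & access control": "critical",
        "business glossary": "medium",
        "cost of operation": "low",
    }

    # How well each candidate covers each use case, on a 0-5 scale (made-up numbers).
    coverage = {
        "Tool A": {"data discovery": 5, "column-level lineage": 4,
                   "PII tagging & access control": 4, "business glossary": 3,
                   "cost of operation": 3},
        "Tool B": {"data discovery": 4, "column-level lineage": 2,
                   "PII tagging & access control": 3, "business glossary": 4,
                   "cost of operation": 5},
    }

    def weighted_score(tool_coverage: dict) -> int:
        return sum(WEIGHTS[use_case_priority[uc]] * score for uc, score in tool_coverage.items())

    # Rank the candidates by weighted score, highest first.
    for tool in sorted(coverage, key=lambda t: -weighted_score(coverage[t])):
        print(f"{tool}: {weighted_score(coverage[tool])}")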

The Timing For a Good Data Governance Initiative

Be it a Data Platform initiative, a Data Governance initiative, Data Products, or a Data Mesh implementation, everything hinges on business priorities and cost. Where do you want to spend time and money, given the business and operational needs? Is the proposed solution even a good fit in the given business context?

A clear, low-risk approach of dividing the problem into chunks and solving one chunk at a time is crucial to a good Data Governance initiative in the long run.

Data Governance in the New World

If you have read this far, I’m sure you have heard terms like Data Mesh, Data Contracts, Data Fabric, Data Virtualization, etc.

What do all of these have to do with Data Governance?

Thankfully, Data Governance cuts across all businesses/domains and does not change with the underlying Data Architecture or Data Platforms. It is not a commodity service or product, but a discipline (policies and processes) that the organization carries in the long term.

While it can be a challenge to carve out the time, money, people, and resources for data governance, the sooner you do it, the better it is for the business and everyone involved.

Team Structure

Whatever the design, avoiding a single point of failure is crucial, and the members of such a team can be:

  • Domain expert(s) of the LoB(s)
  • Engineer(s) — Software developers and Data related
  • Product Manager(s), specifically for any Data Products
  • Data Analyst(s)
  • Documentation Experts
  • Program/ Project Manager
  • Data Steward(s) who can handle/ understand multiple LoBs/ Domains

Open Source Tools

There are several open-source tools for Data Cataloging/Data Discovery; however, the most popular ones are DataHub, Amundsen, and Atlas, in that order (in terms of rich features, GitHub stars, and forks). I feel DataHub has great community support, but that is, of course, my personal opinion at this point.
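
If you would rather check the popularity numbers than take my word for it, a quick sketch like the one below pulls current star and fork counts from the public GitHub API (the repository paths are the ones in use at the time of writing, and unauthenticated requests are rate-limited):

    # Sketch: compare GitHub stars/forks for popular open-source catalogs.
    import requests

    REPOS = {
        "DataHub": "datahub-project/datahub",
        "Amundsen": "amundsen-io/amundsen",
        "Atlas": "apache/atlas",
    }

    for name, repo in REPOS.items():
        resp = requests.get(f"https://api.github.com/repos/{repo}", timeout=10)
        resp.raise_for_status()
        data = resp.json()
        print(f"{name}: {data['stargazers_count']} stars, {data['forks_count']} forks")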

If you are interested in learning more, please reach out to me at venkat@resolv360.com.

References

https://www.oreilly.com/library/view/data-governance-the/9781492063483/ch01.html

https://www.oreilly.com/radar/governance-and-discovery/


