Acryl Data introduces lineage support and automated propagation of governance information for Snowflake in DataHub

Metadata

Data Governance

Data Lineage

Shirshanka Das

Nov 11, 2021

Introduction

DataHub is the leading open-source metadata platform for the modern data stack. Acryl Data is driving the open-source project in collaboration with LinkedIn and the broader open-source community. The vibrant DataHub open-source community surfaces key use cases across data discovery, data observability, and data governance. As you would expect, Snowflake Data Cloud is very popular in our community and is an integral part of the modern data stack. There are two key themes we have heard repeatedly from the open-source community and our customers:

  1. The need to understand end-to-end lineage for derived datasets in Snowflake
  2. The need to effectively govern data as it flows through multiple systems and reaches the Snowflake Platform

Problem Statement

In a typical enterprise, the data stack spans a wide variety of tools and platforms. Finding data efficiently and understanding the end-to-end lineage of a derived dataset ends up taking 50% of the time in analysis workflows. Stitching together lineage information across multiple tools, often spread across different cloud providers, is a major challenge.

In addition, classifying datasets with the right terms from a standardized governance taxonomy allows policy-driven handling of the data (access control, pseudonymization, etc.). Source datasets are often tagged using manual or automated classification systems, but derived datasets get generated at a rapid rate, and correctly classifying them becomes a losing battle without the right automation. The problem gets worse as data travels across multiple platforms, each with its own conventions for recording governance data.

As an example, an enterprise may ingest data from external sources which end up as AWS Glue tables. Automated quality checks and data classifiers may be run against these tables to apply glossary terms from a standardized governance taxonomy. Depending on the classification, the sensitivity levels of a dataset can vary from “safe to use” to “highly confidential”. After the datasets get loaded into the Snowflake Data Cloud for further analysis, multiple derived tables may get generated. It is important to ensure that the governance information propagates in an automated manner from the source data (AWS Glue tables) to derived datasets in Snowflake so that the right policies are applied.

Solution

Lineage support for derived tables

DataHub’s approach to metadata management is to integrate into the operational fabric and collect the most reliable metadata at the source. We prefer a push-based approach wherever possible to maintain the freshness of metadata. Given the large number of out-of-the-box connectors, enterprises are able to quickly visualize end-to-end lineage across various platforms. An example is shown below.

visualized end-to-end lineage

Until now, it was not possible to capture lineage edges within the Snowflake platform. Derived tables that were created using create-table-as or other forms of copying or transformation were left out of the lineage picture. With the introduction of Snowflake's access history functionality, we are now able to complete the picture by capturing this information.

 access history functionality

In addition to table-level lineage, Snowflake access history functionality also provides information about sets of upstream and downstream columns participating in the lineage. This information is available in the dataset’s custom properties on DataHub as shown below.

Snowflake access history functionality
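
As a rough illustration of how table-level lineage and the participating column sets could be derived from access history, here is a minimal sketch. It is not DataHub's actual connector code. It assumes rows shaped like Snowflake's ACCESS_HISTORY view, where each query lists the base objects it read and the objects it modified; the table names, column names, and the `derive_lineage` helper are hypothetical.

```python
# Hypothetical rows shaped like Snowflake's ACCESS_HISTORY view.
# Table and column names are made up for illustration.
access_history_rows = [
    {
        "query_id": "q1",
        "base_objects_accessed": [
            {
                "objectName": "RAW.PUBLIC.CUSTOMERS",
                "columns": [{"columnName": "EMAIL"}, {"columnName": "ID"}],
            }
        ],
        "objects_modified": [
            {
                "objectName": "ANALYTICS.PUBLIC.CUSTOMER_EMAILS",
                "columns": [{"columnName": "EMAIL"}],
            }
        ],
    },
    {
        # A read-only query modifies nothing, so it yields no lineage edge.
        "query_id": "q2",
        "base_objects_accessed": [
            {"objectName": "RAW.PUBLIC.CUSTOMERS", "columns": [{"columnName": "ID"}]}
        ],
        "objects_modified": [],
    },
]


def derive_lineage(rows):
    """Map each downstream table to its upstream tables and column sets."""
    lineage = {}
    for row in rows:
        for target in row.get("objects_modified", []):
            for source in row.get("base_objects_accessed", []):
                if source["objectName"] == target["objectName"]:
                    continue  # ignore a table reading from itself
                edge = lineage.setdefault(target["objectName"], {}).setdefault(
                    source["objectName"],
                    {"upstream_columns": set(), "downstream_columns": set()},
                )
                edge["upstream_columns"].update(
                    c["columnName"] for c in source.get("columns", [])
                )
                edge["downstream_columns"].update(
                    c["columnName"] for c in target.get("columns", [])
                )
    return lineage


lineage = derive_lineage(access_history_rows)
for downstream, upstreams in lineage.items():
    for upstream, cols in upstreams.items():
        print(upstream, "->", downstream, sorted(cols["upstream_columns"]))
```

Note that this yields sets of columns per edge rather than a column-to-column mapping, which matches the level of detail described above.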

An ingestion recipe with `include_table_lineage: True` in the Snowflake source configuration now populates the Snowflake table-level lineage in DataHub.
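
To make the shape of such a recipe concrete, here is a minimal sketch that builds one as a Python dictionary and prints it. Only the `include_table_lineage` flag comes from the text above. The connection fields, the sink, and all values are placeholder assumptions, and the exact field names may differ between DataHub versions, so check the Snowflake source documentation for your release.

```python
import json

# Ingestion recipe expressed as a plain dict.
# Only `include_table_lineage` is taken from this post; every other field
# and value is a placeholder assumption and may differ in your version.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "host_port": "<account>.snowflakecomputing.com",  # placeholder
            "username": "<snowflake-user>",  # placeholder
            "password": "<snowflake-password>",  # placeholder
            "include_table_lineage": True,
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},  # placeholder
    },
}

# JSON is valid YAML 1.2, so the printed output can serve as a starting
# point for a recipe file.
print(json.dumps(recipe, indent=2))
```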


Automated propagation of governance information

DataHub supports recording governance information through standardized business glossaries. Given the developer-friendly nature of DataHub, glossaries are version-controlled, checked-in artifacts. Here is an example of a simple business glossary file.

DataHub allows defining relationships between terms belonging to different glossaries. For example, here are two glossaries, one for Personal Information and one for Classification:

Personal Information Glossary

Classification Glossary

The Email term from Personal Information has an inherits relationship with the Confidential term from Classification.

Glossary Terms Personalization Email Configuration

Modeling relationships between terms provides powerful flexibility. If the enterprise decides that Email is actually a Highly Confidential term, it is easy to change the inheritance of the Email term, with no need to re-classify data at the source.

Let us go back to the example of an enterprise ingesting data from external sources into AWS Glue tables. After running data classification manually or through automated means, terms from the standardized glossary in DataHub are applied to fields in these tables. At the table level, a Highly Confidential term may be applied if one of the fields inherits from Highly Confidential. As the datasets are loaded into Snowflake for further analysis and derived tables are generated, these tables should automatically be associated with the right classification terms. DataHub Cloud (the Acryl Data managed version of DataHub) allows defining actions in response to metadata changes.
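
Before turning to actions, the term inheritance and table-level roll-up described above can be sketched in a few lines. This is an illustrative model only, not DataHub's glossary implementation. The dotted term names, the `inherits` mapping, the field names, and both helper functions are hypothetical.

```python
# Hypothetical glossary relationships: term -> terms it inherits from.
inherits = {
    "PersonalInformation.Email": ["Classification.Confidential"],
    "Classification.Confidential": [],
    "Classification.HighlyConfidential": [],
}


def resolve(term, inherits):
    """Return the term plus every term it inherits from, transitively."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(inherits.get(current, []))
    return seen


def table_classification(field_terms, inherits):
    """Roll field-level terms up to the classification terms of the table."""
    rolled_up = set()
    for terms in field_terms.values():
        for term in terms:
            rolled_up |= {
                t for t in resolve(term, inherits) if t.startswith("Classification.")
            }
    return rolled_up


# Field-level terms on a made-up Glue table.
glue_table_fields = {
    "email": ["PersonalInformation.Email"],
    "signup_date": [],
}

before = table_classification(glue_table_fields, inherits)

# The enterprise decides Email is Highly Confidential: only the glossary
# relationship changes, the field-level classification stays untouched.
inherits["PersonalInformation.Email"] = ["Classification.HighlyConfidential"]
after = table_classification(glue_table_fields, inherits)

print(before, after)
```

The point of the design is that the field keeps its Email term throughout. Changing one relationship in the glossary is enough to change the classification of every table that carries the term.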

Actions are pre-packaged units of functionality that are extensible through both no-code (config) and low-code (Python) approaches. They execute within the DataHub Cloud platform and are able to respond to metadata changes within seconds. Using actions, one can listen for and react to any important change, such as schema changes, ownership updates, lineage edge changes, tags or terms being added or dropped, documentation being edited, etc.

The Term Propagator Action (available only in DataHub Cloud) detects changes in classification terms on all the Glue tables and then, using lineage information present within DataHub, automatically propagates terms to all the derived tables within Snowflake. This action can also navigate relationships between terms to ensure that traits from related terms are propagated forward as well.
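
The propagation idea can be illustrated with a breadth-first walk over a lineage graph. This is a minimal sketch under stated assumptions, not the Term Propagator Action's implementation, which this post does not describe. The dataset names, the `downstreams` mapping, and the `propagate_term` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> its direct downstreams.
downstreams = {
    "glue.customers": ["snowflake.raw.customers"],
    "snowflake.raw.customers": [
        "snowflake.analytics.customer_emails",
        "snowflake.analytics.daily_signups",
    ],
}


def propagate_term(start, term, downstreams, dataset_terms):
    """Add `term` to `start` and to every dataset downstream of it."""
    queue, visited = deque([start]), {start}
    while queue:
        dataset = queue.popleft()
        dataset_terms.setdefault(dataset, set()).add(term)
        for child in downstreams.get(dataset, []):
            if child not in visited:  # guard against cycles and repeats
                visited.add(child)
                queue.append(child)
    return dataset_terms


# A term added to the Glue table reaches all derived Snowflake tables.
dataset_terms = propagate_term(
    "glue.customers", "Classification.Confidential", downstreams, {}
)
for dataset in sorted(dataset_terms):
    print(dataset, sorted(dataset_terms[dataset]))
```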

Conclusion

Capturing complete lineage information within the Snowflake platform allows for end-to-end understanding of how derived tables are generated. DataHub Cloud’s term propagation action leverages this information to propagate governance information automatically across multiple platforms. Try out the lineage feature for Snowflake in the open-source DataHub project. Sign up for DataHub Cloud from Acryl Data, which is currently in private beta.
