
The 3 Must-Haves of Metadata Management — Part 2

Metadata Management

Open Source

Data Engineering

Best Practices

DataHub

Maggie Hays

Oct 29, 2022


I’m back with another post on metadata must-haves. Last time, I spoke about Metadata 360 and how combining logical and technical metadata helps you manage and use metadata effectively. Today, I’m going to focus on a metadata management principle that I’m personally very, very enthusiastic about: Shift Left.

In principle, Shift Left refers to the practice of declaring and emitting metadata at the source, i.e., where the data is generated. This means that instead of treating metadata as an afterthought (all too often the case!) and annotating it later, we emit metadata right where the code is managed and maintained.

This is important for two reasons:

1) It helps us meet developers or teams where they are — instead of forcing new processes or workflows upon them for the sake of documentation.


2) It plays a significant role in understanding the downstream implications of any change and in identifying breaking changes.


DataHub and Shift Left

To understand this, let’s go back to the example of our friends at Long Tail Companions (LTC) that we spoke about in Part 1. (Missed it? Read it here: The 3 Must-Haves of Metadata Management — Part 1)

Long Tail Companions' Fragmented Data Stack

Shift Left: Metadata in Code

The LTC Team can use meta blocks within the schema YAML for their dbt model to define metadata at source, as shown below.

With this, the LTC Team can define a fully customizable meta block to capture the most critical metadata next to the code that generates the data, assigning:

  • asset ownership
  • model maturity status (Production or Development, for instance)
  • PII status
  • domain (common in organizations that are adopting Data Mesh)

schema.yml
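A meta block along these lines might look like the following sketch. The model name and meta keys here are illustrative, and how each key maps onto DataHub owners, tags, and domains depends on how your dbt ingestion is configured (for example, its meta_mapping settings):

```yaml
version: 2

models:
  - name: pet_adoptions            # hypothetical LTC model name
    description: "One row per completed pet adoption"
    meta:
      owner: "@maggie"             # asset ownership
      model_maturity: production   # maturity status: production or development
      contains_pii: true           # PII status
      domain: ecommerce            # Data Mesh domain
```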

This way, the owner of a dbt model can focus on building out the model, assigning it to different domains, and assigning tags to it, all within code.

And the data catalog (DataHub, in this example) can bubble it all up, along with the associated context.

How DataHub surfaces metadata added at source in its UI


Let’s also look at another application of Shift Left — this time, with LTC’s Ecommerce team that works with Kafka and Protobuf.

The team can simply annotate their schemas while adding them to their datasets (or topics, as they are called in Kafka).
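As a sketch, such a schema annotation might look like the following Protobuf definition. Note that the custom option names here (meta.classification, meta.team, and so on) are assumptions for illustration; the actual options come from whatever meta.proto definitions your DataHub Protobuf integration is configured to recognize:

```protobuf
syntax = "proto3";

package ecommerce;

// Assumed: a meta.proto file defining the custom options used below.
import "meta.proto";

message SearchEvent {
  // Message-level options: classification and owning team.
  option (meta.classification) = SENSITIVE;
  option (meta.team) = "Ecommerce";

  string query = 1;

  // Field-level option marking the IP address as sensitive.
  string ip_address = 2 [(meta.field_classification) = SENSITIVE];
}
```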

Shift Left: Schema Annotations

In the example of the Kafka Search Event above, you can see a few additional annotations marked as options, such as:

  • a classification option (Classification.Sensitive)
  • a team option (Ecommerce)
  • an IP address field with a sensitive classification

This approach ensures that schema annotations live alongside Protobuf schemas — putting business context and business metadata in line with their schemas.

Shift Left: declare & collect metadata at the source

And on DataHub, searching for the Search Event surfaces the individual elements of those schemas, mapped directly into tags, terms, and documentation.

Additionally, the team can use schema linters to validate that a schema has the required annotations before pushing metadata artifacts to DataHub via their CI/CD pipelines.
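As one way to picture that check, here is a minimal, hypothetical lint in Python. It assumes the schema.yml has already been parsed into a dict (with PyYAML or similar), and the required keys are a stand-in for whatever your team’s metadata policy demands:

```python
# Hypothetical policy: every dbt model must declare these meta keys.
REQUIRED_META_KEYS = {"owner", "model_maturity", "contains_pii"}

def missing_meta_keys(schema: dict) -> dict:
    """Given a parsed dbt schema.yml, return {model_name: [missing meta keys]}."""
    problems = {}
    for model in schema.get("models", []):
        meta = model.get("meta") or {}
        missing = REQUIRED_META_KEYS - meta.keys()
        if missing:
            problems[model["name"]] = sorted(missing)
    return problems

schema = {
    "models": [
        {"name": "pet_profiles",
         "meta": {"owner": "@maggie", "model_maturity": "production", "contains_pii": True}},
        {"name": "adoption_events", "meta": {"owner": "@ecommerce"}},
    ]
}
print(missing_meta_keys(schema))
# → {'adoption_events': ['contains_pii', 'model_maturity']}
```

A CI step can fail the build whenever this returns a non-empty dict, so undocumented models never reach DataHub.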

I hope these examples using dbt and Kafka illustrate how the Shift Left principle can be tailored to different teams’ tools and development patterns, while ensuring the same discovery experience within DataHub.

Shift Left for Impact Analysis

Another important aspect of shifting left is moving our focus leftward, toward the systems where data is produced, to understand the downstream impact of changes.

And here’s why emitting metadata at the source helps: it gives you a robust knowledge graph with a reliable view of interdependencies and how different components work together. The right data catalog can then put this rich metadata to work for impact analysis.

DataHub’s Lineage Impact Analysis feature offers a snapshot view of all downstream resources, so individuals can proactively reach out to the right folks for conversations about breaking changes.

Dependency Impact Analysis in DataHub


You can look at the lineage, understand dependencies, and even export all of this information to a CSV.
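If you’d rather automate this, the same lineage graph is queryable programmatically. As an illustrative sketch, a GraphQL query against DataHub’s searchAcrossLineage endpoint (the dataset URN below is hypothetical) might look like:

```graphql
# Find everything downstream of the SearchEvent Kafka topic.
query downstreamImpact {
  searchAcrossLineage(input: {
    urn: "urn:li:dataset:(urn:li:dataPlatform:kafka,SearchEvent,PROD)"
    direction: DOWNSTREAM
    query: "*"
    start: 0
    count: 50
  }) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
```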

Need any help understanding how you can use impact analysis in DataHub? Ask us on our Slack channel or check out the DataHub Lineage Impact Analysis feature guide.

That’s it from me for now… oh, and one last thing before I go: do check out Shirshanka’s excellent blog post on Shifting Left on Governance.

