
Metadatathon - Why Metadata Matters

Metadata Management

Stephen Swoyer

Apr 16, 2024


Online payments powerhouse PayPal had a pretty common problem.

Its teams couldn’t find the data they needed to do their work, either because it was siloed across separate environments, wasn’t properly documented, or lacked lineage and other essential metadata.

So Vaidehi Sridhar, product manager at PayPal, came up with a clever solution: a hackathon—but for documentation and metadata. The goal of Sridhar’s and PayPal’s “Metadatathon” was to crowdsource the labor involved in documenting and adding rich context to the company’s distributed data assets.

“Lack of documentation was one of the major problems most of our users called out,” Sridhar explains.

Two Sides of the Same Coin

The Metadatathon would not only transform the data discovery experience for PayPal’s users, it would also improve data quality while simplifying some of the more tedious aspects of data governance for Sridhar and her team. Any way you looked at it, it was a win-win.

After all, data discovery and data quality are two sides of the same coin. You can’t discover data if it isn’t well-described, and you can’t use data unless it is both well-documented and of high quality.

In a very real sense, then, description and documentation are integral elements of data quality.


One Platform for Data Discovery, Observability, and Governance

PayPal had standardized on Acryl Cloud, a fully managed SaaS offering based on the open-source DataHub project, to provide easy-to-use data discovery and collaboration capabilities for its teams.

But on top of enabling a rich, self-serve discovery experience, Acryl Cloud also provided automated discovery and classification capabilities, best-in-class data observability features, and built-in support for data contracts and data products. Combining all of this into a single platform equipped Sridhar and her team with the capabilities they needed to monitor and maintain data quality across PayPal’s sprawling data ecosystem—as well as better understand, manage, and govern this ecosystem.

But first, PayPal’s users needed to discover on their own what a modern data catalog and metadata platform like Acryl Cloud could do for them. “One of the major objectives behind this hackathon was also to spread awareness, to start bringing more and more people to come to start using [Acryl Cloud],” she said.

Laying a Solid Foundation for Data Discovery

For PayPal to achieve these goals, it would first need to enrich its data assets with meaningful and consistent documentation and metadata. This was necessary for several reasons, including:

  • “Good” documentation and rich metadata would equip data leaders and practitioners with the knowledge required to understand PayPal’s data sources, their structures, and the data management and governance processes that should apply to them.
  • Lineage metadata is especially critical, not just for regulatory compliance, problem resolution, impact analysis, and change management, but also for the work of data practitioners—especially PayPal’s data scientists—who need to understand the provenance and history of data assets, along with their dependencies on or relationships to other data assets.
  • “Good” documentation also includes information about ownership, which helps accelerate problem resolution and makes it easier to coordinate on issues of data use and governance.
  • “Good” documentation also encompasses field notes, annotations, and other collateral, including diagrams or drawings. This content aids data leaders, practitioners, and consumers in understanding, properly using, and (when necessary) modifying data assets.
  • Together, documentation and metadata provide essential information and context about the proper handling of data; its sensitivity; pertinent usage, sharing and movement restrictions; retention period (if any); and the appropriate procedures for destroying it.
  • Finally, “good” documentation and rich metadata promote both reusability and reproducibility, which are important not just for operationalizing and maintaining PayPal’s production analytics and ML/AI solutions, but for demonstrating fairness, ethical alignment, and compliance.
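To make the list above a little more concrete, here is a minimal Python sketch of what a documented data asset might carry, and of how lineage lets you walk back to a dataset’s origins. It is illustrative only: the `DatasetRecord` class, its field names, the `upstream_lineage` helper, and the sample dataset names are all assumptions made for this example, not Acryl Cloud’s or DataHub’s actual metadata model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical metadata record for one data asset (not a DataHub type)."""
    name: str
    description: str = ""
    owners: list[str] = field(default_factory=list)
    # Names of the datasets this one is derived from (its direct lineage).
    upstream: list[str] = field(default_factory=list)
    sensitivity: Optional[str] = None     # e.g. "public", "internal", "restricted"
    retention_days: Optional[int] = None  # None means no retention period recorded


def upstream_lineage(name: str, catalog: dict[str, DatasetRecord]) -> list[str]:
    """Return every dataset that `name` depends on, directly or indirectly."""
    seen: list[str] = []
    stack = list(catalog[name].upstream)
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.append(current)
        if current in catalog:
            stack.extend(catalog[current].upstream)
    return seen


# Made-up sample catalog for illustration.
catalog = {
    "raw_payments": DatasetRecord("raw_payments", owners=["ingest-team"]),
    "clean_payments": DatasetRecord("clean_payments", upstream=["raw_payments"]),
    "fraud_features": DatasetRecord("fraud_features", upstream=["clean_payments"]),
}

print(upstream_lineage("fraud_features", catalog))
# ['clean_payments', 'raw_payments']
```

With lineage recorded this way, a data scientist looking at `fraud_features` can see at a glance which upstream assets it depends on. Without it, the same answer has to be reverse engineered.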

An Object Lesson in Why Metadata Matters

As Sridhar saw it, the Metadatathon would deliver other, “soft” benefits, too, like fostering community and accountability among PayPal’s cross-functional teams. This was all part of her master plan.

First, she anticipated, the Metadatathon would provide an object lesson in the value of good documentation and rich metadata. Participants would get to witness this in real time, with assets they’d tagged and documented becoming discoverable in Acryl Cloud, automatically profiled and classified.

Plus, they would receive feedback and encouragement from peers on other teams, who would be able to discover, explore, and understand their data assets. This hands-on experience would make what had been an abstract concept—“metadata management”—concrete and actionable for PayPal’s teams.

Second, there was the practical, utilitarian aspect: like a barn-raising bee, or a Habitat for Humanity build, the Metadatathon would accomplish in days what it would take a dedicated team months (or longer) to complete. PayPal could rapidly populate Acryl Cloud with rich, contextual metadata as teams created or improved the documentation specific to their data assets. Not only would this metadata be ingested and cataloged by Acryl Cloud, but it would also provide a firm foundation both for responsible data discovery and usage and for metadata-driven data management and governance.

Third, and arguably most importantly, there was a social or communal aspect: as with a Habitat for Humanity build, PayPal’s Metadatathon enlisted people to work together as part of a collective, concerted effort to accomplish a specific goal. Teams were empowered to take time out from their day-to-day tasks and responsibilities to work toward this common goal. This collective effort would nurture a sense of community and shared purpose and concretely demonstrate the importance of well-documented, richly contextual data assets, while also facilitating cross-functional collaboration and knowledge sharing among teams. By working together in the Metadatathon, teams would bridge silos, setting a precedent (and laying a foundation) for collaborative projects and initiatives across PayPal.

Moving the Needle on Data Quality, Too

There was one other dividend, too: The Metadatathon would surface data quality issues, both known and unknown. From the perspective of both self-service discoverers and governance leaders, poorly documented data sources or datasets are ipso facto poor-quality data. Unlabeled datasets are ipso facto poor-quality data. Datasets that lack detailed lineage metadata are ipso facto poor-quality data.

But using Acryl Cloud to search for, discover, and explore datasets and other assets would also surface incomplete, inconsistent, corrupted, and stale data. And users would get a hands-on feel for how discovery, data quality, documentation, and rich metadata are bound up with one another.

Basically, data assets that lack documentation and rich metadata aren’t:

Traceable. You can’t trace a dataset back to its origins, so you can’t understand what was done to it, by whom or what, and for what purposes. But you also can’t identify at what point it became corrupted, inconsistent, or stale. Correcting problems with datasets entails tedious, laborious reverse engineering.

Accountable. Without clear documentation of a dataset’s ownership or custodianship, it’s difficult to determine who created it in the first place, and who is responsible for maintaining it.

Repairable. When you don’t have information about the structure, source, lineage, or freshness of a dataset, “fixing” it is like solving a puzzle without a reference image. Much more difficult, actually.

Reliable. The data in the dataset might actually be reliable, but you don’t know this. And it’s nearly impossible to establish its reliability without access to documentation and metadata describing the circumstances of its creation, conditioning and preparation, and (not least) its purpose or intended use.
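The four properties above can be read as a simple checklist. The Python sketch below shows one way such a check might look. It rests on loud assumptions: the `AssetMetadata` class, its fields, and the mapping from each property to specific fields are hypothetical simplifications for this example, not a feature or API of Acryl Cloud or DataHub. In particular, it treats any dataset with no recorded upstream as untraceable, which would be too strict for a genuine source system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AssetMetadata:
    """Hypothetical summary of what is documented for one dataset."""
    name: str
    description: str = ""                # purpose, preparation, intended use
    owners: list[str] = field(default_factory=list)
    upstream: list[str] = field(default_factory=list)
    schema_documented: bool = False
    last_updated: Optional[str] = None   # ISO date string, if freshness is recorded


def documentation_gaps(asset: AssetMetadata) -> list[str]:
    """List which of the four properties the asset's metadata fails to support."""
    gaps = []
    if not asset.upstream:
        gaps.append("traceable")     # no lineage recorded
    if not asset.owners:
        gaps.append("accountable")   # no owner or custodian recorded
    if not (asset.schema_documented and asset.upstream and asset.last_updated):
        gaps.append("repairable")    # structure, lineage, or freshness unknown
    if not asset.description:
        gaps.append("reliable")      # purpose and preparation undocumented
    return gaps


# Made-up sample assets for illustration.
undocumented = AssetMetadata("orders_tmp")
documented = AssetMetadata(
    "orders",
    description="Cleaned order events used for revenue reporting.",
    owners=["data-eng"],
    upstream=["raw_orders"],
    schema_documented=True,
    last_updated="2024-04-01",
)

print(documentation_gaps(undocumented))
# ['traceable', 'accountable', 'repairable', 'reliable']
print(documentation_gaps(documented))
# []
```

A check like this says nothing about whether the data itself is correct. It only flags where missing documentation makes that question impossible to answer.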

An Exciting Start

Metadata hacking isn’t a one-and-done thing. Sridhar anticipates holding a rolling series of Metadatathons to maintain and improve PayPal’s documentation. A modern data catalog and metadata platform like Acryl Cloud makes this much easier, automatically integrating with PayPal’s data sources and automatically monitoring datasets, charts, workbooks, dashboards, and other assets.
