Hack Your Way to Data Quality with a "Metadatathon"

Metadata Management

Data Governance

How To

Data Discovery

Blog

Compliance

Data Catalog

Metadata

Stephen Swoyer

Mar 19, 2024

PayPal Product Manager Vaidehi Sridhar probably wouldn’t call herself a mastermind, but she did execute on a genius idea to improve data quality across PayPal’s sprawling ecosystem.

Even better, Sridhar and her team laid the groundwork for PayPal to federate data governance for its decentralized teams—as well as automate many types of tedious governance tasks.

The genius idea? A metadata hackathon, or “Metadatathon.”

A Data Quality Catch-22


Most organizations, especially large enterprises, recognize that a metadata platform is a foundational element of any advanced data ecosystem. To fill this requirement, PayPal selected Acryl Cloud, the fully managed SaaS offering based on DataHub, the open-source modern data catalog and metadata platform.

There was one slight problem. As at any large company, a small share of PayPal’s data assets lacked useful documentation and context. And without rich, descriptive metadata and documentation, even the most powerful data catalog and metadata platform is starved for information, kind of like Sherlock Holmes attempting to solve a mystery without knowing the facts of the case.

Moving the Needle on Metadata


Sridhar saw PayPal’s Metadatathon as a way to (wait for it) hack away at this deficiency. And it worked!

In just 15 days, PayPal’s teams documented and added metadata to almost 3,000 datasets and nearly 90,000 columns—a 20% increase in documentation and metadata.

The Metadatathon also helped acquaint teams with the power of Acryl Cloud, tripling monthly active users (from 400 to 1,200) in just two months.

By any measure, it was a runaway success.

So how did Sridhar and her team pull this off? Let’s find out!

Metadatathon Organizing 101

The success of PayPal’s Metadatathon owes everything to the hard work of Sridhar and her team, who had to figure out the logistics of hosting and scaling a distributed hackathon across 20 internal teams.

At a minimum, this required streamlining data access for hundreds of self-service users. But Sridhar’s team also had to establish clear criteria PayPal could use to determine a winner of the Metadatathon.

Because what’s a hackathon without a winner?

Here’s how Sridhar and her team went about it.

1. Strong top-down support and commitment from PayPal’s leadership.

Like it or not, initiatives of this kind always require strong, visionary backing from leaders. In PayPal’s case, leadership recognized that the documentation gap was a company-wide issue, and lent active support to the Metadatathon. This was communicated down the chain of command from the top, so team members would clearly understand the value and necessity of participating.

“Our leadership team was a great support because they understood that this is a company-wide problem that cannot be just solved by one person or one team.” - Vaidehi Sridhar

2. Interview and gather feedback from users.

PayPal started by identifying and prioritizing its most-queried and most searched-for data assets. This was just low-hanging fruit, however.

The company wanted to hear directly from the teams that either couldn’t find the data they needed when they needed it, or couldn’t easily use it, because it lacked adequate documentation.

The goals were to (1) better understand the search and discovery experience; (2) identify hidden or unmet user needs; and (3) formalize a set of criteria PayPal could use to improve the usability of data, starting with guidelines and standards for documenting and enriching data.

3. Set clear boundaries.

Sridhar and her team set boundaries for both the duration of the Metadatathon—15 days, from start to finish—and how success would be defined.

The goal wasn’t to document and enrich all of PayPal’s data sources and assets, but to make a measurable dent. To this end, she and her team decided what to prioritize based on feedback from users, along with their analysis of the most popular queries and most searched-for data assets.
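
The scoping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AssetUsage` record, the `prioritize` helper, the dataset names, and the counts are all made up for the example, and adding query and search counts together is one simple ranking choice, not necessarily the one PayPal used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetUsage:
    """Hypothetical usage record for one dataset (not a DataHub type)."""
    name: str
    query_count: int    # how often the dataset was queried
    search_count: int   # how often users searched for it
    documented: bool    # whether it already has a description


def prioritize(assets, limit):
    """Return the `limit` most-used undocumented datasets, most used first."""
    candidates = [a for a in assets if not a.documented]
    # Rank by combined query and search activity; break ties by name.
    ranked = sorted(
        candidates,
        key=lambda a: (-(a.query_count + a.search_count), a.name),
    )
    return ranked[:limit]


# Made-up sample data for illustration only.
assets = [
    AssetUsage("payments.transactions", 900, 300, False),
    AssetUsage("risk.scores", 400, 500, False),
    AssetUsage("marketing.campaigns", 50, 20, False),
    AssetUsage("finance.ledger", 1000, 800, True),  # already documented
]

scope = [a.name for a in prioritize(assets, limit=2)]
print(scope)  # ['payments.transactions', 'risk.scores']
```

Capping the list with `limit` mirrors the idea of a bounded first phase: the heavily used but undocumented assets go first, and everything else waits for a later round.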

“We narrowed down our scope to certain critical data sets for which we wanted to get documentation as phase one, because we … are imagining this crowdsourcing event to be a continuous activity. You cannot get everything done the first time itself.” - Vaidehi Sridhar

4. Anticipate common questions, offer real-time resources.

Sridhar’s team used information gleaned from interviews to compile a list of frequently asked questions. For example, users wanted to know if they could attach image files to documentation in Acryl Cloud. (Yes.)

“These questions … [were] easy for me to answer because Acryl had all of these [features] already available” via its user interface, Sridhar explains. Her team also held office hours and led demos during the hackathon:

“We had multiple demos with the teams, we had office hours, and we also had help from Acryl, which was patient enough to answer all our questions.” - Vaidehi Sridhar

5. Enable access for one and all.

Teams would need to be able to collaborate in documenting and/or adding context to data assets, including both tables and columns in source databases and derived datasets.

To permit access and collaboration at this scale, PayPal leveraged Acryl Cloud’s support for Domains to temporarily organize prioritized data sources and assets into a single domain, called “Metadatathon,” that it could govern using broad-based access controls.

This enabled teams to freely document and enrich metadata without the delays that would have been introduced with standard access controls and access request/approval workflows.
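
As a rough mental model of why a single temporary domain simplifies access, consider the sketch below. It is a plain in-memory illustration, not the Acryl Cloud or DataHub API: the `Domain` class, the `can_edit` method, the group name, and the dataset names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Hypothetical stand-in for a catalog domain (not the DataHub SDK)."""
    name: str
    assets: set = field(default_factory=set)
    editor_groups: set = field(default_factory=set)

    def can_edit(self, user_groups, asset):
        # Editing is allowed when the asset belongs to the domain and the
        # user is in at least one group that was granted edit rights.
        return asset in self.assets and bool(self.editor_groups & set(user_groups))


# One domain, one broad grant, instead of a request per asset.
event_domain = Domain(name="Metadatathon")
event_domain.assets.update({"payments.transactions", "risk.scores"})
event_domain.editor_groups.add("metadatathon-participants")

print(event_domain.can_edit(["metadatathon-participants"], "risk.scores"))     # True
print(event_domain.can_edit(["metadatathon-participants"], "finance.ledger"))  # False
print(event_domain.can_edit(["analysts"], "risk.scores"))                      # False
```

The point of the design is that access is decided once, at the domain level, so participants can start documenting without waiting on per-asset approvals, while assets outside the domain keep their usual controls.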

6. Track and audit contributions.

PayPal used Acryl Cloud’s Timeline API to track which team members edited what—and when. Tracking at this level was essential not just for transparency and accountability, but also for determining winners based on their actual contributions.
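
To make the auditing idea concrete, here is a minimal sketch of tallying contributions from an edit log. It assumes the edit history has already been exported into simple records; the `Edit` shape, the `tally_by_team` helper, the names, and the dates are invented for illustration, and a real Timeline API response is structured differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit record; not the shape of a real Timeline API response."""
    team: str
    editor: str
    target: str   # dataset or column that was documented
    at: datetime  # when the edit happened


def tally_by_team(edits, start, days=15):
    """Count distinct targets each team edited inside the event window."""
    end = start + timedelta(days=days)
    touched = defaultdict(set)
    for e in edits:
        if start <= e.at < end:
            touched[e.team].add(e.target)  # repeat edits to one target count once
    return {team: len(targets) for team, targets in touched.items()}


start = datetime(2024, 1, 8)  # made-up start date
edits = [
    Edit("risk", "ana", "risk.scores.score", datetime(2024, 1, 9)),
    Edit("risk", "ana", "risk.scores.score", datetime(2024, 1, 10)),  # same column again
    Edit("risk", "bo", "risk.scores.model", datetime(2024, 1, 12)),
    Edit("payments", "cy", "payments.transactions.amount", datetime(2024, 1, 20)),
    Edit("payments", "cy", "payments.transactions.currency", datetime(2024, 2, 1)),  # outside window
]

counts = tally_by_team(edits, start)
print(counts)  # {'risk': 2, 'payments': 1}
```

Counting distinct targets rather than raw edits is a deliberate choice in this sketch: it keeps a team from climbing the leaderboard by saving the same description repeatedly.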

7. Lean into experiential learning.

PayPal collected feedback from teams before, during, and after the Metadatathon. Beyond the uses described above, this feedback gave PayPal a basis for future action: developing guidelines for maintaining documentation, improving communication and collaboration among cross-functional teams, and surfacing improvements to be incorporated into DataHub and Acryl Cloud.

“We got a lot of feedback from our users and we are working very closely with Acryl and making sure [these suggestions] are all part of their roadmap, which we will eventually see.” - Vaidehi Sridhar

8. It’s all about community.

Participation wasn’t restricted to PayPal’s internal teams. The company sought and received help from both the DataHub community and Acryl, the vendor behind Acryl Cloud.

Prior to the Metadatathon, Acryl helped PayPal solve several thorny issues, including how to organize and facilitate access to data and how to audit contributions. Acryl experts were also on hand during the hackathon to offer live, real-time assistance.

“Without their support, it would not have been possible to accomplish this success.” - Vaidehi Sridhar

Measuring Success

There was one final step.

After the Metadatathon concluded, Sridhar tasked a team of technical experts and data stewards with reviewing the newly added documentation. Their responsibility was to ensure the accuracy and relevance of the new contributions, as well as prune incorrect or redundant entries.

This team relied on DataHub's Timeline API to track edits and identify the most valuable contributions, which it used to evaluate the competing teams and individuals. Crucially, experts also sought feedback from users.

“After the event, our team spent a lot of time reviewing all this information and getting a sign-off from the users that what they are reading is actually good and useful for them,” Sridhar explains.

Only after reviewing both technical data and feedback from users did Sridhar’s team name the winners of the Metadatathon.

In a sense, this approach incorporated critical elements of the software development lifecycle (like quality assurance and user acceptance testing) to ensure the new contributions were both technically sound and, more importantly, useful to users.
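
A minimal sketch of that two-gate evaluation follows, assuming each contribution has been marked with the outcome of the expert review and the user sign-off. The `Contribution` record, its field names, the `rank_teams` helper, and the sample data are hypothetical; the source does not describe how PayPal actually scored entries.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    """Hypothetical reviewed contribution (field names are illustrative)."""
    team: str
    target: str
    steward_approved: bool  # passed the expert review for accuracy
    user_signed_off: bool   # users confirmed the documentation is useful


def rank_teams(contributions):
    """Rank teams by contributions that passed both review and user sign-off."""
    accepted = Counter(
        c.team for c in contributions if c.steward_approved and c.user_signed_off
    )
    # Sort by accepted count (descending), then team name for a stable order.
    return sorted(accepted.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up sample data for illustration only.
contributions = [
    Contribution("risk", "risk.scores.score", True, True),
    Contribution("risk", "risk.scores.model", True, True),
    Contribution("risk", "risk.scores.region", True, False),   # no user sign-off
    Contribution("payments", "payments.transactions.amount", True, True),
    Contribution("payments", "payments.transactions.fee", False, True),  # failed review
    Contribution("payments", "payments.transactions.currency", True, False),
]

leaderboard = rank_teams(contributions)
print(leaderboard)  # [('risk', 2), ('payments', 1)]
```

Requiring both flags means raw volume alone cannot win: an entry only counts once it is judged accurate and readers confirm it helps them.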

Thorough documentation and rich metadata aren’t created in 15 days, which is why Sridhar envisions the Metadatathon as a rolling event: a periodic means, first, of chipping away at documentation debt and, second, of improving and enriching the quality of PayPal’s data.

By any conceivable metric, she says, the first Metadatathon was a huge success.

“We wanted accountability. I wanted more and more people to start embracing DataHub, with background around it.” - Vaidehi Sridhar

“One of the major objectives behind this hackathon was also to spread awareness, to start bringing more and more people to come to start using DataHub.” - Vaidehi Sridhar

Ready to Get Hacking?

Interested in organizing a Metadatathon of your own? Check out the webinar to catch Sridhar’s full presentation!

PayPal's Data Journey: Driving Increased Data Awareness and Governance at Scale

© 2024 Acryl Data