
The What, Why, and How of Data Contracts

Data Contract

Data Engineering

Metadata

Data Quality

Data Practitioner

Maggie Hays

Mar 14, 2023


Ah, Data Contracts — one of the buzziest topics in the data world. Despite the topic flooding my LinkedIn/Reddit/Substack/Medium feeds, I found myself repeatedly scratching my head, trying to make sense of the hype.

I am once again asking what is a data contract - Bernie Sanders

I wanted to get to the bottom of this, so I crowd-sourced questions and hosted an AMA with Chad Sanderson (one of the biggest proponents of data contracts) and Shirshanka Das (co-founder at Acryl Data) to talk about all things data contracts:

  • The What: What, exactly, is a data contract?
  • The Why: Why do data contracts matter? What are the core use cases behind them? What problems do they solve?
  • The How: How do we implement data contracts? How do we start building them into our data stack?

There’s a lot to unpack here — let’s dig in!

First things first: meet the experts

Chad Sanderson, one of the most prolific voices in the data platform and quality space, runs the Data Quality Camp community. Chad writes at length (https://dataproducts.substack.com/) about data, data products, data modeling, and the future of data engineering and architecture.

Shirshanka Das is the CEO and Co-Founder of Acryl Data (https://www.acryldata.io/), the company maintaining the open-source DataHub project. He spent almost a decade at LinkedIn leading its data platform strategy and founded the DataHub project. He continues to lead the charge on DataHub’s developer-led approaches for modern data discovery, quality, and automated governance.

The What: Defining a Data Contract

Let’s start with the basics.

What, exactly, is a data contract?

At its core, a data contract is an agreement between a producer and a consumer that clearly defines the following (a minimal code sketch follows the list):

  • what data needs to move from a (producer’s) source to a (consumer’s) destination
  • the shape of that data, its schema, and semantics
  • expectations around availability and data quality
  • details about contract violation(s) and enforcement
  • how (and for how long) the consumer will use the data
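
To make that list concrete, here is a minimal sketch of how those pieces might be captured in code. The dataset, field names, SLO, and retention values are purely hypothetical and only illustrate the components above; real contracts are typically expressed in whatever schema or config format your platform already supports.

```python
# A minimal, hypothetical sketch of what a data contract captures.
# The dataset, fields, and thresholds below are illustrative only.
from dataclasses import dataclass, field


@dataclass
class DataContract:
    producer: str                    # team owning the source
    consumer: str                    # team/product consuming the data
    dataset: str                     # what data moves from source to destination
    schema: dict                     # shape of the data: field name -> type
    semantics: dict                  # what each field means in business terms
    freshness_slo_hours: int         # availability expectation
    quality_checks: list = field(default_factory=list)   # data quality expectations
    on_violation: str = "alert-and-quarantine"            # enforcement behavior
    approved_use: str = ""           # how (and for how long) the consumer uses the data
    retention_days: int = 365        # retention policy


contract = DataContract(
    producer="rides-service",
    consumer="finance-analytics",
    dataset="ride_completed",
    schema={"ride_id": "string", "completed_at": "timestamp", "fare_usd": "double"},
    semantics={"fare_usd": "final fare charged to the shipper, in USD"},
    freshness_slo_hours=24,
    quality_checks=["ride_id is unique", "fare_usd >= 0"],
    approved_use="monthly revenue reporting",
    retention_days=395,
)
```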

Data contracts clearly define roles & responsibilities

Data contracts are bi-directional: an effective data contract sets clear expectations for both the producer and consumer of data.

Even more, it holds both producers and consumers accountable for adherence to the contract and is frequently revisited and renegotiated as use cases and/or relevant parties evolve.

This ensures the producer reliably generates high-quality, timely data, while also governing how that data is used downstream. That could mean auditing who has access to the data, how it has been shared with others, or how it has been used or replicated for unforeseen use cases.

Isn’t a data contract just a ________?

Data contracts vs. dataset DDL (Data Definition Language)

Dataset DDL defines the physical storage of data — what your technology will or will not accept as a new record within the storage layer.

While dataset DDL is undoubtedly a part of the data contract, it fails to capture semantic detail (what the data represents), data retention policies (how long the data can be stored), SLA/SLO requirements (when the data will reliably be available for consumption), and more.

Is this a data contract?

Data Contracts vs. Data Products

Look at contracts as inputs to data products: a mechanism on which actual data products can be constructed and fulfilled.

A data product can have multiple data contracts, and multiple data products can rely on the same data contract(s).

The Why: Why should we care about Data Contracts?

Data practitioners’ workflows commonly include rapid iteration and prototyping to find specific slices and dices of data that address business needs. Whether building BI reporting tools, analyses, or training datasets for ML models, data practitioners are expected to prioritize speed of delivering business value over long-term scalability.


By the time a data asset/data product is deployed to production, it’s highly likely to be multiple steps of enrichment and transformation removed from its source. The numerous layers of abstraction make it difficult for original data producers to understand which fields/attributes are critical to driving business value.

Introducing a data contract for these prod-level assets is an effective way to align producers and consumers on the following:

  • technical schema requirements to be enforced upstream to minimize the impact of dropped columns, changes in data types, etc.
  • field- and dataset-level quality assertions to ensure high accuracy in output; no more “garbage-in, garbage-out”
  • Service Level Objectives to set guarantees of when the data will be available for processing
  • retention and masking policies to minimize compliance risk
  • in-scope business use cases to provide line-of-sight to data producers of how their resources are driving revenue

The How: Where do Data Contracts fit within our stacks?

Don’t overthink this one. You can introduce a contract anywhere you see a handoff between a producer and consumer. Keep in mind that you & your team may act as the producer *and* the consumer in your ETL pipelines.

No matter where that handoff happens, contracts should be version-controlled, easily discoverable, and programmatically enforced.

Some suggestions are to define your technical schema with Protobuf, Avro, or the like and store it within a registry. If you use Kafka or Confluent, the Kafka schema registry is a great starting point, but even GitHub works just fine to store contracts.
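
As an illustration, here is a hedged sketch of registering a contract’s technical schema with a schema registry. It assumes the confluent-kafka Python client is installed and a Confluent-compatible registry is reachable at the URL shown; the RideCompleted schema and the subject name are made up for the example.

```python
# Sketch: store the "shape" half of a contract in a schema registry,
# assuming the confluent-kafka Python client and a registry at localhost:8081.
import json

from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

# Hypothetical Avro schema describing the producer's payload.
ride_completed_schema = {
    "type": "record",
    "name": "RideCompleted",
    "fields": [
        {"name": "ride_id", "type": "string"},
        {"name": "completed_at", "type": "long"},
        {"name": "fare_usd", "type": "double"},
    ],
}

client = SchemaRegistryClient({"url": "http://localhost:8081"})
schema = Schema(json.dumps(ride_completed_schema), schema_type="AVRO")

# Registering under a subject gives the schema a versioned, discoverable home;
# the registry can then reject backward-incompatible versions.
schema_id = client.register_schema("ride_completed-value", schema)
print(f"Registered contract schema with id {schema_id}")
```

If you store contracts in GitHub instead, the same schema file can simply live in a versioned repository and be validated in CI.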

Beyond a way to discover and catalog your contracts, you also need to detect and flag violations and take action on them. This means running monitors, programmatically preventing breaking changes, and isolating bad data for review.

Here are three ways to take action against violations:

  • In the CI/CD workflow — e.g., evaluate and prevent schema-breaking changes before they are deployed.
  • On the data itself — If you’re using a stream processing system, you can check each record to validate that it meets the contract’s expectations. Any contract violations are sent to an isolated queue for review, preventing low-quality records from entering the data product (see the sketch after this list).
  • Through a monitoring layer — In this case, after the data arrives, you can look at the statistical distributions of the data and detect any unexpected changes in its shape.
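
To illustrate the second approach, here is a minimal sketch of per-record contract checks with an isolation (dead-letter) queue. The required fields, types, and the fare rule are hypothetical; in practice this logic would run inside your stream processor rather than in a plain loop.

```python
# Hypothetical per-record contract enforcement for a streaming pipeline.
# Valid records continue downstream; violations are isolated for review.

REQUIRED_FIELDS = {"ride_id": str, "completed_at": int, "fare_usd": float}


def contract_violations(record: dict) -> list[str]:
    """Return the list of contract violations for a single record (empty if none)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    if isinstance(record.get("fare_usd"), float) and record["fare_usd"] < 0:
        problems.append("fare_usd must be non-negative")
    return problems


good_queue, dead_letter_queue = [], []

for record in [
    {"ride_id": "r-1", "completed_at": 1678800000, "fare_usd": 42.5},
    {"ride_id": "r-2", "completed_at": 1678800300, "fare_usd": -1.0},  # violates the contract
]:
    problems = contract_violations(record)
    if problems:
        dead_letter_queue.append({"record": record, "violations": problems})
    else:
        good_queue.append(record)
```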

Making a Business Case for Data Contracts

You manage the rest of your software as code. Why not your data?

This framing, Shirshanka shared, resonates with executive leaders, since they are already bought into managing software as code. Focus on the principle of ‘managing data using software engineering practices.’

The most effective way to secure funding for data contracts is to take advantage of existing initiatives and implement contracts iteratively on a subset of the data stack.

Managing Data Contracts at Scale

The big challenge in managing contracts is less a technical one and more a socio-cultural one.

You need to get people who don’t think about downstream data use cases to change their approach and consider playing an active engineering role around the data.

Here’s an approach Chad recommends based on his work at Convoy:

Step 1: Spread awareness

The first step is building awareness of how producers’ data is leveraged downstream.

Convoy had a data contract mechanism for defining column-level dependencies between data sources. Any time an engineer went to change a data source, they could easily see what impact that would have on downstream assets: what would potentially break, the use case, and how important it was.

That went a long way in helping engineers understand the impact of breaking the contract and generating accountability.

Step 2: Meet people where they are

At Convoy, a contract was implemented and defined through a schema registry and a schema serialization framework. Software engineers would use an SDK to define and push new versions of contracts. If backward-incompatible changes were detected, they were surfaced in the engineers’ GitHub workflow.

Whenever possible, meet people where they are and introduce as little change to their existing workflows as possible. The more deviation from their current workflow, the harder it will be to scale.

Data Contracts and the Modern Data Catalog

The cost of creating a single data contract is non-trivial, and managing a large volume of contracts can quickly become challenging; you must ensure that you’re creating contracts on the most valuable data assets.

The data catalog and its underlying metadata graph can help you prioritize which assets require a contract by using the following (a toy scoring example follows the list):

  • data lineage to understand how often business-critical downstream assets reference a dataset
  • data quality assertions and profiling results to determine a dataset’s reliability
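
As a rough illustration of that idea, the toy function below ranks datasets by how many business-critical downstream assets reference them and how often their quality assertions have failed recently. The scoring weights and the metadata shape are assumptions made for the example; a real implementation would pull these signals from your catalog’s lineage and assertion APIs.

```python
# Toy prioritization of which datasets deserve a contract first,
# based on (assumed) lineage and data-quality signals exported from a catalog.

def contract_priority(dataset: dict) -> float:
    """Higher score = stronger candidate for a data contract."""
    downstream_weight = 1.0 * dataset["critical_downstream_refs"]
    reliability_weight = 5.0 * dataset["failed_assertions_last_30d"]
    return downstream_weight + reliability_weight


datasets = [
    {"name": "ride_completed", "critical_downstream_refs": 12, "failed_assertions_last_30d": 3},
    {"name": "marketing_clicks", "critical_downstream_refs": 2, "failed_assertions_last_30d": 0},
]

for d in sorted(datasets, key=contract_priority, reverse=True):
    print(d["name"], contract_priority(d))
```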

Companies like Optum, Saxo Bank, and Zendesk already use this approach. If you’re looking for inspiration, check out how Stripe uses DataHub to solve their observability challenges by encoding their data contracts in Airflow DAGs.

Starting the Data Contract Journey: Advice and Recommendations

Start small

Start with valuable, revenue-generating use cases and introduce constraints gradually: begin with one or two meaningful, easy-to-debug constraints, then layer in more nuanced use cases over time.

Leverage what you have

Don’t look at data contracts as a net-new phenomenon. Maybe you’re already using dbt Tests or encoding quality checks within your Airflow DAGs — treat that as your starting point and build from there.

Phew, we made it through. I hope this cleared up a concept or two to help you get started with data contracts. Best of luck on your data contract journey!

Connect with DataHub

Join us on Slack · Sign up for our Newsletter · Follow us on Twitter

