Thought leadership
December 26, 2022

A gentle introduction to data contracts

Kyle Kirwan

Production services are no longer just generating data as a byproduct. Instead, data is the product and should be treated as such. This is the argument that data contracts make.

They have been a hot topic recently, with Chad Sanderson of Convoy and Andrew Jones of GoCardless both writing lengthy blog posts cheerleading their usage. But are they actually worth building? In this blog post, we explore what they are, how they can be implemented, and their pros and cons.

What is a data contract?

Data contracts are API-like agreements between data producers and data consumers. Their goal is to expose high-quality data that is resilient to change.

In the data contract paradigm, instead of dumping data generated by production services into data warehouses, service owners decide which data to expose to consumers. Then they expose it in an agreed-upon, structured fashion, similar to an API endpoint.

As a result, responsibility for data quality shifts from the data scientist/analyst to the software engineer.

Example of a data contract

Imagine a rideshare application. Production microservices write into the "rides", "payments", "customers", and "trip request" tables in the database. Over time, these schemas evolve as the business runs promos and expands into different markets.

With no action taken, these production tables eventually end up in a data warehouse. Subsequently, any machine learning engineer or data engineer consuming the analogous tables in the data warehouse has to rewrite data transformations upon schema changes.

With data contracts, data analysts and scientists don’t consume near-raw tables in data warehouses. Instead, they consume from an API that has already munged the data and produced a human-readable event, like a “trip request.” The trip request’s metadata (pricing, whether surge pricing applied, promo codes, payment details, reviews) comes attached.
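
To make that concrete, here is a minimal sketch of what such a “trip request” event might look like as a typed record. The field names and types are illustrative assumptions, not any particular company’s schema:

from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional

# Illustrative shape for a "trip request" event exposed through a contract.
# Every field name here is hypothetical; a real contract would be agreed on
# by the service owners and their data consumers.
@dataclass
class TripRequested:
    trip_id: str
    customer_id: str
    requested_at: datetime
    quoted_price_cents: int
    surge_applied: bool
    promo_code: Optional[str]
    payment_method: str  # e.g. "card" or "wallet"

event = TripRequested(
    trip_id="trip_123",
    customer_id="cust_456",
    requested_at=datetime(2022, 12, 26, 8, 30),
    quoted_price_cents=1850,
    surge_applied=False,
    promo_code=None,
    payment_method="card",
)

# Consumers receive this structured record instead of joining raw service tables.
print(asdict(event))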

Pros of data contracts

1. Consumers of data don’t have to worry about recreating the business logic that generated it

The current ELT model, where data is dumped into data warehouses and then transformed in massive joins across different tables, replicates the business logic of the production services that generated the data in the first place.

Data contracts, on the other hand, expose semantic events that are not tied to the transactional database. They should remain compatible as the transactional database evolves. Downstream users no longer need to maintain matching logic and data models.

2. Since it’s a strongly defined schema, you can document it, version it, and have CI/CD on it

Schemas aren’t just items in a Google Doc. They’re usually defined in JSON, Protobuf, or some other schema definition language that can be checked into GitHub, code reviewed, and gate-kept with CI/CD. This brings a level of transparency and standardization that was previously impossible to maintain.
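
As a sketch of what a version-controlled schema plus an automated check could look like, assume the contract lives as a small definition file in the repo. The structure below is an illustration, not a standard:

# contracts/trip_requested_v1.py -- an illustrative, code-reviewable contract definition.
TRIP_REQUESTED_V1 = {
    "name": "trip_requested",
    "version": 1,
    "fields": {
        "trip_id": "string",
        "customer_id": "string",
        "quoted_price_cents": "int",
        "surge_applied": "bool",
    },
}

def violations(event: dict, contract: dict) -> list:
    """Return a list of contract violations; an empty list means the event conforms."""
    python_types = {"string": str, "int": int, "bool": bool}
    problems = []
    for field, type_name in contract["fields"].items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], python_types[type_name]):
            problems.append(f"wrong type for {field}: expected {type_name}")
    return problems

# A conforming event produces no violations.
print(violations(
    {"trip_id": "t1", "customer_id": "c1", "quoted_price_cents": 1850, "surge_applied": False},
    TRIP_REQUESTED_V1,
))  # []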

3. Root-cause analysis is easier when there is a data quality issue

With data quality efforts that focus on monitoring the data warehouse, even if the monitoring tells you that there’s a problem in your data, you don’t necessarily know why. While you can certainly monitor the lineage of tables to get a sense of where the problem originated (Bigeye provides this as a feature), data contracts mean that data quality issues should never have the opportunity to travel downstream in the first place.

Cons of data contracts

1. Difficulty in getting buy-in from software engineers

Since the burden of data quality/data transformation now falls onto software engineers instead of data engineers, implementing data contracts requires a process change. This change can be a tricky sell. Even if software engineers are willing, they may be unfamiliar with data modeling.

2. Difficulty in enforcing the data contract

In theory, data contract enforcement is a matter of good CI/CD. If it doesn’t pass, it doesn’t merge. In practice, tables within organizations are not always created through proper CI/CD. Instead, many tables originate during prototyping/exploration and, somehow, over time end up referenced by downstream services.
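
For illustration, a CI gate could diff the proposed contract against the published one and block the merge when a change would break consumers. This is a sketch of the idea, assuming contracts are the simple field dictionaries from the earlier example:

def breaking_changes(old_fields: dict, new_fields: dict) -> list:
    """Removed fields and changed types break consumers; purely additive changes do not."""
    problems = []
    for field, type_name in old_fields.items():
        if field not in new_fields:
            problems.append(f"removed field: {field}")
        elif new_fields[field] != type_name:
            problems.append(f"type changed for {field}: {type_name} -> {new_fields[field]}")
    return problems

published = {"trip_id": "string", "quoted_price_cents": "int"}
proposed = {"trip_id": "string", "quoted_price_cents": "float", "promo_code": "string"}

problems = breaking_changes(published, proposed)
if problems:
    # In CI this exits non-zero, so the pull request doesn't merge.
    raise SystemExit("contract check failed: " + "; ".join(problems))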

3. Data consumer needs may change

In theory, data contracts should be designed in a backwards-compatible way. In practice, they probably still need occasional modifications. For instance, using the rideshare example from above, the data contract can handle changes in the metadata of trip requests: new pricing algorithms, for example, or changes to how names are displayed. But what if the machine learning team suddenly needs information about food orders? That’s a new/different entity that would need a separate data contract established.

Implementing data contracts

While Sanderson and Jones agreed on the broad strokes of what data contracts mean and why people should use them, they outlined slightly different implementations at their employers.

At Convoy, Chad Sanderson follows these steps to implement data contracts:

  1. Come up with an enterprise data model
  2. Teams that own production services define entities and events using Protobufs
  3. Events that occur to these entities are published to Kafka (a pub-sub service); see the sketch after this list
  4. Teams consume data directly from Kafka
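
As a rough sketch of step 3, here is how publishing a contract-defined event to Kafka might look with the kafka-python client. The topic name and payload are made up, and JSON serialization is used only for brevity where Convoy’s approach uses Protobuf-defined events:

import json
from kafka import KafkaProducer  # pip install kafka-python

# JSON serialization keeps the sketch short; Convoy defines entities and events with Protobuf.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical topic and payload for a contract-defined event.
producer.send("rides.trip_requested.v1", {
    "trip_id": "trip_123",
    "customer_id": "cust_456",
    "quoted_price_cents": 1850,
})
producer.flush()  # downstream teams consume from the topic, not from raw service tables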

At GoCardless, Andrew Jones follows these steps for data contract implementation:

  1. The producing team uses JSON to define the schemas for the data they want to make available
  2. They categorize the data and choose their service needs
  3. Once the JSON file is merged into GitHub, dedicated BigQuery and Pub/Sub resources are automatically deployed and populated with the requested data via a Kubernetes cluster
  4. The consuming team gets their desired data from their dedicated BigQuery dataset, as sketched below
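
For the consuming side (step 4), reading from the dedicated dataset could be a plain query with the google-cloud-bigquery client; the project, dataset, and table names here are placeholders:

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Placeholder names; in this model the dataset and table are provisioned
# automatically once the contract's JSON definition is merged.
query = """
    SELECT trip_id, quoted_price_cents, surge_applied
    FROM `my-project.contracts.trip_requested_v1`
    WHERE requested_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""

for row in client.query(query).result():
    print(row["trip_id"], row["quoted_price_cents"])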

As you can see, both GoCardless and Convoy use the same basic ingredients when creating data contracts:

  • Definition of entities and events
  • A contract defined in some schema definition language
  • A pub-sub system to handle events

What’s the difference between data contracts and data SLAs?

Here at Bigeye, we’ve talked a lot about data SLAs, and you might be wondering what the difference is between data SLAs and data contracts.

As a reminder, SLAs are agreements between the producers and consumers of a service that set performance expectations for that service. Data SLAs are agreements between the producers and consumers of data that set certain metadata expectations for that data, e.g. freshness and accuracy.

Data contracts complement data SLAs. While data SLAs guarantee meta-properties about the data, data contracts guarantee what the data actually is.
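
Put in terms of checks, the difference might look like this sketch: a contract check asserts the shape and meaning of each record, while an SLA check asserts a meta-property such as freshness. The names and thresholds below are illustrative:

from datetime import datetime, timedelta, timezone

# Contract-style check: does each record have the agreed-upon shape?
def meets_contract(record: dict) -> bool:
    required = {"trip_id", "quoted_price_cents", "surge_applied"}
    return required <= record.keys()

# SLA-style check: did the data land recently enough (here, within the last hour)?
def meets_freshness_sla(latest_load_time: datetime,
                        max_delay: timedelta = timedelta(hours=1)) -> bool:
    return datetime.now(timezone.utc) - latest_load_time <= max_delay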

