Thought leadership
-
February 15, 2023

Data in Practice: Systematizing data quality at Uber-scale

In the "Data in Practice" series, we talk to real data engineers who have put data reliability engineering concepts into practice, learning from their challenges and successes in the real world.

Liz Elfman

Uber revolutionized transportation by connecting millions of bikes, riders, drivers, and restaurants. Behind this transformation lies a complex data stack. This post, adapted from a presentation by Sriharsha Chintalapani and Sanjay Sundaresan at Meta’s Data Observability Learning Summit, looks at some of the challenges Uber faced in maintaining data quality at that scale and the solutions the company implemented to tackle them.

History of Uber’s data infrastructure

Uber's data infrastructure has evolved significantly since the company's launch. In the early days, Uber had a monolithic data pipeline that was responsible for collecting, storing, and processing all data. At the height of this platform, there were 300,000+ unowned datasets. More specifically, the data pipeline consisted of:

  • A sharded MySQL database as the “online database”
  • A CDC pipeline powered by Hive that took data from the online database and pushed it to the data lake (in a 24-hour process)
  • Once in the data lake, data was categorized as “raw tables”
  • A data warehouse team that would turn raw tables into dimension tables and fact tables
  • Utilities and tools on top of the data warehouse for data scientists' usage

The need to build a data observability platform

With the huge number of pipelines and datasets, several issues arose. In particular:

  • Data duplication: No one knew which data existed, so teams felt it was most convenient to create their own version of the data
  • No visibility into data lineage and freshness: No one knew when exactly certain data was landing in tables
  • Data quality: There was no way to gauge the quality of the data teams were seeing (e.g. in dashboards)

In the years of Uber’s hypergrowth (2015-2016), teams concentrated on scaling the data infra itself, rather than investing in the data product. Soon after, Uber realized that a stronger data foundation was a top priority.

Uber’s principles for data

Uber applied the following principles to arrive at a better data culture:

  • Data as code: Data is treated as code and is managed in a similar way to software. The artifacts are reviewed, and any schema change done in production goes through the review process. Producers of the data, as well as consumers of the data, are tagged during the review process. This approach makes it easier to track changes, version data, and collaborate with others.
  • Data is owned: Data must be owned by the business or functional teams that use it. Each team clearly defines the intent of its data product and artifacts, owns them, and provides guarantees around the data. This makes teams responsible for the quality of the data they use and motivates them to improve it.
  • Data quality is known: Data quality is continuously monitored and measured, with SLA targets used as part of the assertions. Every dataset is assigned a tier, and the tier determines its default SLA values. This lets teams identify and fix data quality issues quickly.
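
To make the "data as code" principle concrete, here is a minimal Python sketch of how a schema change might be diffed and routed to reviewers. It is an illustration only, not Uber's tooling: the `SchemaChange` type, the `diff_schema` and `reviewers_for` functions, and the team names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaChange:
    column: str
    kind: str  # "added", "removed", or "type_changed"


def diff_schema(old: dict[str, str], new: dict[str, str]) -> list[SchemaChange]:
    """Compare two {column: type} mappings and list the differences."""
    changes = []
    for col in sorted(old.keys() | new.keys()):
        if col not in new:
            changes.append(SchemaChange(col, "removed"))
        elif col not in old:
            changes.append(SchemaChange(col, "added"))
        elif old[col] != new[col]:
            changes.append(SchemaChange(col, "type_changed"))
    return changes


def reviewers_for(changes: list[SchemaChange],
                  producers: list[str],
                  consumers: list[str]) -> list[str]:
    """Tag both producers and consumers whenever the schema changes (assumed policy)."""
    if not changes:
        return []
    return sorted(set(producers) | set(consumers))


# Hypothetical example: a type change plus a new column on a trips table.
old = {"trip_id": "string", "fare": "double"}
new = {"trip_id": "string", "fare": "decimal", "city": "string"}
changes = diff_schema(old, new)
print(changes)
print(reviewers_for(changes, ["trips-team"], ["pricing-team", "finance-team"]))
```

In a real setup the diff would more likely run as part of the review process itself, so a schema change cannot reach production without the tagged teams seeing it.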

With the implementation of these principles, Uber moved from a platform of self-serving tools to a more regulated, owned, and responsible data platform.

Data observability at Uber in 2021

Fast forward to the present day: Uber has built out a data observability platform with the following components:

Tiering

At Uber, not all data is equally important. The company implemented a tiering concept for its data assets (tables, pipelines, ML models, and dashboards). Tier 1 indicates an extremely important dataset, while Tier 5 indicates an individually owned dataset, generated in staging environments, that comes without any guarantees.

After all datasets were tiered, the company identified 2,500 Tier 1 and Tier 2 tables (out of 130k+) as extremely important. That way, Uber could focus its efforts on ensuring the quality of the most important data, while still providing visibility into all data.
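
One way to picture tiers driving default SLA values is a simple lookup table. The sketch below is illustrative: the `SlaDefaults` type and every threshold in `DEFAULT_SLAS` are invented for the example and are not Uber's actual values.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SlaDefaults:
    max_staleness: Optional[timedelta]  # None means no freshness guarantee
    paged_on_call: bool


# Hypothetical defaults per tier; the numbers are placeholders.
DEFAULT_SLAS = {
    1: SlaDefaults(timedelta(hours=6), True),
    2: SlaDefaults(timedelta(hours=24), True),
    3: SlaDefaults(timedelta(days=3), False),
    4: SlaDefaults(timedelta(days=7), False),
    5: SlaDefaults(None, False),  # individually owned, no guarantees
}


def default_sla(tier: int) -> SlaDefaults:
    """Look up the default SLA for a tier, rejecting unknown tiers."""
    if tier not in DEFAULT_SLAS:
        raise ValueError(f"unknown tier: {tier}")
    return DEFAULT_SLAS[tier]


def is_critical(tier: int) -> bool:
    """Tier 1 and Tier 2 assets are the ones singled out for the most attention."""
    return tier in (1, 2)


print(default_sla(1))
print(is_critical(3))
```

Keeping the defaults in one table means a dataset only needs a tier assigned to inherit sensible targets, and owners can override them where needed.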

Databook, Uber’s data catalog

Uber's in-house catalog is called Databook. Databook makes data exploration and discovery much easier for Uber’s engineers, data scientists, and operations teams. It serves as a user interface on top of dataset metadata like:

  • Quality signals
  • Tiers
  • Data asset owners
  • Products enabled by the data

Lineage

Databook also provides information about lineage, or the relationships between different datasets. Information around lineage helps engineers understand how data flows through the pipeline, from source to destination.
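
Lineage is naturally modeled as a directed graph, and "what does this table feed?" becomes a graph traversal. The following sketch assumes a hypothetical edge map (`LINEAGE`) with made-up dataset names; it does not reflect Databook's actual data model or API.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets listed as its value.
LINEAGE = {
    "raw.trips": ["dwh.fact_trips"],
    "raw.cities": ["dwh.dim_city"],
    "dwh.fact_trips": ["dash.weekly_trips"],
    "dwh.dim_city": ["dash.weekly_trips"],
}


def downstream(edges: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first search for every dataset reachable from `start`."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(edges: dict[str, list[str]], start: str) -> set[str]:
    """Reverse the edges, then reuse the same traversal to find all sources."""
    reversed_edges: dict[str, list[str]] = {}
    for parent, children in edges.items():
        for child in children:
            reversed_edges.setdefault(child, []).append(parent)
    return downstream(reversed_edges, start)


print(sorted(downstream(LINEAGE, "raw.trips")))
print(sorted(upstream(LINEAGE, "dash.weekly_trips")))
```

The downstream view answers "who is affected if this table is late?", while the upstream view answers "where could this dashboard's bad number have come from?".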

Uber's data quality system going forward

To ensure data quality, Uber implemented a data quality system. Once a certain dataset is labeled as Tier 1 or Tier 2, it automatically onboards into a set of data quality checks and foundational guarantees, ensuring that the data is:

  • Documented
  • Owned
  • Connected to PagerDuty on-call
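
The three guarantees above can be expressed as a small validation step that runs when a dataset is promoted to Tier 1 or Tier 2. This is a sketch under assumptions: the `DatasetMetadata` fields and the `missing_guarantees` function are hypothetical names, not part of Uber's system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetMetadata:
    name: str
    tier: int
    description: str = ""
    owner_team: Optional[str] = None
    pagerduty_service: Optional[str] = None  # hypothetical on-call identifier


def missing_guarantees(ds: DatasetMetadata) -> list[str]:
    """Return the foundational guarantees a Tier 1/2 dataset still lacks."""
    if ds.tier not in (1, 2):
        return []  # lower tiers are not held to these guarantees in this sketch
    missing = []
    if not ds.description.strip():
        missing.append("documented")
    if not ds.owner_team:
        missing.append("owned")
    if not ds.pagerduty_service:
        missing.append("on-call")
    return missing


ds = DatasetMetadata(name="dwh.fact_trips", tier=1, owner_team="trips-team")
print(missing_guarantees(ds))
```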

These guarantees essentially mean that the data asset is treated like a service. Additionally, all Tier 1/Tier 2 data assets are monitored on a set of metrics including:

  • Freshness: measures how recent the data in a dataset is. It can be determined by comparing the timestamp of the data to the current time, or by comparing it to a known source of truth.
  • Completeness: measures how much of the expected data is present in a dataset. It can be determined by comparing the number of rows or columns to a known expected value. This metric also ensures that all data present upstream is also present downstream.
  • xDC consistency: measures the consistency of data across different data centers. It can be determined by comparing data in different data centers for the same key or by using a hashing function to compare data across data centers.
  • Duplicates: measures the number of duplicate records in a dataset. It can be determined by comparing primary keys or by using a hashing function to compare records.
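
The four metrics above can each be reduced to a small function. The sketch below shows one plausible shape for them in plain Python. The function names are hypothetical, and the in-memory inputs stand in for what would be queries against real tables.

```python
import hashlib
from collections import Counter
from datetime import datetime, timedelta, timezone


def freshness_lag(latest_event_time: datetime, now: datetime) -> timedelta:
    """Freshness: how far the newest record lags behind the current time."""
    return now - latest_event_time


def completeness_ratio(actual_rows: int, expected_rows: int) -> float:
    """Completeness: fraction of the expected rows that actually arrived."""
    if expected_rows <= 0:
        raise ValueError("expected_rows must be positive")
    return actual_rows / expected_rows


def duplicate_count(primary_keys: list) -> int:
    """Duplicates: number of extra records that share a primary key."""
    counts = Counter(primary_keys)
    return sum(n - 1 for n in counts.values() if n > 1)


def row_fingerprint(row: dict) -> str:
    """Hash a row's columns in sorted order so equal rows hash equally."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def xdc_mismatches(dc_a: dict, dc_b: dict) -> set:
    """xDC consistency: keys missing from one data center or differing by hash."""
    mismatched = set()
    for key in dc_a.keys() | dc_b.keys():
        a, b = dc_a.get(key), dc_b.get(key)
        if a is None or b is None or row_fingerprint(a) != row_fingerprint(b):
            mismatched.add(key)
    return mismatched


now = datetime(2023, 1, 1, 6, 0, tzinfo=timezone.utc)
latest = datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)
print(freshness_lag(latest, now))
print(completeness_ratio(95, 100))
print(duplicate_count(["a", "b", "a", "a"]))
```

Each result would then be compared against the SLA target for the dataset's tier, for example alerting when the freshness lag exceeds the allowed staleness.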

Additionally, Uber allows users to set up custom checks on top of these standard metrics. This gives users the ability to define specific checks that are relevant to their use case and to monitor the data in a way that is most meaningful to them.
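
A common pattern for user-defined checks is a registry that dataset owners add functions to, alongside the standard metrics. The sketch below illustrates that pattern only; the `custom_check` decorator, the `run_checks` function, and the example check are hypothetical and are not Uber's interface.

```python
from typing import Callable

# A check takes a list of row dicts and returns True when the data passes.
Check = Callable[[list], bool]

CUSTOM_CHECKS: dict[str, Check] = {}


def custom_check(name: str):
    """Decorator that registers a user-defined check under a readable name."""
    def register(fn: Check) -> Check:
        CUSTOM_CHECKS[name] = fn
        return fn
    return register


@custom_check("fare_is_non_negative")
def fare_is_non_negative(rows: list) -> bool:
    # Example business rule: a trip fare should never be negative.
    return all(row["fare"] >= 0 for row in rows)


def run_checks(rows: list) -> dict[str, bool]:
    """Run every registered check and report pass/fail per check name."""
    return {name: check(rows) for name, check in CUSTOM_CHECKS.items()}


print(run_checks([{"fare": 12.5}, {"fare": 0.0}]))
```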

Conclusion

Uber's approach to data reliability and quality is built on the principles of data as code, data ownership, and data quality. To put these principles into practice, the company intentionally constructed processes and tooling for visibility into all data and metadata, with a particular emphasis on key pipelines and datasets.

