Thought leadership - November 8, 2022

Data in Practice: Anomaly detection for data quality at Netflix

Netflix is the streaming service we all know and love, delivering over six petabytes of data to customers daily. This post covers some of Laura Pruitt's insights into how Netflix maintains the quality of its core streaming dataset.

Kyle Kirwan

Netflix, the video streaming service that we all know and love, has 223 million subscribers in countries all around the world, watching over 200 million hours of content each day. If you assume that one hour of Netflix HD content is three GB of data, then Netflix is delivering over six petabytes of data to customers every single day. This data, once collected and aggregated, sheds light on the streaming experience from both the perspective of the viewer and that of the server.

Laura Pruitt is Director of Streaming, Platform, and Security Data Science and Engineering at Netflix. This blog post covers some of her insights into how the company maintains the quality of this core dataset.

How Netflix streaming works

Netflix has custom-built servers that hold video, audio, and subtitle files. These servers are distributed around the world, as close to customers as possible. The goal of this localization is that when customers stream content, the data never has to travel very far.

To outline the lifecycle of watching a TV show on Netflix: you’ve found something you want to watch, and your device sends a request to one of these servers asking for a piece of content. The server sends the first chunk of that video back to you, which your device decodes and renders. While it is decoding and rendering, the device keeps asking the server for more data, and the server keeps sending it. All of this happens in real time.
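To make the shape of that loop concrete, here is a toy sketch in Python. The in-memory "server" and the decode/render stand-ins are illustrative assumptions only; the real client handles adaptive bitrate, buffering, DRM, and much more.

```python
# A toy sketch of the fetch-decode-render loop described above.
# FAKE_SERVER, fetch_chunk, and the decode/render stand-ins are
# illustrative assumptions, not Netflix's actual client or protocol.

FAKE_SERVER = {"some-show-episode-1": [b"chunk-0", b"chunk-1", b"chunk-2"]}

def fetch_chunk(content_id: str, index: int) -> bytes | None:
    """Stand-in for asking a nearby server for the next chunk of a title."""
    chunks = FAKE_SERVER.get(content_id, [])
    return chunks[index] if index < len(chunks) else None

def play(content_id: str) -> None:
    index = 0
    while (chunk := fetch_chunk(content_id, index)) is not None:
        frames = chunk.decode()          # stand-in for decoding the chunk
        print(f"rendering {frames}")     # stand-in for rendering the frames
        index += 1                       # immediately request the next chunk

play("some-show-episode-1")
```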

While all of this is happening, Netflix is collecting a lot of information from both the device and the server (sketches of what these telemetry records might look like follow the two lists below). From the device side:

  • Who are you as a customer?
  • What device are you streaming on?
  • How long did it take for that video to load?
  • Did you experience any errors or interruptions during the course of this playback?

From the server side:

  • Which ISP was the server connected to in order to deliver the content?
  • How many bytes did the server transfer?
  • How long did it take for those bytes to arrive at their destination?
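Put together, the telemetry from a single playback session might look something like the records below. This is only a sketch; the field names are illustrative assumptions, not Netflix's actual schema.

```python
# Hypothetical telemetry records, sketched as plain Python dicts.
# Field names are illustrative assumptions, not Netflix's real schema.

device_event = {
    "customer_id": "abc-123",          # who the customer is
    "device_type": "android_phone",    # what device they are streaming on
    "startup_latency_ms": 1840,        # how long the video took to load
    "errors": ["NW-2-5"],              # errors or interruptions during playback
    "rebuffer_count": 1,
}

server_event = {
    "isp": "ExampleNet",               # ISP the server was connected to
    "bytes_transferred": 52_428_800,   # how many bytes the server sent
    "transfer_duration_ms": 4200,      # how long those bytes took to arrive
    "session_id": "abc-123-session-9",
}
```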

All of these raw logs land in Amazon S3, which is Netflix’s central data hub. From S3, the data is routed into additional services like Redshift and Kinesis.

What Pruitt’s team does

Pruitt’s team runs ETL pipelines that use business logic and windowing to process these raw logs into a dataset that provides a unified view of both the customer experience and the network experience. This dataset sees several billion new records every day and is a core dataset at Netflix.
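To give a feel for what "business logic and windowing" can mean in practice, here is a minimal pandas sketch that sessionizes raw playback events into per-session records. The column names and the five-minute idle window are assumptions made for the example, not Netflix's actual pipeline logic.

```python
import pandas as pd

# Toy raw playback events; in reality these are billions of log records in S3.
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "event_time": pd.to_datetime([
        "2022-11-08 20:00:00", "2022-11-08 20:01:30", "2022-11-08 20:45:00",
        "2022-11-08 21:00:00", "2022-11-08 21:02:00",
    ]),
    "bytes_transferred": [100, 250, 300, 80, 120],
    "error": [0, 0, 1, 0, 0],
})

# Windowing: start a new session when a customer is idle for more than 5 minutes.
events = events.sort_values(["customer_id", "event_time"])
gap = events.groupby("customer_id")["event_time"].diff() > pd.Timedelta(minutes=5)
events["session_id"] = gap.astype(int).groupby(events["customer_id"]).cumsum()

# Business logic: roll events up into one record per session.
sessions = events.groupby(["customer_id", "session_id"]).agg(
    session_start=("event_time", "min"),
    total_bytes=("bytes_transferred", "sum"),
    had_fatal_error=("error", "max"),
).reset_index()

print(sessions)
```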

In putting anomaly detection and data integrity checks on this dataset, Pruitt’s team had the following considerations.

Impact    

This is a very important dataset for Netflix. It is used to answer questions and make decisions such as:

  • Which partnerships to invest in
  • Which ISPs or devices can bring valuable partnerships to Netflix
  • Where to invest internal engineering resources
  • Where the service is seeing the most performance issues

“Any dataset should have a bare minimum of checks in place, but this is one that is being used by many different people and we are making pretty important decisions with it, so it makes sense to make additional investments in making sure the data is of high quality,” Pruitt said.

Data Integrity

In addition to the devices and the servers, there are several more data sources in this pipeline. Each of these data sources is a place where things can go wrong. Examples of data integrity issues that might pop up include the following (a sketch of record-level checks for these follows the list):

  • Missing data
  • Unexpected datatypes
  • Unexpected NULLS
  • Malformed records that prevent key-value pairs from being parsed
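A record-level check for these issues might look like the sketch below. The key-value log format and the expected fields are assumptions made for illustration, not Netflix's actual formats or checks.

```python
# Minimal sketch of record-level integrity checks on raw log lines.
# The "key=value&key=value" format and EXPECTED_FIELDS are assumptions.

EXPECTED_FIELDS = {"customer_id": str, "bytes_transferred": int, "error_code": str}

def check_record(raw_line: str) -> list[str]:
    """Return a list of integrity problems found in one raw log line."""
    problems = []
    try:
        record = dict(pair.split("=", 1) for pair in raw_line.split("&"))
    except ValueError:
        return ["malformed record: cannot parse key-value pairs"]

    for field, expected_type in EXPECTED_FIELDS.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"missing or NULL value for {field}")
            continue
        try:
            expected_type(value)
        except ValueError:
            problems.append(f"unexpected datatype for {field}: {value!r}")
    return problems

print(check_record("customer_id=a&bytes_transferred=100&error_code=OK"))  # []
print(check_record("customer_id=a&bytes_transferred=oops&error_code="))   # two problems
```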

Pruitt’s team found that it’s best to detect these sorts of data integrity issues before the ETL process (Netflix, it seems, chooses to monitor its data at the source; see our blog post about whether to monitor at the source or the destination). They do this via a metadata service that gives them high-level metadata metrics on their tables, including the following (a sketch of how such metrics might be computed follows the list):

  • Is the partition loaded?
  • How many rows are there?
  • What’s the min and max value that exists within that column?
  • What’s the cardinality of that column?
  • If a certain amount of data is being thrown away during ETL processing, what percentage is that?
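Computing these table-level metadata metrics could look roughly like the following pandas sketch. The table, the column names, and the way the dropped-row percentage is derived are all assumptions; a real metadata service would compute this per partition and persist the results rather than printing them.

```python
import pandas as pd

# Toy partition of a source table; in reality this is read from the warehouse.
partition = pd.DataFrame({
    "customer_id": ["a", "b", "c", None],
    "bytes_transferred": [100, 250, None, 80],
})

rows_received = 5          # e.g., a count reported by the upstream producer (assumption)
rows_loaded = len(partition)

metadata_metrics = {
    "partition_loaded": rows_loaded > 0,
    "row_count": rows_loaded,
    "bytes_transferred_min": partition["bytes_transferred"].min(),
    "bytes_transferred_max": partition["bytes_transferred"].max(),
    "customer_id_cardinality": partition["customer_id"].nunique(),
    # share of records thrown away somewhere during ETL processing
    "dropped_row_pct": 100 * (rows_received - rows_loaded) / rows_received,
}

print(metadata_metrics)
```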

Netflix has built reusable frameworks, shared between data engineering teams and data platform teams, to make sure that these basic, generic data quality issues are addressed on source tables. For example, every time a service writes out data, the producer can audit it before it’s published, confirming that the main metadata metrics look good before the data is made available for downstream consumption.
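That write, then audit, then publish flow can be sketched as follows. The function names, checks, and thresholds here are illustrative assumptions rather than Netflix's actual framework.

```python
# Minimal sketch of a write-audit-publish flow: the producer writes to a
# staging location, audits the metadata metrics, and only then publishes.
# Names and thresholds are illustrative assumptions.

class AuditFailure(Exception):
    pass

def audit(metrics: dict) -> None:
    """Fail loudly if the basic metadata metrics look wrong."""
    if not metrics["partition_loaded"] or metrics["row_count"] == 0:
        raise AuditFailure("partition is empty")
    if metrics["dropped_row_pct"] > 1.0:
        raise AuditFailure(f"{metrics['dropped_row_pct']:.2f}% of rows were dropped")

def write_audit_publish(data, metrics: dict, staging: dict, published: dict, key: str) -> None:
    staging[key] = data                 # 1. write to a staging table
    audit(metrics)                      # 2. audit before exposing it downstream
    published[key] = staging.pop(key)   # 3. publish for downstream consumption

staging_tables, published_tables = {}, {}
write_audit_publish(
    data=[{"customer_id": "a"}],
    metrics={"partition_loaded": True, "row_count": 1, "dropped_row_pct": 0.2},
    staging=staging_tables,
    published=published_tables,
    key="playback_sessions/2022-11-08",
)
print(published_tables.keys())
```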

Business metrics

This data pipeline produces dozens of metrics that the company cares about, including things like:

  • Error rates
  • Customers’ consumption of Netflix

Additionally, these metrics often have extremely high dimensionality, because Netflix operates in hundreds of countries and across thousands of ISPs. This makes it challenging to figure out where things are going wrong when there are so many permutations.

For example, consider a business metric like the global playback error rate: the percentage of sessions that end in a fatal error for customers. Suppose that rate spikes, but the spike is actually caused only by Android phones in Brazil. Pruitt’s team needs to identify and annotate this before the CEO comes knocking on the door.

To deal with this high cardinality, Netflix relies on anomaly detection. The team pre-aggregates data to grains they believe are meaningful (devices, countries) and sends it to an anomaly detection service, which sends back the data points it thinks are anomalous. This pre-aggregation is an effort to reduce the dimensionality of the metrics.
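The overall shape of that workflow (pre-aggregate to a meaningful grain, then ask a detector which points look unusual) might resemble the sketch below. The rolling z-score is only a stand-in for whatever the real anomaly detection service does, and the data, grains, and column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy daily error rates per (country, device) grain; in reality these come
# from pre-aggregating the multi-billion-row session dataset.
rng = np.random.default_rng(7)
days = pd.date_range("2022-10-01", periods=30)
frames = []
for country, device in [("BR", "android_phone"), ("US", "smart_tv")]:
    frames.append(pd.DataFrame({
        "day": days,
        "country": country,
        "device": device,
        "error_rate": rng.normal(0.02, 0.002, size=len(days)),
    }))
agg = pd.concat(frames, ignore_index=True)

# Inject a spike: Android phones in Brazil on the last day.
agg.loc[(agg["country"] == "BR") & (agg["day"] == days[-1]), "error_rate"] = 0.15

# Stand-in for the anomaly detection service: flag points far from the trailing week.
def flag_anomalies(group: pd.DataFrame, threshold: float = 6.0) -> pd.DataFrame:
    trailing = group["error_rate"].shift(1).rolling(7, min_periods=3)
    z = (group["error_rate"] - trailing.mean()) / trailing.std()
    return group.assign(anomalous=z.abs() > threshold)

flagged = agg.groupby(["country", "device"], group_keys=False).apply(flag_anomalies)
print(flagged[flagged["anomalous"]])  # expected to surface BR / android_phone on the last day
```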

In terms of alerting, Pruitt's team started conservatively. It picked the top metrics it cared about and alerted only on those, routing the alerts to the right people over email.
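That conservative starting point can be as simple as an allowlist of top metrics, each with an owner to notify. The metric names, email addresses, and SMTP setup below are illustrative assumptions, not Netflix's actual alerting stack.

```python
# Minimal sketch of conservative alert routing: only a handful of top metrics
# are alerted on, each with a clear owner. Names and addresses are made up.
import smtplib
from email.message import EmailMessage

TOP_METRIC_OWNERS = {
    "global_playback_error_rate": "streaming-oncall@example.com",
    "hours_streamed": "product-analytics@example.com",
}

def alert_if_anomalous(metric: str, value: float, anomalous: bool, smtp_host: str = "localhost") -> None:
    owner = TOP_METRIC_OWNERS.get(metric)
    if not anomalous or owner is None:
        return  # not a top metric, or nothing unusual: stay quiet
    msg = EmailMessage()
    msg["Subject"] = f"[anomaly] {metric} = {value:.4f}"
    msg["To"] = owner
    msg["From"] = "data-quality-bot@example.com"
    msg.set_content(f"The anomaly detection service flagged {metric} as anomalous.")
    with smtplib.SMTP(smtp_host) as smtp:  # assumes an SMTP relay is reachable
        smtp.send_message(msg)
```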

Conclusion

At Netflix, data quality translates directly into informed decisions that affect both the viewing experience of its customers and the company's bottom line. The company has made a wise decision to invest in it.
