Thought leadership
February 27, 2023

Data observability doesn't fix anything

Working on data quality used to mean simply "fixing the issues." Today, the understanding is more nuanced. Data quality techniques don't actually fix anything. Why not? Read on.

Kyle Kirwan

The rising interest in data quality goes beyond data and engineering teams. Search volume for the term in 2022, for example, was up 30% compared to prior years.

While the concept of “data quality” is decades old, newer techniques like data observability and pipeline testing have helped formalize the practice. Historically, data quality was understood as a technical discipline that would “fix the issues.” Today, that understanding has evolved: unlike previous paradigms, modern data quality techniques don’t aim to fix the data at all. Why is that? Let’s dig in.

The traditional approach to data quality

Legacy tools changed the data itself. Several years ago, the term “data wrangling” was frequently bandied about, referring to a data scientist going through several steps to clean up the data, reshape it, and ready it for use. Data wrangling might be done with Trifacta, pandas (a Python library), or dplyr (an R package). That “wrangling” would also fold in data quality tools and processes.
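To make that concrete, here is a minimal sketch of that style of wrangling in pandas; the file, column names, and cleaning rules are all hypothetical:

```python
import pandas as pd

# Hypothetical raw export with the usual problems: duplicate rows,
# inconsistent casing, and unparsed date strings.
raw = pd.read_csv("orders_export.csv")

cleaned = (
    raw.drop_duplicates(subset="order_id")  # remove duplicate orders
    .assign(
        country=lambda df: df["country"].str.strip().str.upper(),  # normalize casing
        order_date=lambda df: pd.to_datetime(df["order_date"], errors="coerce"),
    )
    .dropna(subset=["order_date"])  # drop rows whose dates failed to parse
)
```

Notice that this approach mutates the data itself: the cleaned copy is “fixed,” but whatever produced the duplicates and bad dates upstream is left untouched.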

This process addressed data quality at the last possible moment along the data pipeline: immediately before the data was used for analysis. That’s one problem.

Another problem lies in master data management (“MDM”). Traditional MDM systems aimed to create a clean, central copy of the data. Cleansing is central to MDM, because the master copy must be clean; IBM DataStage, Informatica, and SAP all use terms like “cleanse,” “enrich,” and “validate” in their MDM-oriented data quality tooling.

But these techniques only aim to clean up the master copy. Making data more reliable and accurate in the long term requires solving the root causes, not just fixing a problem with one copy of the data.

The modern approach: Pipeline testing

Pipeline testing emerged at companies like Uber, Netflix, Intuit, and Airbnb as a way to identify problems within their data pipelines. Pipeline tests check data for factors like freshness and completeness. These tests sometimes stop the pipeline, but they are rarely used to actually change the data.

Pipeline testing is analogous to the testing conducted in software engineering. Unit tests are used to identify symptoms of problems. Then, the engineer can track down the root cause, solve it, and rerun the unit test to confirm the solution worked. The tests themselves aren’t used to modify what the program is actually doing.
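To illustrate the analogy, here is a minimal sketch of two pipeline tests written pytest-style; the table name, the thresholds, and the `run_query` helper are assumptions made for the example:

```python
from datetime import datetime, timedelta, timezone

from warehouse import run_query  # hypothetical helper: runs SQL, returns one scalar


def test_orders_freshness():
    # Symptom check: fail if the newest row is more than 24 hours old.
    latest = run_query("SELECT MAX(loaded_at) FROM analytics.orders")
    assert datetime.now(timezone.utc) - latest < timedelta(hours=24)


def test_orders_completeness():
    # Symptom check: fail if the last day's load is suspiciously small.
    row_count = run_query(
        "SELECT COUNT(*) FROM analytics.orders "
        "WHERE loaded_at >= CURRENT_DATE - INTERVAL '1 day'"
    )
    assert row_count > 10_000
```

A failing test can halt the run, but neither test rewrites a single row; the fix still happens at the root cause.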

That brings us to the one big limitation that many large data teams ran into as they conducted pipeline testing: scale. Hand-writing and maintaining tests for hundreds or thousands of tables quickly becomes impractical.

The promises of observability

Observability is used in software engineering to detect problems with the live performance of infrastructure and applications. If software goes down, it doesn’t matter whether a unit test should have prevented it; somebody needs to know. Observability solves that problem.

In data engineering, data observability fills a similar role for the operational health of the pipeline and the quality of the data inside. If anything goes wrong, a data observability platform lets data engineers and data scientists know where, how, and ultimately why the issue occurred.
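As a toy illustration of the detection side, here is a sketch of the kind of freshness check an observability platform automates across thousands of tables; the load-gap history and the three-sigma threshold are made up for the example:

```python
import statistics

# Hypothetical hours-between-loads for one table, as collected from
# warehouse metadata; the last gap breaks the learned pattern.
load_gaps_hours = [1.0, 1.1, 0.9, 1.2, 1.0, 6.5]

history, latest_gap = load_gaps_hours[:-1], load_gaps_hours[-1]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Alert when the latest gap falls far outside the historical pattern.
if abs(latest_gap - mean) > 3 * stdev:
    print(f"ALERT: freshness anomaly ({latest_gap:.1f}h vs. typical {mean:.1f}h)")
```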

From there, the solution happens at the root cause. That might mean fixing a web form that doesn’t validate for manual data entry errors. Or, it might mean replacing an expired API token, or fixing a bug in a dbt model, or whatever else caused the pipeline issue or data quality problem.
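The web-form example can be as small as adding server-side validation so bad entries never reach the pipeline in the first place. A hypothetical sketch, with made-up field rules:

```python
import re

# Reject obviously malformed manual entries at the source, instead of
# cleansing them downstream. The fields and rules here are illustrative.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_signup(form: dict) -> list[str]:
    errors = []
    if not EMAIL_RE.match(form.get("email", "")):
        errors.append("email: not a valid address")
    if not form.get("country", "").strip():
        errors.append("country: required")
    return errors  # an empty list means the entry is safe to accept
```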

In all of these cases, the data teams aren’t looking to “cleanse” the data. That’s not how they see data quality monitoring. They care less about a singular central data model looking pristine (even if under the hood it relies on covering up various potholes). Their priority lies in ensuring that the pipeline is running smoothly day-to-day.

Final thoughts

MDM is an important technique, especially in enterprises merging unmatched data from multiple lines of business, often built up over the course of multiple acquisitions. In these cases, cleansing techniques might be required because hunting down and solving every real root cause isn’t always practical.

But for everyone else, finding the root cause and fixing it ASAP is the key to data quality. It’s the key to preventing those quality issues from haunting the organization for weeks, months, or years to come.

Data observability works when engineers and data teams can partner to create a problem-solving culture, rather than treating data quality as “another team’s problem.” While data observability itself doesn’t fix anything, it works in concert with a robust data quality culture to fortify data management across the board.

By putting a detection and resolution plan in place, data teams make the pipelines themselves increasingly anti-fragile over time. That robustness leads to higher reliability and less toil for data science and analytics teams who put that data to work.

| Resource | Monthly cost ($) | Number of resources | Time (months) | Total cost ($) |
| --- | --- | --- | --- | --- |
| Software/Data engineer | $15,000 | 3 | 12 | $540,000 |
| Data analyst | $12,000 | 2 | 6 | $144,000 |
| Business analyst | $10,000 | 1 | 3 | $30,000 |
| Data/product manager | $20,000 | 2 | 6 | $240,000 |
| Total cost | | | | $954,000 |
| Role | Goals | Common needs |
| --- | --- | --- |
| Data engineers | Overall data flow. Data is fresh and operating at full volume. Jobs are always running, so data outages don't impact downstream systems. | Freshness + volume monitoring; schema change detection; lineage monitoring |
| Data scientists | Specific datasets in great detail. Looking for outliers, duplication, and other—sometimes subtle—issues that could affect their analysis or machine learning models. | Freshness monitoring; completeness monitoring; duplicate detection; outlier detection; distribution shift detection; dimensional slicing and dicing |
| Analytics engineers | Rapidly testing the changes they’re making within the data model. Move fast and not break things—without spending hours writing tons of pipeline tests. | Lineage monitoring; ETL blue/green testing |
| Business intelligence analysts | The business impact of data. Understand where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem. | Integration with analytics tools; anomaly detection; custom business metrics; dimensional slicing and dicing |
| Other stakeholders | Data reliability. Customers and stakeholders don’t want data issues to bog them down, delay deadlines, or provide inaccurate information. | Integration with analytics tools; reporting and insights |
