7 ways to level up your dbt tests
Dbt tests are a great first step for organizations that want to improve data quality and reliability. But after a while, a follow-up step becomes necessary. That's where observability comes in.
When a company using dbt starts to experience data pipeline problems, it usually turns to dbt tests. That makes sense; it's the logical first step in mitigation.
Dbt tests work wonderfully for organizations that want to improve data quality and reliability. But after a time, a second step becomes necessary. Testing has an inherent limitation: tests only give you information about the areas you actually test. What about the areas you haven't tested because you don't have the time or resources?
At a certain scale, organizations shouldn't just test their data pipelines; they also need to fold in data observability. In this blog post, we'll cover 7 ways to go deeper with dbt tests. Plus, we'll cover the indicators that it's time to add a data observability component to your data pipeline.
1. Understand the different types of dbt tests
There are two types of dbt tests:
Data Tests are specific queries that you run against a specific model. These live in the “tests” folder of your dbt project.
Schema Tests are generic tests that can be configured and applied to multiple models. Out of the box, dbt has four schema tests that can be used immediately: unique, not_null, accepted_values, and relationships. If you want to write your own custom schema tests, these are written as macros with the prefix `test` and stored in the `macros` folder.
If you have one model that you want to validate in a specific way, you can simply write a data test in the tests folder.
If you find yourself copy-pasting that query and configuring it for multiple models, consider converting it from a data test into a schema test, so you can define the test in one place and apply it to many models (see the sketch below).
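To make that concrete, here is a minimal sketch of a custom schema test: a hypothetical `is_positive` check (the macro, model, and column names are made up for this example). In dbt versions that use the `test_` prefix convention, the macro returns a query counting failing rows, and a non-zero count fails the test; newer dbt versions use "generic tests" that return the failing rows themselves.

```sql
-- macros/test_is_positive.sql (hypothetical example)
-- Counts rows where the column is not positive; dbt fails the test if the count is > 0.
{% macro test_is_positive(model, column_name) %}

select count(*)
from {{ model }}
where {{ column_name }} <= 0

{% endmacro %}
```

Once the macro exists, you can apply it to any model from its schema file:

```yaml
# models/schema.yml (hypothetical model and column names)
version: 2
models:
  - name: orders
    columns:
      - name: amount
        tests:
          - is_positive
```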
2. Use plugins and packages
Dbt has a rich ecosystem of packages that integrate with it, including:
- OpenLineage - an open platform for collection and analysis of data lineage. It tracks metadata about datasets, jobs, and runs. OpenLineage consumes data produced by dbt runs.
- dbt_expectations - a dbt package that lets users deploy Great Expectations-style tests in their data warehouse directly from dbt (example below)
- Unit Tests for Macros
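As a rough sketch of what dbt_expectations looks like in practice (the model, column, and threshold values below are made up, and the package is assumed to be installed via packages.yml), you declare expectations alongside dbt's built-in tests in your schema file:

```yaml
# models/schema.yml (illustrative; assumes dbt_expectations is installed)
version: 2
models:
  - name: orders
    columns:
      - name: amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000
```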
3. Schedule dbt tests for all dbt runs
Ideally, if you have a dbt job that runs daily, the associated dbt tests should run on the same schedule. To do this, configure your dbt job to run multiple commands: source freshness checks, runs, tests, and more. If any step fails, you can get an email and a Slack notification.
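As a hedged sketch, a daily job in dbt Cloud (or any scheduler that can run shell commands) might chain the steps like this, so a failure at any stage can trigger the notification:

```shell
# Illustrative daily job steps; adjust selectors and flags to your project
dbt source freshness   # confirm raw data has landed recently
                       # (on older dbt versions this is `dbt source snapshot-freshness`)
dbt run                # build the models
dbt test               # run data and schema tests against the fresh builds
```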
4. Run dbt tests on a model change/new PR
Whenever a PR with a model change or migration is created in your repo, you should use dbt tests to verify that the changes won't break your dbt models before merging the PR.
The best practice here is to use data model blue-green deployment. This means that your CI/CD tool should first do the `dbt run` and `dbt test` commands targeting a staging database environment OR a staging schema within the production environment. Then, if the tests pass, do `dbt run` against the production environment or the production schema.
To break this down:
Blue-green data model testing
- Run `dbt run` in staging (either a separate staging schema or a separate staging database)
- Run `dbt test` in staging
- If all the tests pass, then run `dbt run` in prod
You should make sure that your staging environment closely mirrors production data, so that the initial staging `dbt test` results are an accurate proxy for what will happen in production.
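A minimal sketch of those steps, assuming your profiles.yml defines `staging` and `prod` targets (the target names are assumptions):

```shell
# Blue-green deployment steps (illustrative target names)
dbt run  --target staging   # build models in the staging schema/database
dbt test --target staging   # validate them there first
dbt run  --target prod      # promote to production only if the tests passed
```

Dbt exits non-zero on failure, so a shell running with `set -e` (or your CI tool's per-step semantics) will stop before the production run if any staging step fails.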
5. Run dbt tests on your source models
To catch potential problems as early as possible, you should always run `dbt test` against your source models first:
- Run dbt tests against the source
- Run the dbt job
- Test the resulting outputs
In other words:
```shell
dbt test --models source:[sources]
dbt run -m [models]
dbt test -m [models]
```
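For reference, here is a sketch of what declaring tests on a source looks like (the source, table, and column names are made up); the `source:` selector above then picks these tests up:

```yaml
# models/sources.yml (illustrative names)
version: 2
sources:
  - name: raw_shop
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
```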
6. Distribute dbt test results
To make sure that you are actually aware when something in your dbt transformations goes wrong, set up an interactive way to display and visualize your dbt test results.
There are a few tools, like re_data and elementary, that do this for you, or you can spin up something custom. The data team at Snapcommerce, for example, leveraged dbt artifacts and the query log to build reporting and alerts on dbt model/job runs and performance (a small DIY sketch follows below).
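As one hedged example of the DIY route: recent dbt versions write a `run_results.json` artifact after every invocation, and assuming you have `jq` available plus a Slack incoming webhook (the URL below is a placeholder), you could forward failing tests like this:

```shell
# Extract failed or errored results from dbt's run artifact and post them to Slack
# (webhook URL is a placeholder; adjust the path if your target directory differs)
jq -r '.results[] | select(.status == "fail" or .status == "error") | .unique_id' target/run_results.json |
while read -r failed_node; do
  curl -s -X POST -H 'Content-type: application/json' \
    --data "{\"text\": \"dbt failure: ${failed_node}\"}" \
    https://hooks.slack.com/services/YOUR/WEBHOOK/URL
done
```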
This may also be a good point to begin considering a more generic data observability tool like Bigeye.
7. When appropriate, add observability into your data pipeline
While testing and observability have the same ultimate goal – detecting problems – they go about it in two different ways.
Testing represents a human with knowledge of the data, expressing a condition that they know should be true. It is opinionated.
Observability, on the other hand, is unopinionated. It is purely asking, do we know the state of our data system at all times? Do we know how the state is changing? And will we know if the state changes in a way that is concerning?
A good analogy for observability is the dashboard on a vehicle. You can do without it, but most people would be uncomfortable not knowing how fast they're going, or whether the engine light is on. On the other hand, it doesn’t make sense to test the speed or the engine light setting at all times.
In conclusion: Optimize dbt testing sophistication
It's increasingly apparent that data engineers are going to some trouble to bridge the gap between dbt tests and a more traditional observability solution. Even so, this kind of hacked-together setup only gives you observability into dbt itself; you remain blind to the other parts of your data stack.
Dbt is super popular, and you should definitely leverage dbt testing. However, as you scale your tests, you may realize that the set of things you want to write tests for is effectively infinite; there's only so much ground you can cover. That realization is a warning sign that it's time to catch the long tail of everything else that can go wrong, beyond your set of tests. To catch the things you don't want to write tests for, and take your analysis to a new level of sophistication, layer on an observability solution. Data observability platforms look at the final data at rest, after everything has been transformed, and typically provide capabilities such as:
- Monitoring
- Schema change detection
- Lineage monitoring