Getting started with dbt tests
Increase data quality by validating your dbt transformations.
When running the data build tool (dbt), you might want to perform simple validations. Perhaps it's worth checking that a source table has successfully been transformed into a model, or that the source table doesn't contain too many nulls.
One great way to validate is through testing. Testing is a supported feature in dbt: you create test cases for your transformations, catch potential issues earlier, and save yourself the effort of fixing them downstream.
In this guide, you will test an ELT pipeline project based on publicly available COVID-19 data from the Google BigQuery Public Datasets Program. We'll walk through creating and running dbt tests on the transformations, covering:
- Overview of testing in dbt
- dbt core built-in tests
- dbt custom tests
Setup
- Create a Snowflake warehouse (you get up to $400 in credits without a credit card)
- Extract and load COVID-19 data from the Google BigQuery Public Datasets Program into the Snowflake warehouse using Airbyte
- Install dbt Core (the dbt CLI)
- Clone this GitHub repo as your dbt project
- Connect dbt Core to your Snowflake warehouse by creating a profiles.yml file in your dbt project and configuring it with your Snowflake warehouse information. You can copy the sample_profile.yml file in the repo as a starting point, as sketched below.
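For reference, a minimal profiles.yml sketch for a Snowflake connection might look like the one below. The profile name and the placeholder values in angle brackets are assumptions you need to replace with your own; the database and schema names match the ones used later in this guide.

covid19_elt:                     # assumed profile name; must match the profile in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <your_account_identifier>
      user: <your_username>
      password: <your_password>
      role: <your_role>
      warehouse: <your_warehouse>
      database: COVID19_DATABASE   # database loaded by Airbyte, as used later in this guide
      schema: AIRBYTE_SCHEMA
      threads: 4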
In this ELT pipeline, the Extract and Load steps are already completed using Airbyte, with an additional Normalization step that transforms the resulting JSON blob fields into columns. This means you can jump right into the Transformation step using dbt.
There are two types of dbt tests:
- Built-in tests: predefined generic tests that ship with dbt Core. They are usually defined as properties in the schema.yml file in your models folder, which is why they are often called schema tests.
- Custom tests: bespoke tests written in the tests folder as one-off SQL SELECT queries that validate the data based on the rows they return.
Besides those two types, you can also use tests from dbt packages such as dbt-expectations and dbt-utils, as sketched below.
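A minimal packages.yml sketch for pulling in those two packages could look like the following; the version ranges are placeholders, so check the dbt package hub for current releases and run dbt deps afterwards to install them:

packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]      # placeholder range
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]    # placeholder range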
When writing dbt tests, it is important to define them in a meaningful and qualitative way based on the data and requirements. The value you get from testing is only as good as how well you define your tests.
In the following sections, you will use both built-in and custom tests.
Built-in tests
In the schema.yml file in the models folder of your dbt project, you define properties that apply dbt's built-in tests to data sources and models. dbt Core ships with four generic tests:
- Uniqueness: tests that the column has unique values.
- Not null: tests that the column does not contain null values.
- Referential integrity: tests that the column values have an existing relationship with a parent reference table.
- Accepted values: tests that the column contains only the specified values.
In your project, define uniqueness and not-null tests on the respective hashID column of each data source, as sketched below.
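A schema.yml entry along these lines applies both tests to the source table. The hashID column name shown here is an assumption (an Airbyte-style _airbyte_..._hashid column); adjust it to the actual column in your normalized table.

version: 2

sources:
  - name: raw_covid19
    database: COVID19_DATABASE
    schema: AIRBYTE_SCHEMA
    tables:
      - name: stg_epidemiology
        columns:
          - name: _airbyte_stg_epidemiology_hashid   # assumed hashID column name
            tests:
              - unique
              - not_null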
To run the tests defined on a specific source table, use the following command:
dbt test --select source:raw_covid19.stg_epidemiology
Running the dbt test command without any arguments executes all the test cases and outputs a summary result. The result of a test can be PASS, WARN, FAIL, ERROR, or SKIP. Below is the result of the test run for the specific data source:
If the run leads to an error, dbt will output the error details. Below is a screenshot of running a test with an error:
To test referential integrity, you use the relationships test. The example below checks that every location_key value in the base_epidemiology table also exists in the location_key column of the base_index table.
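A sketch of that property in schema.yml could look like this, using the model and column names mentioned above:

models:
  - name: base_epidemiology
    columns:
      - name: location_key
        tests:
          - relationships:
              to: ref('base_index')
              field: location_key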
For the accepted values test, you check that location_key contains only values from a predefined list. Notice that the severity level has been configured to WARN: instead of raising an error, dbt reports a warning with additional details.
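A sketch of that check in schema.yml might be the following; the three ISO-style codes are an assumption about how United States, Canada, and Mexico appear in location_key:

models:
  - name: base_epidemiology
    columns:
      - name: location_key
        tests:
          - accepted_values:
              values: ['US', 'CA', 'MX']   # assumed codes for United States, Canada, Mexico
              config:
                severity: warn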
In the example above, if any of the values are not United States, Canada, or Mexico, dbt reports a WARN during the test run. In the example below, 244 records did not match the accepted values criteria:
Custom tests
Custom tests are defined in the tests folder of your dbt project. They are test assertions written as SQL SELECT statements: a custom test passes if its query returns no rows; otherwise it fails. This is illustrated below with three questions about the transformation, each answered by a custom test:
1. Is every record from the epidemiology source table transformed into the model?
-- Test assertion for all records transformed.
select location_key, date
from {{ source('raw_covid19', 'stg_epidemiology') }}
except
select location_key, date
from {{ ref('covid19_model') }}
2. Is every cumulative confirmed case count from the epidemiology source greater than or equal to zero?
-- Test assertion for cumulative confirmed cases greater than or equal to 0.
select location_key, date,
    sum(cumulative_confirmed) as total_cumulative_confirmed
from {{ source('raw_covid19', 'stg_epidemiology') }}
group by location_key, date
having not(total_cumulative_confirmed >= 0)
3. Is the cumulative confirmed count always at least as large as the new confirmed count?
-- Test assertion for cumulative confirmed cases at least as large as new confirmed cases.
select location_key, date,
    sum(new_confirmed) as total_new_confirmed,
    sum(cumulative_confirmed) as total_cumulative_confirmed
from {{ ref('base_epidemiology') }}
group by location_key, date
having (total_new_confirmed > total_cumulative_confirmed)
To run only the dbt custom tests, use the command with the test_type selector argument:
dbt test --select test_type:singular
To illustrate a non-passing test, run the statement below to delete records from the dbt model table in the data warehouse:
-- Remove records from the dbt model
DELETE FROM COVID19_DATABASE.AIRBYTE_SCHEMA.COVID19_MODEL
WHERE Date = '2021-03-23';
-- number of rows deleted: 5,522
Then re-run the custom tests, which should show FAIL for the test that checks whether all records were transformed:
Custom tests are well suited for validating business transformations. In the dbt lineage graph below, the defined tests are included alongside their dependencies. When you execute the dbt test command without options, the custom tests are executed along with the built-in tests (dbt run only builds models; use dbt build to run models and tests together).
Conclusion
This wraps up the beginner’s guide to dbt testing. If you’re interested in digging deeper into dbt tests, read on here:
- Monitoring
- Schema change detection
- Lineage monitoring