Dbt Unit Tests

Jan 28, 2024 · 768 words · 4 minute read

dbt has recently introduced unit tests in addition to their regular tests. According to dbt labs’ release plan, unit tests are to be launched with dbt version 1.8, which is scheduled for release this spring. In the meantime, you can check out the main branch of dbt-core, and run unittests with postgres.

Tests in dbt are run against data in the database, which means that the database becomes a dependency for the tests. A test that passes on the development environment’s data could fail on the production environment’s data, and a test that failed yesterday might pass today because the underlying data changed.

Unit tests look and feel just like unit tests in the rest of the software ecosystem, and allow developers to define their own data to test against. All inputs to a dbt model, along with the expected correct output, are defined in the unit test. The defined output is compared with the actual result, and the unit test only passes if the two are identical.

As with much else in dbt, unit tests are defined in YAML. The main elements of a unit test are the definition of input for upstream references that the dbt model refers to, and an output that represents the table the model is meant to produce with the defined input. The unit test compares the defined output cell by cell with the actual result of the model with the inputs provided.

The strict comparison of actual and expected results can be both an advantage and a disadvantage, but with thoughtful use, it is primarily an advantage. The sorting of rows is not compared, since it is random in SQL anyway. It’s also not necessary to define input or output for columns that are not relevant to the test. Input columns that are not specified become null, and the test only compares the columns present in the output with the actual result. In other words, both the input and output for the test can be a subset of the actual input and output of the model. This makes it easy to define multiple small unit tests with a few columns, rather than having to define potentially hundreds of columns.

An example test, where we test a model named country_report, that does a join of two tables, and a count group by:

version: 2

models:
  - name: country_report

unit_tests:
  - name: unit_report # this is the unique name of the test
    model: country_report # name of the model I'm unit testing
    given: # list of inputs to provide as fixtures
      - input: ref('seed_customers')
        format: csv
        rows: |
          ix,name,country
          1,Frodo,middle earth
          2,Mr. Tumnus,narnia
          3,Lucy,narnia          
      - input: ref('seed_countries')
        format: csv
        rows: |
          name
          middle earth
          narnia          
    expect: # the expected output given the inputs above
      format: csv
      rows: |
        country_name,customer_count
        narnia,2
        middle earth,1

The unit test has a unique name, unit_report. Normal tests in dbt also has unique names, you just don’t see them that often. But in any case you can run dbt test -s unit_report to run this specific test, or dbt test -s type:unit to run all unit tests.

The seed_customers model has the name and country of customers. The seed_cuntries model contains country names, population and continent. Both of these models are “mocked” by us in the test, including some fictional customers from fictional countries. We only specify one column in the seed_countries table, because we only use the country name in our test. No need to provide data that isn’t needed. Even if the model is modified to include the population of the country, we don’t have to modify our unit test. The extra column it outputs does not fail the test.

As Djikstra said, tests can not prove the absence of error, just their presence. But short of that, unit tests can go very far in showing that critical functionality in a model is accurate. Complex business logic, exception handling for things like duplicates and missing values can be tested even when the database is empty. This also provides a far better opportunity to use test-driven development methodology, a perhaps counterintuitive approach where tests are defined before the code, and the code is considered complete when the tests pass. This ensures that tests are more comprehensive, rather than an afterthought.

An additional bonus is that these unit tests are very quick to run. Regular dbt tests can be heavy to run due to large data volumes, unit tests usually consist of just a few lines and correspondingly take much less time.

A small dbt project with some of my unit test escapades, including the full example above: https://github.com/radbrt/dbt_unittest.