It's 2030, and I am about to take part in a celebration of one of engineering's marvels: a transatlantic trip on a supersonic airplane, Overture. The newly built aircraft can cover the distance from London to New York in an impressive 3.5 hours.
It's time to hit the runway.
The palms of my hands get a bit sweaty knowing I will soon be travelling at 1,800 km/h (that's a kilometer every two seconds 🤯). At the same time, I rationalize that I am about to engage in one of the safest ways to travel.
This is not by accident.
The commercial airline industry is well-known for its rigorous quality testing process. The stakes are very high, so a new airplane undergoes thousands of hours of test flights, where each possible scenario is thoroughly tested. Meticulous check procedures are applied before every flight, and periodic maintenance is done on each airplane at least every few hundred flights.
In the data analytics world, things are very different. Lives are not in danger, but many businesses struggle to tame their increasingly complex data and rely on their data teams to solve that challenge. Perhaps it’s time to learn a few lessons from airline engineers, who can build highly reliable systems much more complicated than today’s data analytics stack. They also know they’d stand little chance of succeeding without thorough testing.
There is no engineering without testing
A quick visit to the Engineering entry on Wikipedia turns up a well-articulated paragraph under the Methodology section:
“Engineers use their knowledge of science, mathematics, logic, economics, and appropriate experience or tacit knowledge to find suitable solutions to a particular problem.”
This sentence applies nicely to any kind of engineering, from mechanical to chemical, construction, or software. An additional point quickly follows:
“Engineers take on the responsibility of producing designs that will perform as well as expected and not cause unintended harm to the public. Engineers typically include a safety factor in their designs to reduce the risk of unexpected failure.”
The beginning of the sentence is particularly well articulated—Engineers take on the responsibility … to make sure that what we create and provide to others works reliably.
When I manage software engineering teams, I often say that:
Any user will appreciate an application that works over an application with more features that does not.
It’s a reminder of how crucial it is to keep a culture of high quality.
State of testing in the data world
It was refreshing to see dbt Core ship with tests a couple of years ago. It was a clear statement: testing is part of building an analytics system.
But we have a lot of work to do.
Many analytics teams write at least basic tests, mostly checking for not null, uniqueness, or a set of accepted values. But we don't build a reliable system by testing only that values are not_null and unique. We can and should do more. dbt-expectations, for example, ships with 59 test types covering aspects of data distributions, freshness, volumes, and schema, letting us go a lot deeper into testing our data.
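To make that concrete, here is a minimal sketch of what this looks like in a dbt schema file. The model and column names (orders, order_id, order_total, status) and the accepted values are made up for illustration; the tests combine dbt's built-in generic tests with one from the dbt-expectations package.

```yaml
# models/schema.yml (model and column names are hypothetical)
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          # dbt's built-in generic tests
          - not_null
          - unique
      - name: order_total
        tests:
          # a dbt-expectations test: order totals should never be negative
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
      - name: status
        tests:
          # built-in accepted_values test: only these statuses are expected
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```

Running dbt test compiles each of these into a query that returns the rows violating the expectation, which makes failures easy to inspect.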
But I can hear you thinking: “I have 500+ models. That would be a lot of tests, and they could get prohibitively expensive to run. How am I supposed to do that?”
The key is to test where it matters.
What to test?
After speaking with many data practitioners, it's obvious there is a pattern of teams investing in more tests. It's certainly an investment worth making, but while at it, it's essential to formulate a strategy that considers the nuanced trade-off between value (a safer system) and cost (the time to write and run these tests).
As with most strategies, before going into the details it's worth establishing a few principles to decide by. Here are the ones I use.
——
I think of (data) systems as the following equation:
output = logic(inputs)
In other words, the system's output is logic applied to some inputs. The logic is entirely in your control: you made it. Inputs are external. They come from other systems that you often don't control. Those systems have their own quality standards, which may be higher but also lower than yours. This highlights the first key area of testing: inputs.
Input changes that break the assumptions behind your system's logic are a prevalent source of failures. But being external doesn't mean you can't stay in control. If you expect a data field to be a number, to be bigger than 0, to match other tables, to continually grow, to have a particular distribution, to be no older than one day, or to have values in a defined finite set, then test for it. dbt-expectations has tools for all of these scenarios.
Test that system inputs match the assumptions you made when you implemented your logic. Detecting the cases where those assumptions no longer hold will save your logic from failing.
This suggests that tests should be concentrated as far upstream as possible, helping you protect the boundary between external systems and your own.
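Here is a sketch of what that upstream protection can look like on a dbt source. The source, table, and column names are hypothetical; the checks mirror the assumptions above: non-negative amounts, a finite set of currencies, a minimum volume, and data no older than one day.

```yaml
# models/staging/payments.yml (source, table, and column names are hypothetical)
version: 2

sources:
  - name: payments_app
    tables:
      - name: transactions
        tests:
          # volume check: the table should stay above an expected size
          - dbt_expectations.expect_table_row_count_to_be_between:
              min_value: 10000
        columns:
          - name: amount
            tests:
              # the field is a number and never negative
              - dbt_expectations.expect_column_values_to_be_between:
                  min_value: 0
          - name: currency
            tests:
              # values in a defined finite set
              - accepted_values:
                  values: ['USD', 'EUR', 'GBP']
          - name: created_at
            tests:
              # freshness: data should be no older than one day
              - dbt_expectations.expect_row_values_to_have_recent_data:
                  datepart: day
                  interval: 1
```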
When testing your logic (and, therefore, outputs), you can apply other important lenses: Where is the risk? What broken output will have a significant impact on your business?
While your data system could have hundreds of models, they are not all equally important. A typical business has between a handful and a few dozen vital assets that power its most critical reporting or operational use cases. These assets (and their inputs) should be thoroughly tested.
Identify a handful of your most critical models and write tests that verify the key characteristics of their data.
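As an illustration, a hypothetical revenue mart could carry a noticeably deeper suite than the average model, checking volume, referential integrity, and the shape of its key metric rather than just nulls. All names and bounds below are assumptions, not real figures.

```yaml
# models/marts/fct_revenue.yml (model, column names, and bounds are hypothetical)
version: 2

models:
  - name: fct_revenue
    tests:
      # guard against a silent drop in volume
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1000
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
          # every order in the mart must exist in the upstream orders model
          - relationships:
              to: ref('orders')
              field: order_id
      - name: revenue_usd
        tests:
          - not_null
          # sanity bounds on individual values
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
          # a coarse distribution check against an assumed historical range
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 10
              max_value: 500
```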
Additional considerations that have helped guide my test strategies for various systems are:
Test more what changes frequently—changes are the most common cause of failures, where an incoming change breaks some implicit assumption about how something works. It's beneficial to codify those assumptions as tests.
Test more what is hard to debug—there are often areas of a system you really don't want to debug, like complex reconciliation logic that requires calculations across many fields.
Test more on assets with a large blast radius—to ensure you detect errors in areas that impact your business.
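One lightweight way to encode these priorities in dbt, sketched below with assumed project and folder names, is to tag the high-blast-radius models and give their tests a faster cadence than the rest.

```yaml
# dbt_project.yml excerpt (project, folder, and tag names are hypothetical)
models:
  my_project:
    marts:
      finance:
        # mark everything in the finance mart as high blast radius
        +tags: ['critical']
```

A scheduled job can then run something like dbt test --select tag:critical after every refresh, while the long tail of lower-priority tests runs less frequently.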
Besides strengthening your system, your tests have another great function: they are documentation, describing what guarantees others can expect from the data. They also help every other engineer evaluate their changes with the added safety of knowing they will not break your data. This becomes essential as your team scales, since no one can remember every nuance of a growing data landscape.
Make it actionable
While ownership is a topic that certainly deserves its own post, it's critical to mention it here, in the context of testing:
Tests need to be actionable. It is paramount that you explicitly state how test failures will be handled.
This doesn't mean you need expensive SLO management software. You can start simple. Have a conversation with your team and get answers to the following questions:
Who will be responsible for acting on each type of failure?
How fast do they need to act?
What is the process for fixing failing tests, how will you coordinate, and how will you communicate with the rest of the business?
Start simple and, over time, add a more nuanced approach with explicit owners, expectations for different test severities, or a more thought-through triage and communication process. You can get there step by step; the key is to have the conversation and at least a basic alignment on how you commit to act.
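dbt offers a simple starting point for encoding those expectations: each test can be configured with a severity that maps to how urgently someone needs to act. The model, column, and bounds below are again hypothetical.

```yaml
# schema file excerpt (model and column names are hypothetical)
version: 2

models:
  - name: fct_revenue
    columns:
      - name: revenue_usd
        tests:
          # a broken key metric should fail the job and demand immediate action
          - not_null:
              config:
                severity: error
          # a drifting value range is worth a ticket, not a late-night page
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              config:
                severity: warn
```

Severity only helps once the team has agreed on who responds to each level and how quickly, which is exactly the conversation above.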
——
This is, of course, quite far from the airline engineering testing process, and rightfully so: data analytics systems are (typically) not deployed in life-threatening situations. However, I still hope we can learn a lesson from airline engineers.
Failure is always just a matter of time, and without rigorous testing, it will go undetected and harm your system.
A well-structured testing strategy that goes beyond basic not_null/unique tests is a pillar of data quality. You can start with a handful of models and build your test suite up from there, avoiding a big-bang release of generic tests on all your models, which could quickly trigger alert fatigue.
Well-written tests can be used to understand the data, test changes before deployments, and test your system at runtime to detect inputs drifting away from your expectations. They help enhance the quality management across your analytics system's entire life cycle.
It will be an investment well worth making, assuming that improving the quality of your data has a clear link to the success of the teams that use it.
We now have the right tools, but we need to use them more.