A Taxonomy of Test Data

Oct 23, 2024

The topic of test data comes up from time to time, and the discussion is plagued by the fact that “test data” can mean many different things, and that these things don’t have names.

A test data taxonomy

I have had the idea of a taxonomy of test data for a while. Like most taxonomies, it won’t catch every nuance or edge case, and that is as much a feature as it is a bug. The taxonomy I came up with has three dimensions:

Type of data
Frequency of data
Scale of data

Types of data

The most common type of test data is probably “old copy of production data”. I have decided to leave this category out of the taxonomy and not give it a name, but it was a close call. The types of test data I have found are:

Fuzzing data: Loosely adopted from the term fuzzing in computer security, this is data that deliberately does not conform to the target schema: text where there should be numbers, fields that are far too long, invalid dates, etc. The data is designed to test how data ingestion systems handle errors.
Placeholder data: This is data that conforms to the schema, but contains no realistic values. Character columns with “foo” and “bar”, 99 and 100 in the age column, exclusively 1900-01-01 in some date column. The computer works through it with the greatest of ease, leaving only errors that are caused by logic, not by data.
Difficult data: This is a carefully curated set of records designed to contain a number of edge cases and extreme values that are rare but not impossible in production data.
De-identified data: This is a copy of production data, but with identifiers either censored, hashed or otherwise replaced with similar-looking values. Note that this is not the same as anonymized data, and if you have strong privacy requirements this type of test data will not be sufficient.
Anonymized data: Similar to de-identified data, but non-identifying values are also masked or changed to satisfy k-anonymity and l-diversity requirements. Even though the data is based on real production data, it should not be possible to re-identify or infer anything about any person in a dataset that has been properly anonymized like this.
Synthetic data: This is data that has been constructed with the goal of having the same statistical properties as production data. Constructing good synthetic data is usually difficult, especially when it has to be truly realistic. Data analysis can uncover very complex patterns, and it is unrealistic to expect generic statistical processes to reproduce all of them.
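To make the first three types concrete, here is a minimal Python sketch. The three-column schema, the `conforms` helper and all the values are assumptions made up for this example; a real ingestion system would check far more than Python types.

```python
from datetime import date

# Hypothetical schema for illustration: column name -> expected Python type.
SCHEMA = {"name": str, "age": int, "signup_date": date}


def conforms(record, schema=SCHEMA):
    """Return True if the record has exactly the schema's columns and types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[col], typ) for col, typ in schema.items()
    )


# Placeholder data: conforms to the schema, but the values are not realistic.
placeholder = [
    {"name": "foo", "age": 99, "signup_date": date(1900, 1, 1)},
    {"name": "bar", "age": 100, "signup_date": date(1900, 1, 1)},
]

# Fuzzing data: deliberately breaks the schema.
fuzzing = [
    # Text where there should be a number.
    {"name": "foo", "age": "ninety-nine", "signup_date": date(1900, 1, 1)},
    # An invalid date, kept as a raw string because it cannot be parsed.
    {"name": "bar", "age": 100, "signup_date": "2024-02-30"},
    # A far too long field, and a missing column on top of that.
    {"name": "x" * 10_000, "age": 100},
]

# Difficult data: conforms to the schema, with rare but possible values.
difficult = [
    {"name": "O'Neil-Åström", "age": 0, "signup_date": date(2024, 2, 29)},
    {"name": "X", "age": 120, "signup_date": date(1970, 1, 1)},
]
```

Placeholder and difficult records both pass `conforms`; the difference is that the difficult ones are picked to stress the business logic, while every fuzzing record is expected to be rejected.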

Name	Type conformance	GDPR-compliant	Business realistic
Fuzzing data	No	Yes	No
Placeholder data	Yes	Yes	No
Difficult data	Yes	Yes	No
De-identified data	Yes	No	Yes
Anonymized data	Yes	Yes	Yes
Synthetic data	Yes	Yes	Yes
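The difference between the de-identified and anonymized rows can be illustrated with a small Python sketch. Everything in it is an assumption for illustration: the field names, the salt, the truncated salted hash (a toy stand-in for a proper keyed pseudonymization scheme) and the ten-year age bands. It only measures group sizes over chosen quasi-identifiers, which is one ingredient of k-anonymity and not a full anonymization procedure.

```python
import hashlib
from collections import Counter


def deidentify(records, id_field, salt):
    """Replace the direct identifier with a salted SHA-256 digest."""
    out = []
    for rec in records:
        raw = (salt + str(rec[id_field])).encode("utf-8")
        out.append({**rec, id_field: hashlib.sha256(raw).hexdigest()[:12]})
    return out


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values.

    The records are k-anonymous over these columns if the result is >= k.
    """
    groups = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return min(groups.values())


def generalize_age(rec, width=10):
    """Coarsen the exact age into a band such as '30-39'."""
    low = rec["age"] // width * width
    return {**rec, "age": f"{low}-{low + width - 1}"}


# Made-up records, not real data.
records = [
    {"customer_id": "A1", "age": 34, "zip": "0150"},
    {"customer_id": "B2", "age": 37, "zip": "0150"},
    {"customer_id": "C3", "age": 52, "zip": "0150"},
    {"customer_id": "D4", "age": 58, "zip": "0150"},
]

deidentified = deidentify(records, "customer_id", salt="demo-salt")
# Every (age, zip) combination is still unique, so each person stands alone.
k_before = smallest_group(deidentified, ["age", "zip"])  # 1

generalized = [generalize_age(r) for r in deidentified]
# After banding the ages, every combination is shared by two records.
k_after = smallest_group(generalized, ["age", "zip"])  # 2
```

Hashing the identifier alone leaves each record unique on age and zip code, which is why de-identified data gets a “No” in the GDPR column while data that has also been generalized can move towards a “Yes”.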

Frequency

The options here are quite simple:

Static
Changing

Most test data is static. Even when test data changes, it is rarely in the form of regular data ingest. For a lot of development and testing purposes, static data is exactly what you want: understanding a static system is hard enough, and changing data adds another moving part. But sometimes it is necessary to run a pipeline on new data in order to test that it works properly, and in these cases a stream of test data can make development and testing a lot easier.
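One way to get changing test data without losing reproducibility is a seeded generator that emits a new batch on every call. The sketch below is only an illustration under assumed names: `stream_batches`, the record fields and the hourly interval are all hypothetical.

```python
import random
from datetime import datetime, timedelta
from itertools import islice


def stream_batches(start, batch_size=3, interval=timedelta(hours=1), seed=42):
    """Yield an endless sequence of batches with increasing ids and timestamps."""
    rng = random.Random(seed)  # seeded, so every run produces the same stream
    batch_time = start
    next_id = 1
    while True:
        yield [
            {"id": next_id + i, "amount": rng.randint(1, 500), "ingested_at": batch_time}
            for i in range(batch_size)
        ]
        next_id += batch_size
        batch_time += interval


# Take the first three "ingests" of the stream.
batches = list(islice(stream_batches(datetime(2024, 1, 1)), 3))
```

Because the generator is seeded, the stream behaves like changing data from the pipeline's point of view while still being repeatable from one test run to the next.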

Scale of data

Minimal
True scale

When developing, small datasets reduce run times and significantly increase iteration speed. But putting a pipeline into a high-volume data environment without having tested its performance is a gamble at best.

No one size fits all

Some of these categories are opposites, others can in principle be combined. You can have synthetic data that is also difficult data, but you can’t have fuzzing data that is also realistic in a business sense. And you can’t have data that is both big and small.

Hopefully, names and definitions help teams talk and think about what they want and need. Just saying “test data” is not enough.