Radbrt
The topic of test data comes up from time to time, and is plagued by the fact that test data can mean many different things. And that these things don’t have names.
A test data topology 🔗I have had the idea of a taxonomy of test data for a while. Like most taxonomies it won’t catch every nuance or edge case, and that is as much a feature as it is a bug.
The hype has subsided now, but you can still see it: the stack fixation. Data teams comparing their data stacks, as if some magical combination of open-source and SaaS tools would solve all their problems. Fortunately, few really believed a SaaS would save the world, but at times it could seem like it. Because tools are easy to talk about. The second easiest thing to talk about is how we shouldn’t talk about tools.
Today I changed visibility on my Metadog repository from private to public, and added an Apache 2 license. You can find it here: https://github.com/radbrt/metadog.
More comprehensive introductions will hopefully follow, but I wanted to announce it and explain what it is and why it exists.
Why Metadog 🔗I made Metadog as part of my job as a data engineer, where I needed to keep track of data across a number of different upstream systems (databases, SFTP servers, blob storage…) as well as in our own databases.
One of the many corners of the (post-)modern data stack I have kept an eye on is observability. I recently revisited it, and while little has changed, much has changed.
At its core, observability is about process monitoring: finding changes, because changes might be errors. Perhaps interestingly, the status quo is rarely suspected of being an error.
Mostly, observability is about finding changes in data. Changes in row counts. Changes in distinct values.
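As a minimal sketch of what such a check might look like in Python, assuming you can pull the table into pandas (the metric names and tolerance are illustrative, not from any particular tool):

```python
import pandas as pd

def snapshot_metrics(df: pd.DataFrame, columns: list[str]) -> dict:
    """Collect the simple metrics observability tools tend to watch."""
    return {
        "row_count": len(df),
        "distinct_counts": {col: df[col].nunique() for col in columns},
    }

def compare_snapshots(previous: dict, current: dict, tolerance: float = 0.1) -> list[str]:
    """Flag metrics whose relative change exceeds the tolerance."""
    alerts = []
    prev_rows, cur_rows = previous["row_count"], current["row_count"]
    if prev_rows and abs(cur_rows - prev_rows) / prev_rows > tolerance:
        alerts.append(f"row_count changed: {prev_rows} -> {cur_rows}")
    for col, prev_n in previous["distinct_counts"].items():
        cur_n = current["distinct_counts"].get(col, 0)
        if prev_n and abs(cur_n - prev_n) / prev_n > tolerance:
            alerts.append(f"distinct values in {col} changed: {prev_n} -> {cur_n}")
    return alerts
```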
I started writing this post a while back, but now that it has stayed half-done for several months I’m posting what I have.
I wrote about CLIP models a while back, but from a high-level “what are they and what can they be used for” perspective. Now I have had the chance to work with CLIP models directly in Python, and they are still impressive.
You can use CLIP models with the transformers library from Hugging Face; there are dedicated CLIPModel and CLIPProcessor classes.
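A minimal sketch of zero-shot image classification, assuming the commonly used openai/clip-vit-base-patch32 checkpoint; the image path and labels are placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# The checkpoint name is an assumption -- any CLIP checkpoint on the Hub works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog"]

# The processor tokenizes the text and preprocesses the image in one call.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```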
In one of my first classes of ECON 101, the lecturer talked about economic models and likened them to maps. For reasons I would understand later, he argued that maps are a miniaturized, simplified version of the landscape. Many might wish for a more detailed map, but make the map detailed enough and you end up with a 1:1 map draped over the landscape. Needless to say, such a map would serve no purpose.
I have written (ranted?) about data products before, in part triggered by David Jayatillake. After his interesting article on credit scores as data products (https://davidsj.substack.com/p/risky-data), I want to structure my thoughts about the data products I have been making for years of my career: Official Statistics.
So similar, so different 🔗Analogies to companies that sell data (credit agencies, ESG data providers, financial data vendors, etc.) have not been used as inspiration, or even as a point of reference, by the data product and data mesh crowd.
Warning: amateur security writeup
IT Security is fairly preoccupied with web application security. Not surprisingly, perhaps, but it leaves an empty space where I would have loved to see content intended for other audiences as well. So I am taking the recent XZ backdoor as an opportunity to think aloud about how data engineers need to think about security.
What is different about data engineering 🔗Web development, by its nature, is about creating systems that answer random requests from the internet.
A little while ago there was a small thread on Mastodon about data products, and David Jayatillake ended up writing a Substack post explaining it: https://davidsj.substack.com/p/what-is-a-data-product. David’s posts usually land somewhere on the spectrum between “interesting” and “not my wheelhouse” for me, but this one seemed a little strange.
This isn’t the first time I have seen “what is a data product” discussions, and two very different answers are tempting.
Running Production ML with Snowflake and dbt 🔗Snowflake runs Python/PySpark now, which is cool. And so, it lets you train models, do predictions, and whatnot. But the world has come a long way since training a model was impressive. Nowadays, training is table stakes; the hard part is serving, tracking, monitoring, and everything else related to deploying and maintaining models in production. Model registries, MLflow, Weights & Biases, model serving via REST APIs… all in a day’s work for the people at Databricks or similar places.
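To make the “train models in Snowflake” part concrete, here is a minimal sketch of a dbt Python model that trains and scores inside the warehouse. The upstream model name, columns, and target are hypothetical placeholders, and this leaves out all the registry, tracking, and serving concerns mentioned above:

```python
# models/train_churn_model.py -- hypothetical dbt Python model
import pandas as pd
from sklearn.linear_model import LogisticRegression

def model(dbt, session):
    # Declare the materialization and the packages Snowflake should make available.
    dbt.config(materialized="table", packages=["scikit-learn", "pandas"])

    # dbt.ref() returns a Snowpark DataFrame; pull it into pandas for training.
    df = dbt.ref("customer_features").to_pandas()  # placeholder upstream model

    features = ["TENURE_MONTHS", "MONTHLY_SPEND"]  # placeholder feature columns
    clf = LogisticRegression().fit(df[features], df["CHURNED"])  # placeholder target

    # Score the same rows and hand the predictions back to Snowflake as a table.
    df["CHURN_PROBABILITY"] = clf.predict_proba(df[features])[:, 1]
    return df[["CUSTOMER_ID", "CHURN_PROBABILITY"]]
```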