

Introducing Metadog

Today I changed visibility on my Metadog repository from private to public, and added an Apache 2 license. You can find it here: https://github.com/radbrt/metadog. More comprehensive introductions are hopefully to come, but I wanted to introduce it and explain what and why. Why Metadog: I made Metadog as part of my job as a data engineer, where I needed to keep track of data on a number of different upstream systems (databases, SFTP servers, blob storage…) as well as in our own databases.

Observability 2024

One of the many corners of the (post-)modern data stack I have kept an eye on is observability. I recently revisited it, and while little has changed, much has changed. At its core, observability is about process monitoring. Finding changes, because changes might be errors. Perhaps interestingly, the status quo is rarely suspected to be an error. Mostly, observability is about finding changes in data. Changes in row counts. Changes in distinct values.
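To make the idea of "finding changes" concrete, here is a minimal sketch in plain Python. It assumes a table is available as a list of dicts, and the function names `profile` and `changes` as well as the 10% tolerance are made up for illustration; they do not come from any particular observability tool.

```python
def profile(rows):
    """Summarise a table (a list of dicts) as row count and distinct values per column."""
    columns = {key for row in rows for key in row}
    return {
        "row_count": len(rows),
        "distinct": {col: len({row.get(col) for row in rows}) for col in sorted(columns)},
    }


def changes(previous, current, tolerance=0.1):
    """List metrics whose relative change between two profiles exceeds the tolerance."""
    flagged = []

    def check(name, old, new):
        if old == new:
            return
        # A metric that appears from zero always counts as a change.
        if old == 0 or abs(new - old) / old > tolerance:
            flagged.append((name, old, new))

    check("row_count", previous["row_count"], current["row_count"])
    for col in sorted(set(previous["distinct"]) | set(current["distinct"])):
        check(
            f"distinct:{col}",
            previous["distinct"].get(col, 0),
            current["distinct"].get(col, 0),
        )
    return flagged


# Hypothetical snapshots of the same table on two consecutive days.
yesterday = [{"id": 1, "country": "NO"}, {"id": 2, "country": "SE"}]
today = [{"id": 1, "country": "NO"}, {"id": 2, "country": "NO"}, {"id": 3, "country": "NO"}]

for metric, old, new in changes(profile(yesterday), profile(today)):
    print(f"{metric}: {old} -> {new}")
```

Note that an unchanged profile produces no output at all, which is exactly the blind spot mentioned above: the status quo is never flagged.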

More About Clip Models

I started writing this post a while back, but now that it has stayed half-done for several months I’m posting what I have. I wrote about CLIP models a while back, but from a high-level “what are they and what can they be used for” perspective. Now I have had the chance to work more with CLIP models directly in Python, and they are still impressive. You can use CLIP models with the transformers library from Hugging Face; there are special CLIPModel and CLIPProcessor classes:
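The original snippet is not part of this excerpt, so what follows is only a minimal sketch of how the two classes are typically used for zero-shot image classification. The checkpoint name `openai/clip-vit-base-patch32`, the helper names `best_label` and `classify_image`, and the example file name are assumptions for illustration, not necessarily what the post uses.

```python
def best_label(probs, labels):
    """Return the (label, probability) pair with the highest probability."""
    idx = max(range(len(labels)), key=lambda i: probs[i])
    return labels[idx], probs[idx]


def classify_image(image_path, labels, checkpoint="openai/clip-vit-base-patch32"):
    """Score one image against candidate text labels with a CLIP model."""
    # Heavy imports are kept local so best_label can be used without them.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    image = Image.open(image_path)
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (n_images, n_labels); softmax over the labels
    # turns the similarity scores into probabilities.
    probs = outputs.logits_per_image.softmax(dim=1)[0].tolist()
    return best_label(probs, labels)
```

Calling `classify_image("dog.jpg", ["a photo of a cat", "a photo of a dog"])` with a local image file (the file name here is hypothetical) downloads the checkpoint on first use and returns the best-matching label with its probability.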

Px Files

This is amazingly nerdy and you should probably go find something better to do with your life. But I got challenged to write a PX file the other day. PX files are used by the PXWeb application, which is some now fairly old software for making statistical data available. So the statistical agencies in Norway, Sweden, Finland and Estonia (and maybe others) have to create these files somehow. But because there isn’t a wide install base, resources are scarce.

Data Is an Abstraction

In one of my first classes of ECON 101, the lecturer talked about economic models and likened them to maps. For reasons I would understand later, he argued that maps are a miniaturized, simplified version of the landscape. Many might wish for a more detailed map, but as the map gets more detailed you would end up with a 1:1 map to drape over the landscape. Needless to say, such a map would serve no purpose.

Data Products at Statistics Norway

I have written (ranted?) about data products before, in part triggered by David Jayatillake. After an interesting article on credit scores as data products (https://davidsj.substack.com/p/risky-data), I want to structure my thoughts about the data products I have been making for years of my career: official statistics. So similar, so different: Analogies to companies that sell data (credit agencies, ESG data providers, financial data etc.) have not been used as inspiration or even a point of reference for the data product and data mesh crowd.

Security for Data Engineers

Warning: amateur security writeup. IT security is fairly preoccupied with web application security. Not surprisingly, perhaps, but it leaves an empty space where I would have loved to see content intended for other audiences as well. So I am taking the recent XZ backdoor as an opportunity to think aloud about how data engineers need to think about security. What is different about data engineering: Web development, by its nature, is about creating systems that answer random requests from the internet.

Data Products Once Again

A little while ago there was a small thread on Mastodon about data products, and David Jayatillake ended up writing a substack post explaining it: https://davidsj.substack.com/p/what-is-a-data-product. David’s posts usually land somewhere in the spectrum between “interesting” and “not my wheelhouse” with me, but this one seemed a little strange. This isn’t the first time I have seen “what is a data product” discussions, and two very different answers are tempting.

Production ML with Snowflake and dbt

Snowflake runs Python/pyspark now, which is cool. And so, it lets you train models and do predictions and whatnot. But the world has come a long way since training a model was impressive. Nowadays, training is table stakes, and what matters is serving, tracking, monitoring and everything else related to deploying and maintaining models in production. Model registries, MLflow, Weights & Biases, model serving via REST APIs… all in a day’s work for the people at Databricks or similar places.

Random Things 2024-02-25

Another non-comprehensive list of things I have read and/or thought about since last time: Data Is Plural: A weekly newsletter with links to datasets. I have been down the professional ETL rabbit hole for a while now, and the thought of a dataset just existing as it is, without being some steady stream of new data shifting and changing, is a relief. Sometimes, data is just data. https://www.data-is-plural.com/ The new Zed editor is promising, but right now it is just a text editor with a built-in chatbot.