Data Products at Statistics Norway

I have written (ranted?) about data products before. In part triggered by David Jayatillake's interesting article on credit scores as data products (https://davidsj.substack.com/p/risky-data), I want to structure my thoughts about the data products I have been making for years of my career: Official Statistics. So similar, so different: Analogies to companies that sell data (credit agencies, ESG data providers, financial data, etc.) have not been used as inspiration or even a point of reference for the data product and data mesh crowd.

Security for Data Engineers

Warning: amateur security writeup. IT security is fairly preoccupied with web application security. Not surprisingly, perhaps, but it leaves an empty space where I would have loved to see content intended for other audiences as well. So I am taking the recent XZ backdoor as an opportunity to think aloud about how data engineers need to think about security. What is different about data engineering: Web development, by its nature, is about creating systems that answer random requests from the internet.

Data Products Once Again

A little while ago there was a small thread on Mastodon about data products, and David Jayatillake ended up writing a substack post explaining it: https://davidsj.substack.com/p/what-is-a-data-product. David’s posts usually land somewhere in the spectrum between “interesting” and “not my wheelhouse” with me, but this one seemed a little strange. This isn’t the first time I have seen “what is a data product” discussions, and two very different answers are tempting.

Production ML with Snowflake and dbt

Running Production ML with Snowflake and dbt: Snowflake runs Python/pyspark now, which is cool. And so, it lets you train models and do predictions and whatnot. But the world has come a long way since training a model was impressive. Nowadays, training is table stakes; what matters is serving, tracking, monitoring and everything else related to deploying and maintaining models in production. Model registries, MLflow, Weights & Biases, model serving via REST APIs… all in a day’s work for the people at Databricks or similar places.

Random Things 2024-02-25

Another non-comprehensive list of things I have read and/or thought about since last time: Data Is Plural: A weekly newsletter with links to datasets. I have been down the professional ETL rabbit hole for a while now, and the thought of a dataset just existing as it is, without being some steady stream of new data shifting and changing, is a relief. Sometimes, data is just data. https://www.data-is-plural.com/ The new Zed editor is promising, but right now it is just a text editor with a built-in chatbot.

Random Things 2024-02-17

A non-comprehensive list of things I have read and/or thought about lately: Artificial intelligence and privacy: Daniel Solove is one of the foremost scholars on privacy. His 2008 book “Understanding Privacy” is timeless, and he now has a prepublication article on the newest AI trends and privacy: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4713111. It is an early draft, it contains more questions than answers and it writes some checks his references can’t cash. So don’t take it as gospel.

Prefect & Coiled

If you process significant amounts of data, but think Spark is a little messy, you have probably tried out Dask. And if you want to scale out Dask across multiple nodes, Coiled is your friend. Coiled is an on-demand Dask cluster running in your own infrastructure, and best of all, it is designed to be invoked from your IDE, lifting the computation from your laptop to the cloud only when you want and running on your laptop otherwise.

Dbt Unit Tests

dbt has recently introduced unit tests in addition to their regular tests. According to dbt Labs’ release plan, unit tests are to be launched with dbt version 1.8, which is scheduled for release this spring. In the meantime, you can check out the main branch of dbt-core, and run unit tests with Postgres. Tests in dbt are run against data in the database, which means that the database becomes a dependency for the tests.

Testing Singer REST taps

I probably need to deal with this subject not only separately, but in installments. To set the scene, my tap-prefect comes with a test suite from the SDK, which yields "5 warnings, 94 errors in 5.38s". The bundled singer_sdk test suite is… convoluted. And by default it seems to run without trying to authenticate. I tried to go along with the test suite but without success. And after I started suspecting there wasn’t a good way to actually test the output, I decided to go my own way.

Improved testing of Singer taps and targets

By now, I maintain quite a few Singer taps and targets, created with Meltano’s singer_sdk library. For those unfamiliar with Singer, it is a framework for moving data from a source system to a target system via a standardized communication protocol. The singer_sdk library contains a bunch of sweet abstractions that make this a lot easier. Some of the things I maintain: target-oracle for writing to Oracle databases; target-mssql for writing to SQL Server databases; tap-prefect for reading from the Prefect REST API; and tap-pxwebapi for reading statistics from Statistics Norway/Sweden/Finland via a REST API.
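
For readers who have not seen the Singer protocol mentioned above, here is a minimal sketch of the idea using only the Python standard library, not singer_sdk. A tap writes JSON messages, one per line, of type SCHEMA, RECORD and STATE, and a target reads them. The stream name "users", the sample rows, the bookmark layout inside the STATE value and the helper names tap_output and target_consume are all assumptions made up for illustration; they are not taken from any of the taps or targets listed.

```python
import json


def tap_output(rows):
    """Yield the JSON lines a (hypothetical) tap would write to stdout."""
    # SCHEMA message: describes the stream before any records are sent.
    schema = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
        },
    }
    yield json.dumps({
        "type": "SCHEMA",
        "stream": "users",  # hypothetical stream name
        "schema": schema,
        "key_properties": ["id"],
    })

    # RECORD messages: one per row of data.
    last_id = None
    for row in rows:
        yield json.dumps({"type": "RECORD", "stream": "users", "record": row})
        last_id = row["id"]

    # STATE message: a free-form value the tap can use to resume later.
    # The "bookmarks" layout is a common convention, assumed here.
    yield json.dumps({
        "type": "STATE",
        "value": {"bookmarks": {"users": {"last_id": last_id}}},
    })


def target_consume(lines):
    """Read Singer messages as a (hypothetical) target would from stdin."""
    records, state = [], None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
        # SCHEMA messages are ignored in this sketch; a real target would
        # typically use them to create or validate the destination table.
    return records, state


if __name__ == "__main__":
    sample = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
    for line in tap_output(sample):
        print(line)
```

In real use the tap and target are separate processes connected by a pipe, which is what makes any tap usable with any target.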