
Radbrt

Data Is an Abstraction

In one of my first classes of ECON 101, the lecturer talked about economic models and likened them to maps. For reasons I would understand later, he argued that maps are a miniaturized, simplified version of the landscape. Many might wish for a more detailed map, but as the map gets more detailed, you eventually end up with a 1:1 map to drape over the landscape. Needless to say, such a map would serve no purpose.

Data Products at Statistics Norway

I have written (ranted?) about data products before, in part triggered by David Jayatillake. After an interesting article on credit scores as data products (https://davidsj.substack.com/p/risky-data), I want to structure my thoughts about the data products I have been making for years of my career: Official Statistics. So similar, so different: Analogies to companies that sell data (credit agencies, ESG data providers, financial data, etc.) have not been used as inspiration or even a point of reference for the data product and data mesh crowd.

Security for Data Engineers

Warning: amateur security writeup. IT security is fairly preoccupied with web application security. Not surprisingly, perhaps, but it leaves an empty space where I would have loved to see content intended for other audiences as well. So I am taking the recent XZ backdoor as an opportunity to think aloud about how data engineers need to think about security. What is different about data engineering: Web development, by its nature, is about creating systems that answer random requests from the internet.

Data Products Once Again

A little while ago there was a small thread on Mastodon about data products, and David Jayatillake ended up writing a Substack post explaining it: https://davidsj.substack.com/p/what-is-a-data-product. David’s posts usually land somewhere in the spectrum between “interesting” and “not my wheelhouse” with me, but this one seemed a little strange. This isn’t the first time I have seen “what is a data product” discussions, and two very different answers are tempting.

Production ML with Snowflake and dbt

Snowflake runs Python/pyspark now, which is cool. And so, it lets you train models and do predictions and whatnot. But the world has come a long way since training a model was impressive. Nowadays, training is table stakes; the real work is serving, tracking, monitoring and everything else related to deploying and maintaining models in production. Model registries, MLflow, Weights & Biases, model serving via REST APIs… all in a day’s work for the people at Databricks or similar places.

Random Things 2024-02-25

Another non-comprehensive list of things I have read and/or thought about since last time: Data Is Plural: A weekly newsletter with links to datasets. I have been down the professional ETL rabbit hole for a while now, and the thought of a dataset just existing as it is, without being some steady stream of new data shifting and changing, is a relief. Sometimes, data is just data. https://www.data-is-plural.com/ The new Zed editor is promising, but right now it is just a text editor with a built-in chatbot.

Random Things 2024-02-17

A non-comprehensive list of things I have read and/or thought about lately: Artificial intelligence and privacy: Daniel Solove is one of the foremost scholars on privacy. His 2008 book “Understanding Privacy” is timeless, and he now has a prepublication article on the newest AI trends and privacy: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4713111. It is an early draft, it contains more questions than answers, and it writes some checks his references can’t cash. So don’t take it as gospel.

Prefect & Coiled

If you process significant amounts of data, but think Spark is a little messy, you have probably tried out Dask. And if you want to scale out Dask across multiple nodes, Coiled is your friend. Coiled is an on-demand Dask cluster running in your own infrastructure, and best of all, it is designed to be invoked from your IDE, lifting the computation from your laptop to the cloud only when you want and running on your laptop otherwise.

dbt Unit Tests

dbt has recently introduced unit tests in addition to their regular tests. According to dbt Labs’ release plan, unit tests are to be launched with dbt version 1.8, which is scheduled for release this spring. In the meantime, you can check out the main branch of dbt-core and run unit tests with Postgres. Tests in dbt are run against data in the database, which means that the database becomes a dependency for the tests.

Testing Singer REST taps

I probably need to deal with this subject not only separately, but in installments. To set the scene, my tap-prefect comes with a test suite from the SDK, which yields 5 warnings, 94 errors in 5.38s. The singer_sdk test suite that comes bundled is… convoluted. And by default it seems to run without trying to authenticate. I tried to go along with the test suite, but without success. And after I started suspecting there wasn’t a good way to actually test the output, I decided to go my own way.
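To make "testing the output" a bit more concrete, here is a minimal sketch of what such a check could look like. It is not necessarily the approach the post ends up with. It only assumes the general Singer message format (JSON lines with `SCHEMA`, `RECORD` and `STATE` types); the `check_singer_output` helper, the `flows` stream name and the sample lines are all made up for illustration.

```python
import json
from collections import Counter


def check_singer_output(lines):
    """Validate a sequence of Singer messages and count records per stream.

    Raises ValueError if a RECORD arrives before the SCHEMA for its stream,
    or if a record lacks a property its schema lists as required.
    """
    schemas = {}
    record_counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines in captured output
        message = json.loads(raw)
        kind = message.get("type")
        if kind == "SCHEMA":
            schemas[message["stream"]] = message["schema"]
        elif kind == "RECORD":
            stream = message["stream"]
            if stream not in schemas:
                raise ValueError(f"RECORD for {stream!r} before its SCHEMA")
            required = schemas[stream].get("required", [])
            missing = [key for key in required if key not in message["record"]]
            if missing:
                raise ValueError(f"{stream!r} record is missing {missing}")
            record_counts[stream] += 1
        # STATE and any other message types are ignored in this sketch.
    return dict(record_counts)


# Hypothetical captured tap output; "flows" is an invented stream name.
sample_output = [
    '{"type": "SCHEMA", "stream": "flows", "schema": {"type": "object", '
    '"required": ["id"], "properties": {"id": {"type": "string"}}}, '
    '"key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "flows", "record": {"id": "a1"}}',
    '{"type": "RECORD", "stream": "flows", "record": {"id": "b2"}}',
    '{"type": "STATE", "value": {"bookmarks": {}}}',
]

print(check_singer_output(sample_output))  # {'flows': 2}
```

In practice the lines would come from the tap's stdout, captured to a file or a pipe, rather than from a hard-coded list. The point of a check like this is that it only looks at what the tap emits, so it does not depend on the bundled test suite at all.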