
Radbrt
I have written (ranted?) about data products before, in part triggered by David Jayatillake and his interesting article on credit scores as data products: https://davidsj.substack.com/p/risky-data. I want to structure my thoughts about the data products I have been making for years of my career: Official Statistics.
So similar, so different

Analogies to companies that sell data (credit agencies, ESG data providers, financial data vendors, etc.) have not been used as inspiration, or even as a point of reference, by the data product and data mesh crowd.
Warning: amateur security writeup
IT security writing is fairly preoccupied with web application security. Not surprising, perhaps, but it leaves an empty space where I would have loved to see content intended for other audiences as well. So I am taking the recent XZ backdoor as an opportunity to think aloud about how data engineers need to think about security.
What is different about data engineering

Web development, by its nature, is about creating systems that answer random requests from the internet.
A little while ago there was a small thread on Mastodon about data products, and David Jayatillake ended up writing a Substack post explaining it: https://davidsj.substack.com/p/what-is-a-data-product. David's posts usually land somewhere on the spectrum between "interesting" and "not my wheelhouse" with me, but this one seemed a little strange.
This isn't the first time I have seen "what is a data product" discussions, and two very different answers are tempting.
Running Production ML with Snowflake and dbt

Snowflake runs Python/PySpark now, which is cool. And so, it lets you train models and do predictions and whatnot. But the world has come a long way since training a model was impressive. Nowadays, training is table stakes; the hard part is serving, tracking, monitoring, and everything else related to deploying and maintaining models in production. Model registries, MLflow, Weights & Biases, model serving via REST APIs… all in a day's work for the people at Databricks or similar places.
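As a minimal sketch of what the training part can look like, here is model training pushed into a Snowpark Python stored procedure. The connection details, table, stage, and column names are all made-up placeholders:

```python
from snowflake.snowpark import Session

# Placeholder connection details; in practice these come from config/secrets.
session = Session.builder.configs({
    "account": "my_account", "user": "my_user", "password": "...",
    "warehouse": "my_wh", "database": "my_db", "schema": "my_schema",
}).create()


def train_model(session: Session) -> str:
    """Train a simple scikit-learn model on a Snowflake table and stage the artifact."""
    import joblib
    from sklearn.linear_model import LogisticRegression

    df = session.table("training_data").to_pandas()  # hypothetical table
    model = LogisticRegression().fit(df[["feature_1", "feature_2"]], df["label"])

    # Persist the model file to an internal stage so it can be reused for inference.
    joblib.dump(model, "/tmp/model.joblib")
    session.file.put("/tmp/model.joblib", "@model_stage", overwrite=True)
    return "trained"


# Register the function as a stored procedure and run it inside Snowflake.
session.sproc.register(
    func=train_model,
    name="train_model",
    packages=["snowflake-snowpark-python", "scikit-learn", "joblib"],
    replace=True,
)
session.call("train_model")
```

That covers training; everything after it (registries, monitoring, serving) is exactly the part the paragraph above is worried about.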
Another non-comprehensive list of things I have read and/or thought about since last time:
Data Is Plural: A weekly newsletter with links to datasets. I have been down the professional ETL rabbit hole for a while now, and the thought of a dataset just existing as it is, without being some steady stream of new data shifting and changing, is a relief. Sometimes, data is just data. https://www.data-is-plural.com/
The new Zed editor is promising, but right now it is just a text editor with a built-in chatbot.
A non-comprehensive list of things I have read and/or thought about lately:
Artificial intelligence and privacy: Daniel Solove is one of the foremost scholars on privacy. His 2008 book "Understanding Privacy" is timeless, and he now has a prepublication article on the newest AI trends and privacy: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4713111. It is an early draft, it contains more questions than answers, and it writes some checks its references can't cash. So don't take it as gospel.
If you process significant amounts of data but think Spark is a little messy, you have probably tried out Dask. And if you want to scale Dask out across multiple nodes, Coiled is your friend. Coiled spins up on-demand Dask clusters in your own infrastructure, and best of all, it is designed to be invoked from your IDE, lifting the computation from your laptop to the cloud only when you want it and running on your laptop otherwise.
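A minimal sketch of that workflow, assuming a Coiled account and cloud credentials are already configured (the cluster name, worker count, and bucket path are invented):

```python
import coiled
import dask.dataframe as dd
from dask.distributed import Client

# Spin up an on-demand Dask cluster in your own cloud account,
# then attach a client so subsequent Dask work runs remotely.
cluster = coiled.Cluster(name="scratch-cluster", n_workers=10)
client = Client(cluster)

# From here, ordinary Dask code executes on the cluster, not the laptop.
df = dd.read_parquet("s3://my-bucket/events/*.parquet")  # hypothetical path
print(df.groupby("user_id").size().compute())

cluster.close()
```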
dbt has recently introduced unit tests in addition to its regular tests. According to dbt Labs' release plan, unit tests will launch with dbt version 1.8, which is scheduled for release this spring. In the meantime, you can check out the main branch of dbt-core and run unit tests against Postgres.
Tests in dbt are run against data in the database, which means that the database becomes a dependency for the tests.
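For a sense of the shape: per the pre-release docs, a unit test is defined in YAML with mocked input rows and expected output rows, so it runs against fixtures rather than production data. The model and column names here are invented:

```yaml
unit_tests:
  - name: test_email_validation
    model: dim_customers
    given:
      - input: ref('stg_customers')
        rows:
          - {customer_id: 1, email: "someone@example.com"}
          - {customer_id: 2, email: "not-an-email"}
    expect:
      rows:
        - {customer_id: 1, is_valid_email: true}
        - {customer_id: 2, is_valid_email: false}
```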
I probably need to deal with this subject not only separately, but in installments.
To set the scene, my tap-prefect comes with a test suite from the SDK, which yields:

5 warnings, 94 errors in 5.38s

The bundled singer_sdk test suite is… convoluted. And it seems that, by default, it runs without even trying to authenticate.
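For reference, wiring up the SDK's bundled suite typically looks something like this; the config values are placeholders:

```python
from singer_sdk.testing import get_tap_test_class

from tap_prefect.tap import TapPrefect  # module path as in a typical SDK scaffold

SAMPLE_CONFIG = {
    "api_key": "placeholder",  # real runs need valid credentials
    "start_date": "2024-01-01T00:00:00Z",
}

# get_tap_test_class generates a pytest test class that runs the SDK's
# standard battery of tests against the tap.
TestTapPrefect = get_tap_test_class(tap_class=TapPrefect, config=SAMPLE_CONFIG)
```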
I tried to go along with the test suite, but without success. And after I started suspecting there wasn't a good way to actually test the output, I decided to go my own way.
By now, I maintain quite a few Singer taps and targets, created with Meltano's singer_sdk library. For those unfamiliar with Singer, it is a framework for moving data from a source system to a target system via a standardized communication protocol. The singer_sdk library contains a bunch of sweet abstractions that make this a lot easier; a minimal tap sketch follows the list below.
Some of the things I maintain:
- target-oracle, for writing to Oracle databases
- target-mssql, for writing to SQL Server databases
- tap-prefect, for reading from the Prefect REST API
- tap-pxwebapi, for reading statistics from Statistics Norway/Sweden/Finland via a REST API
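As for the promised sketch: a singer_sdk tap boils down to stream classes plus a tap class that discovers them. This is a toy version loosely in the shape of tap-prefect; the endpoint, fields, and config are invented:

```python
from singer_sdk import Tap
from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class FlowRunsStream(RESTStream):
    """Toy stream reading flow runs from a REST API."""

    name = "flow_runs"
    path = "/flow_runs"          # invented endpoint
    url_base = "https://api.example.com"  # placeholder
    primary_keys = ["id"]
    replication_key = "updated"  # enables incremental extraction

    # The SDK turns this typed schema into the Singer SCHEMA message.
    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("name", th.StringType),
        th.Property("updated", th.DateTimeType),
    ).to_dict()


class TapPrefect(Tap):
    """Toy tap class; the SDK handles config, catalog, and state."""

    name = "tap-prefect"
    config_jsonschema = th.PropertiesList(
        th.Property("api_key", th.StringType, required=True),
    ).to_dict()

    def discover_streams(self):
        return [FlowRunsStream(tap=self)]


if __name__ == "__main__":
    TapPrefect.cli()
```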