Radbrt

Random Things 2024-02-17

A non-comprehensive list of things I have read and/or thought about lately. Artificial intelligence and privacy: Daniel Solove is one of the foremost scholars on privacy. His 2008 book “Understanding Privacy” is timeless, and he now has a prepublication article on the newest AI trends and privacy: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4713111. It is an early draft, it contains more questions than answers, and it writes some checks his references can’t cash. So don’t take it as gospel.

Prefect & Coiled

If you process significant amounts of data, but think Spark is a little messy, you have probably tried out Dask. And if you want to scale out Dask across multiple nodes, Coiled is your friend. Coiled is an on-demand Dask cluster running in your own infrastructure, and best of all, it is designed to be invoked from your IDE, lifting the computation from your laptop to the cloud only when you want it to and running on your laptop otherwise.

Dbt Unit Tests

dbt has recently introduced unit tests in addition to its regular tests. According to dbt Labs’ release plan, unit tests are to be launched with dbt version 1.8, which is scheduled for release this spring. In the meantime, you can check out the main branch of dbt-core and run unit tests with Postgres. Tests in dbt are run against data in the database, which means that the database becomes a dependency for the tests.

Testing Singer REST taps

I probably need to deal with this subject not only separately, but in installments. To set the scene, my tap-prefect comes with a test suite from the SDK, which yields 5 warnings, 94 errors in 5.38s. The singer_sdk test suite that comes bundled is… convoluted. And by default it seems to run without trying to authenticate. I tried to go along with the test suite, but without success. And after I started suspecting there wasn’t a good way to actually test the output, I decided to go my own way.

Improved testing of Singer taps and targets

By now, I maintain quite a few Singer taps and targets, created with Meltano’s singer_sdk library. For those unfamiliar with Singer, it is a framework for moving data from a source system to a target system via a standardized communication protocol. The singer_sdk library contains a bunch of sweet abstractions that make this a lot easier. Some of the things I maintain: target-oracle for writing to Oracle databases, target-mssql for writing to SQL Server databases, tap-prefect for reading from the Prefect REST API, and tap-pxwebapi for reading statistics from Statistics Norway/Sweden/Finland via a REST API.
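
To make the "standardized communication protocol" a bit more concrete: a tap emits one JSON message per line, and the three basic message types are SCHEMA, RECORD and STATE. The sketch below writes such messages by hand, without the singer_sdk library. The stream name `flow_runs`, its fields, and the contents of the STATE value are made up for illustration, and a real tap would write to stdout rather than to an in-memory buffer.

```python
import io
import json


def write_message(stream_out, message):
    """Serialize one Singer message as a single line of JSON."""
    stream_out.write(json.dumps(message) + "\n")


def emit_flow_runs(stream_out, rows):
    """Emit a SCHEMA message, one RECORD per row, and a closing STATE message."""
    # SCHEMA describes the records that follow on this stream (hypothetical fields).
    write_message(stream_out, {
        "type": "SCHEMA",
        "stream": "flow_runs",
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "string"},
                "state": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    })
    for row in rows:
        write_message(stream_out, {"type": "RECORD", "stream": "flow_runs", "record": row})
    # STATE carries a bookmark; its contents are up to the tap (made up here).
    write_message(stream_out, {"type": "STATE", "value": {"rows_sent": len(rows)}})


# A real tap would pass sys.stdout; a buffer keeps the example self-contained.
buffer = io.StringIO()
emit_flow_runs(buffer, [{"id": "a1", "state": "COMPLETED"}, {"id": "b2", "state": "FAILED"}])
messages = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

A target reads the same lines from stdin, which is why a tap and a target written by different people can be piped together.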

A eulogy for Meltano Cloud

The beginning and the end
The days of the modern data stack were waning. Interest rates were soaring. And the appetite for Yet Another SaaS was plummeting among both companies and investors. Meltano Cloud entered public Beta behind everyone else, and behind their own schedule. And it disappeared before anyone else. The Meltano team is now working on Arch, a new adventure for similar but different use cases. Perhaps Meltano Cloud was too late to market.

tap-pxwebapi

A Singer Tap for Official Statistics
In the world of data engineering, Singer is a popular standard, with tools like Airbyte and Meltano providing a flexible framework for data loading. One source that is often overlooked for data loading, however, is official statistics. Different statistical offices around the world have different APIs (and in some cases no API at all), but one place to start is PxWeb. PxWeb is a common thread connecting Norway, Sweden, and Finland in the realm of official statistics.

GPT reads plots. Kind of.

For whatever whimsical reason, as I read the financial paper, I got the idea to take a picture of one of the plots and ask ChatGPT to extract the data. My naive expectation was that the image processing function would just wing it and give me a few “eyeballed” observations from the plot. Not so. Instead of eyeballing it, it created close to 100 lines of Python code that read the image, did contour analysis, and combined it with some observations from the image such as the axes (min/max on both axes).
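
This is not the code ChatGPT wrote, but the axis part of the idea can be sketched in a few lines: once you know which pixel positions correspond to the min and max of each axis, every detected point can be mapped from pixel coordinates to data values by linear interpolation. All pixel positions and axis ranges below are made up, and the sketch assumes linear (not logarithmic) axes.

```python
def pixel_to_data(pixel, pixel_min, pixel_max, data_min, data_max):
    """Linearly map a pixel coordinate to a data value along one axis.

    pixel_min/pixel_max are the pixel positions where the axis shows
    data_min/data_max. For the y axis, pixel_min is usually the larger
    number, since image rows count downwards from the top.
    """
    fraction = (pixel - pixel_min) / (pixel_max - pixel_min)
    return data_min + fraction * (data_max - data_min)


# Hypothetical calibration: the x axis runs from 2015 (column 50) to 2021 (column 350),
# the y axis from 0 (row 400, bottom of the plot) to 60 (row 100, top of the plot).
points_px = [(50, 400), (200, 250), (350, 100)]  # made-up (column, row) pairs from a contour

points = [
    (pixel_to_data(x, 50, 350, 2015, 2021), pixel_to_data(y, 400, 100, 0, 60))
    for x, y in points_px
]
```

The hard part in practice is finding the curve and the axis ticks in the image; the conversion itself is only this.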

Clip Similarity Search

I came across a cool post by Drew Breunig about finding bathroom faucets with the CLIP model: https://www.dbreunig.com/2023/09/26/faucet-finder.html. Multi-modal embedding models let you embed both text and images in the same embedding space, enabling search across both images and text. Although multimodal embedding models still seem to be mostly a blank slate, there is at least one multimodal embedding model available: OpenAI’s CLIP. Simon Willison, who makes the llm CLI tool, has also made a plugin for the CLIP model, so taking the CLIP model out for a spin is really simple.
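
What "the same embedding space" buys you can be shown with a toy example: if a text query and a set of images are all represented as vectors, search is just ranking the images by cosine similarity to the query. The three-dimensional vectors and file names below are made up by hand; real CLIP embeddings come from the model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(query, images):
    """Return image names sorted from most to least similar to the query vector."""
    scored = [(cosine_similarity(query, vector), name) for name, vector in images.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Made-up stand-ins for image embeddings.
image_embeddings = {
    "brass_faucet.jpg": [0.9, 0.1, 0.2],
    "chrome_faucet.jpg": [0.2, 0.9, 0.1],
    "towel_rack.jpg": [0.1, 0.2, 0.9],
}
# Made-up stand-in for the embedding of a text query such as "brass faucet".
query_embedding = [0.8, 0.2, 0.1]

ranking = rank_images(query_embedding, image_embeddings)
```

Because the query vector could just as well have come from an image, the same ranking function covers both text-to-image and image-to-image search.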

24 hours of Surface Pro

24 hours with a Surface Pro
Some months ago I severely cracked the screen of my iPad 11" (2018). It is still usable, and I have wanted a new one, but at the same time I didn’t want to just get another iPad. So yesterday I got a Microsoft Surface Pro 8, in the hope that it could cover my iPad use and 90% of my laptop use, and reduce the number of times I have to drag my laptop around.