Introducing Metadog

Sep 28, 2024

Today I changed the visibility of my Metadog repository from private to public, and added an Apache 2.0 license. You can find it here: https://github.com/radbrt/metadog.

More comprehensive introductions will hopefully follow, but I wanted to introduce it now and explain what it is and why I made it.

Why Metadog

I made Metadog as part of my job as a data engineer, where I needed to keep track of data on a number of different upstream systems (databases, SFTP servers, blob storage…) as well as in our own databases. Instead of just monitoring for changes in data that has already been loaded, I wanted to monitor data before I loaded it.

The design

I wanted Metadog to be low-maintenance with a small security footprint. Therefore, there is no web server and no hosted dashboard included. There is a data model, and there is data. Feel free to build your own dashboard or plug in whatever alerting framework you want on top of it. The framework is open, and the data is open. Some assembly required - by design.

The architecture is inspired by Meltano:

- The config is a simple YAML file
- Passwords etc. can be read from env variables
- It is a CLI tool
- It is designed for extensibility
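
The combination of a YAML config and secrets from the environment can be sketched roughly as follows. This is not Metadog's actual config format or loader: the keys (`sources`, `uri`), the `${VAR}` placeholder syntax, and the `resolve_env` helper are all assumptions made for illustration, and a plain dict stands in for the parsed YAML file.

```python
import os
from string import Template


def resolve_env(value, env=None):
    """Recursively replace ${VAR} placeholders with environment values."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # substitute() raises KeyError when a variable is missing, which is
        # safer than silently connecting with an empty password.
        return Template(value).substitute(env)
    if isinstance(value, dict):
        return {key: resolve_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, env) for item in value]
    return value


# Hypothetical config, as it might look after parsing a YAML file.
config = {
    "sources": [
        {
            "name": "warehouse",
            "uri": "postgresql://scanner:${DB_PASSWORD}@db.example.com/analytics",
        },
    ]
}

# An explicit env dict keeps the example reproducible; omit it to use os.environ.
resolved = resolve_env(config, env={"DB_PASSWORD": "s3cret"})
print(resolved["sources"][0]["uri"])
```

The point of the pattern is that the config file can be committed to version control while the secrets stay in the environment.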

How to use

Install metadog

You can pip-install metadog with

pip install git+https://github.com/radbrt/metadog.git

This will make the metadog command available on your command line.

Initialize a new project

The next step is to initialize a new metadog project with metadog init <new-folder-name>. This will create a new folder containing an initial metadog configuration file.

Update the configuration file with your data sources.

Run a scan

Run metadog scan to scan the data sources you have configured. You can optionally add a --select option to scan only a subset of your sources.

By default, scan results are stored in a sqlite database in the project folder. Open it with your favorite SQL client to inspect the results.
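
If you would rather poke at the results from Python than from a SQL client, a minimal sketch could look like this. It makes no assumptions about Metadog's data model: it only lists whatever tables exist in the file, along with their row counts. The file name `metadog.db` is a placeholder, so check your project folder for the real one.

```python
import sqlite3


def list_tables(db_path):
    """Return a dict mapping each table name in a SQLite file to its row count."""
    con = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Quote identifiers so unusual table names don't break the query.
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()


if __name__ == "__main__":
    # Placeholder file name; note that sqlite3 creates an empty file if it is missing.
    for table, rows in list_tables("metadog.db").items():
        print(f"{table}: {rows} rows")
```

From there, ordinary SELECT statements against the tables you find are enough to start building alerts or a dashboard.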

Change the metadog backend

Metadog supports a number of different backends. To use one, set the METADOG_BACKEND_URI env variable to a database connection string.
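
From a script or scheduler, switching the backend could look like the sketch below. Only the variable name `METADOG_BACKEND_URI` and the `metadog scan` command come from this post. The Postgres connection string and the `scan_env` and `run_scan` helpers are placeholders, and the exact URI format Metadog accepts is something to verify in the repository.

```python
import os
import subprocess


def scan_env(backend_uri, base_env=None):
    """Return a copy of the environment with the backend URI set."""
    env = dict(os.environ if base_env is None else base_env)
    env["METADOG_BACKEND_URI"] = backend_uri
    return env


def run_scan(backend_uri):
    """Run `metadog scan` in the current project folder against the given backend."""
    return subprocess.run(["metadog", "scan"], env=scan_env(backend_uri), check=True)


# Placeholder connection string; in practice, read the password from a secret store.
env = scan_env(
    "postgresql://metadog:secret@localhost:5432/metadog",
    base_env={"PATH": "/usr/bin"},
)
print(env["METADOG_BACKEND_URI"])
```

Calling `run_scan(...)` from inside a project folder would then run the scan with that backend, without changing the environment of the calling process.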

The future

I honestly don’t do as much ingest as I once did, and daily ingest tasks from unstructured sources are difficult to simulate from a civilian perspective. But there are some things I want to add:

More/better algorithms for outlier/anomaly detection
Ability to define custom rules in YAML
Ability to parse more file formats, primarily JSON
Better support for passwordless authentication

I have also changed my mind a bit, and I would like to be able to generate a simple HTML report/website that can be statically hosted.