Today I changed visibility on my Metadog repository from private to public, and added an Apache 2 license. You can find it here: https://github.com/radbrt/metadog.
More comprehensive introductions are hopefully to come, but I wanted to introduce it and explain what it is and why I built it.
Why Metadog
I made Metadog as part of my job as a data engineer, where I needed to keep track of data on a number of different upstream systems (databases, SFTP servers, blob storage…) as well as in our own databases. Instead of just monitoring for changes in data that has already been loaded, I wanted to monitor data before I loaded it.
The design
I wanted Metadog to be low-maintenance with a small security footprint. So there is no web server and no hosted dashboard included. There is a data model, and there is data. Feel free to build your own dashboard or plug in whatever alerting framework you want on top of it. The framework is open, and the data is open. Some assembly required - by design.
The architecture is inspired by Meltano:
- The config is a simple yaml file
- Passwords etc can be read from env variables
- It is a CLI tool
- It is designed for extensibility
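To make the "passwords from env variables" idea concrete, here is a minimal Python sketch of how placeholders in a config could be resolved from the environment. This is an illustration only: the config keys, the `${VAR}` placeholder syntax, and the `resolve_env` helper are assumptions made for the example, not Metadog's actual schema or code.

```python
import os
from string import Template

# Hypothetical config fragment as it might look after loading a YAML file.
# The keys and the ${VAR} syntax are illustrative, not Metadog's real schema.
RAW_CONFIG = {
    "sources": [
        {
            "name": "warehouse",
            "host": "db.example.com",
            "password": "${WAREHOUSE_PASSWORD}",
        }
    ]
}


def resolve_env(value, env=None):
    """Recursively replace ${VAR} placeholders with values from env."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # substitute() raises KeyError when a referenced variable is missing,
        # so a forgotten secret fails loudly instead of silently.
        return Template(value).substitute(env)
    if isinstance(value, dict):
        return {key: resolve_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, env) for item in value]
    return value
```

The point of the pattern is that the yaml file can be committed to version control while the secrets stay in the environment.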
How to use
Install metadog
You can pip-install metadog with
pip install git+https://github.com/radbrt/metadog.git
This will make the metadog command available on your command line.
Initialize a new project
The next step is to initialize a new metadog project with metadog init <new-folder-name>. This will create a new folder containing an initial metadog configuration file.
Update the configuration file with your data sources.
Run a scan
Metadog can scan the data sources you set up with metadog scan. You can optionally add a --select option to only scan a subset of your sources.
By default, scan results are stored in a SQLite database in the project folder. Open it with your favorite SQL client and browse the results.
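If you would rather poke at the results from Python than from a SQL client, a sketch like the following lists the tables in a SQLite file along with their row counts. The database file name and the table names are not documented in this post, so the helper takes the path as an argument and discovers the tables itself; `row_counts` is a hypothetical helper, not part of Metadog.

```python
import sqlite3
from contextlib import closing


def row_counts(db_path):
    """Return {table_name: number_of_rows} for every table in a SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        tables = [
            name
            for (name,) in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for table in tables:
            # Table names come from sqlite_master, so quoting is enough here.
            quoted = '"' + table.replace('"', '""') + '"'
            counts[table] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
```

Point it at the SQLite file in your project folder. Note that sqlite3.connect creates an empty database if the path does not exist, so double-check the file name first.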
Change the metadog backend
Metadog can use a number of different backends: set the METADOG_BACKEND_URI env variable to a database connection string.
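As a rough sketch, this is how the variable could be set for a single scan from a Python script. The METADOG_BACKEND_URI name and the metadog scan command come from the text above; the helper names are hypothetical, and the exact connection-string format Metadog accepts is an assumption here (the example in the comment uses a SQLAlchemy-style URL).

```python
import os
import subprocess


def env_with_backend(uri, base_env=None):
    """Return a copy of the environment with METADOG_BACKEND_URI set."""
    env = dict(os.environ if base_env is None else base_env)
    env["METADOG_BACKEND_URI"] = uri
    return env


def run_scan(uri):
    """Run `metadog scan` with the backend overridden for this one call.

    Example URI (assumed format, not confirmed by the Metadog docs):
    postgresql://user:password@localhost:5432/metadog
    """
    return subprocess.run(
        ["metadog", "scan"],
        env=env_with_backend(uri),
        check=True,  # raise CalledProcessError if the scan exits non-zero
    )
```

Passing the variable per call, rather than exporting it globally, keeps the default SQLite backend untouched for other runs.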
The future
I honestly don’t do as much ingest as I once did, and daily ingest tasks from unstructured sources are difficult to simulate from a civilian perspective. But there are some things I want to add:
- More/better algorithms for outlier/anomaly detection
- Ability to define custom rules in YAML
- Ability to parse more file formats, primarily JSON
- Better support for passwordless authentication
I have also changed my mind a bit, and I would like to be able to generate a simple static HTML report/website that can be hosted anywhere.