Security for Data Engineers

Apr 1, 2024 · 1310 words · 7 minute read

Warning: amateur security writeup

IT security writing is fairly preoccupied with web application security. That is perhaps not surprising, but it leaves a gap where I would have loved to see content intended for other audiences as well. So I am taking the recent XZ backdoor as an opportunity to think aloud about how data engineers should think about security.

What is different about data engineering

Web development, by its nature, is about creating systems that answer arbitrary requests from the internet. Stand up a server, and let anyone in the world visit it. In the process, clients will make all kinds of requests: mostly inert, but sometimes malicious. This is where we see SQL injection (SQLi), cross-site scripting (XSS), etc.

Data engineering, on the other hand, does not usually involve creating new websites for the world. So what should we think about when we think about security?

A non-comprehensive list:

Make sure your S3 buckets are not accidentally open to the world
System users need strong passwords
Rotate credentials, because every once in a while passwords wander off
Think critically about what access your users have
Update software regularly
Monitor logs
Keep tabs on your upstream dependencies
Know your security exposure

That last point is ambiguous, and it is the one I want to expand on.
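Before expanding on it, a quick aside on the first item, since it is the easiest one to check mechanically. Below is a minimal Python sketch that flags statements in an S3 bucket policy that grant access to everyone. The function names are my own invention, and the check is only a heuristic: it looks at the policy document alone and says nothing about ACLs or the account-level Block Public Access settings.

```python
import json


def _as_list(value):
    """Policy fields may hold a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def public_statements(policy_json: str) -> list:
    """Return the Allow statements that grant access to any principal.

    Heuristic only: statements with a Condition are skipped because the
    condition may narrow the grant, so review those by hand.
    """
    policy = json.loads(policy_json)
    flagged = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        wildcard = principal == "*" or (
            isinstance(principal, dict) and "*" in _as_list(principal.get("AWS"))
        )
        if wildcard and "Condition" not in stmt:
            flagged.append(stmt)
    return flagged


if __name__ == "__main__":
    # Hypothetical policy for a made-up bucket name.
    example = json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
        }],
    })
    for stmt in public_statements(example):
        print("Open to the world:", stmt["Action"], "on", stmt["Resource"])
```

With boto3, the policy text can be fetched with get_bucket_policy(Bucket=...)["Policy"]. I left that part out so the sketch runs without credentials.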

What is security exposure?

As data engineers, we are fortunate in that most of our systems are not internet-facing. In fact, most of our stuff isn’t even listening for connections. While it is possible to attack a system that isn’t listening for connections, it is much harder.

So while you might be patting yourself on the back for not having created a website from scratch, you are not out of the woods. Some exposures that might be present in data engineering systems:

Your Airflow server. Even though Airflow comes with authentication built in and smarter people than you wrote the software, it is nevertheless a server listening for connections. We’ll get back to that one.
Your container registry. There is a good chance you use Docker, have a container registry at one of the cloud providers, and upload images to it regularly. Is the container registry reachable from all IPs in the world? How do you authenticate to it?
Storage accounts: This is largely similar to container registries. Most storage accounts have multiple ways to authenticate (account keys, SAS tokens, etc.), so an attacker can either be on the lookout for an accidentally leaked authentication token, or simply try to brute-force the authentication.
Databases: While databases aren’t websites, they still listen for logins. On cloud databases like Snowflake, the risk is primarily leaked or easily guessed credentials. If you run your own Postgres database, on the other hand, there is an actual Linux server sitting underneath, listening for connections. Hence, there are at least two attack surfaces that you are responsible for: the database and the server it runs on.
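A cheap way to get a first picture of this kind of exposure is to try connecting to your own endpoints from a machine outside your network. Here is a minimal sketch of that idea. The hostnames are made up, and is_port_open and open_endpoints are names I picked for illustration.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


def open_endpoints(endpoints, timeout: float = 2.0):
    """Filter a list of (host, port) pairs down to the ones that accept connections."""
    return [(h, p) for h, p in endpoints if is_port_open(h, p, timeout)]


if __name__ == "__main__":
    # Hypothetical hosts; replace with your own Airflow, Postgres and ssh endpoints.
    candidates = [
        ("airflow.example.com", 8080),
        ("db.example.com", 5432),
        ("vm.example.com", 22),
    ]
    for host, port in open_endpoints(candidates):
        print(f"{host}:{port} accepts connections from here")
```

The result only means something when it is run from where an attacker would sit, so run it from outside the VNet, and only against machines you own.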

The XZ backdoor

An in-depth writeup, for the interested: https://news.risky.biz/risky-biz-news-supply-chain-attack-in-linuxland/

Unless you are quite minimalist, you probably have at least one Linux VM spinning around somewhere. And that VM might be vulnerable to the XZ backdoor.

There is a lot of writing about the XZ backdoor in security circles, and much of it is akin to a bunch of engineers fawning over the engine in an F1 car. But we don’t care about the elegance of the engine design. We care about what the car can do.

The XZ backdoor is a so-called Remote Code Execution (RCE) vulnerability. The person who made the backdoor can run code on a machine that has the backdoored version of the software by trying to log in over ssh with their private key and passing some commands along. Think about that for a second: somewhere, someone has a key that lets them run code on any Linux machine that exposes ssh and has installed a recent version (5.6.0 or 5.6.1) of a quite popular Linux program. Note that this is one of those Linux programs few people even know exist. I had never heard of it before, but I have it on my Mac. Fortunately not the vulnerable version, though. Try it yourself by running xz --version.
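If you have more than a handful of machines, the version check is worth scripting. A minimal sketch, assuming the usual two-line output of xz --version; the function names are mine. The affected releases are 5.6.0 and 5.6.1. Keep in mind that the version number is only part of the picture: whether a given machine was actually exploitable also depends on how the distribution built and linked sshd, which this sketch does not look at.

```python
import re
import shutil
import subprocess

# Upstream releases that shipped the backdoor (CVE-2024-3094).
BACKDOORED = {"5.6.0", "5.6.1"}


def parse_versions(output: str) -> dict:
    """Extract {'xz': version, 'liblzma': version} from `xz --version` output."""
    pattern = r"^(xz|liblzma)\b.*?(\d+\.\d+\.\d+)"
    return dict(re.findall(pattern, output, flags=re.MULTILINE))


def is_backdoored(output: str) -> bool:
    """True if either reported version is one of the backdoored releases."""
    return any(v in BACKDOORED for v in parse_versions(output).values())


def check_local_xz():
    """Return True/False for the local xz, or None if xz is not installed."""
    if shutil.which("xz") is None:
        return None
    result = subprocess.run(
        ["xz", "--version"], capture_output=True, text=True, check=True
    )
    return is_backdoored(result.stdout)


if __name__ == "__main__":
    status = check_local_xz()
    if status is None:
        print("xz is not installed")
    elif status:
        print("xz reports a backdoored version: upgrade or downgrade now")
    else:
        print("xz version is not one of the known backdoored releases")
```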

Back in your day job, you might have set up a Linux VM with public key authentication and been so happy with the strong authentication that you let port 22 (ssh) be exposed to the world. You need to log in, after all, and the security department did away with the old VPN system a while back, saying it was outdated and gave a false sense of security. And your own ISP changes your IP at random times, which is annoying. So you left port 22 open to the world, stored the private key in a safe place, and went for lunch.

Now though, that machine might be vulnerable. Over the past month, someone might already have executed code on it without you knowing. Your security department might even ask you for an audit, which basically means going through your logs and confirming that no surreptitious login attempt was made. And they will definitely want you to upgrade or downgrade so that you don’t have the vulnerable version anymore.
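For the log part of such an audit, a first pass could look like the sketch below. It assumes the standard OpenSSH "Accepted ... for USER from IP port N" log line and a list of networks you expect logins from; unexpected_logins and the example addresses are my own placeholders. I would not treat a clean result as proof of anything, since an attacker who can run code may never show up as a normal login, but it is a reasonable place to start.

```python
import ipaddress
import re

# Matches successful sshd logins, e.g.
# "... sshd[123]: Accepted publickey for alice from 10.0.0.5 port 51234 ssh2: ..."
ACCEPTED = re.compile(
    r"Accepted (?P<method>\S+) for (?P<user>\S+) from (?P<ip>\S+) port \d+"
)


def unexpected_logins(lines, allowed_networks):
    """Return (user, ip, method) for accepted logins from outside allowed_networks."""
    networks = [ipaddress.ip_network(n) for n in allowed_networks]
    findings = []
    for line in lines:
        match = ACCEPTED.search(line)
        if not match:
            continue
        ip = ipaddress.ip_address(match["ip"])
        if not any(ip in net for net in networks):
            findings.append((match["user"], match["ip"], match["method"]))
    return findings


if __name__ == "__main__":
    # Hypothetical log excerpt; on a real VM, read the sshd log file instead.
    sample = [
        "Mar 29 10:12:01 vm sshd[1234]: Accepted publickey for alice from 10.0.0.5 port 51234 ssh2: ED25519 SHA256:abc",
        "Mar 29 11:00:00 vm sshd[1300]: Accepted publickey for root from 203.0.113.7 port 40000 ssh2: RSA SHA256:def",
    ]
    for user, ip, method in unexpected_logins(sample, ["10.0.0.0/8"]):
        print(f"Unexpected {method} login for {user} from {ip}")
```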

Wider lessons

Fred Brooks, the author of “The Mythical Man-Month”, dreamt of a world where code was widely shared and reused. We live in that world now. We also live in a world with a lot of internet, and that is, in a way, the drawback.

Most of our code has large upstream dependencies. We use giant libraries, and vulnerabilities will be found from time to time. But when vulnerabilities are found in servers, they are easier to exploit, and we need to be able to remediate them on short notice. A popular rule of thumb is that if you are not able to patch a server within 24 hours, you shouldn’t run the server. Note that “patch a server” is almost a metaphor by now: most often we are talking about repinning some requirements in a repo somewhere and deploying the VM again.
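To make the repinning step a bit more concrete, here is a small sketch that compares the exact pins in a requirements file against a list of known-bad versions. The advisory data and the package name examplelib are invented for the example, and the parser only understands simple name==version lines. For real use, a dedicated tool such as pip-audit is a better choice.

```python
def parse_pins(requirements_text: str) -> dict:
    """Collect exact pins (name==version) from requirements.txt-style text."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if "==" not in line:
            continue  # ranges and unpinned requirements are ignored here
        name, version = line.split("==", 1)
        version = version.split(";", 1)[0].strip()  # drop environment markers
        pins[name.strip().lower()] = version
    return pins


def vulnerable_pins(requirements_text: str, advisories: dict) -> dict:
    """Return {name: version} for pins that appear in the advisory mapping."""
    pins = parse_pins(requirements_text)
    return {
        name: version
        for name, version in pins.items()
        if version in advisories.get(name, set())
    }


if __name__ == "__main__":
    # Hypothetical requirements and advisory data.
    requirements = "requests==2.31.0\nexamplelib==1.2.3  # pinned last year\n"
    advisories = {"examplelib": {"1.2.3"}}
    for name, version in vulnerable_pins(requirements, advisories).items():
        print(f"Repin {name}: {version} has a known vulnerability")
```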

Even wider lessons

This particular attack is eye-catching in many ways:

The main component of it was social engineering: gaining the trust of a maintainer of one of the many esoteric but important open-source components of modern computers.
It took two years to execute, which means there was almost certainly a state-level actor involved. Criminals don’t have that kind of patience.
It was narrowly targeted: only someone holding the attacker’s private key could exploit it.
The code itself was impressive.

Unless you work at a major, nationally critical institution (government, large bank, electrical utility etc), whoever was behind the attack probably couldn’t be bothered to attack you.

The attack once again highlights supply chain vulnerabilities. Over the course of a few days, we all learned there is a utility named xz that is very central and maintained by a single person. Just like curl was, just like openssl was, just like… the next such vulnerability will turn out to be.

In the coming months, we are likely to hear a lot about Software Bill of Materials (SBOM), an initiative to account for all the upstream dependencies of a particular piece of software. If we are lucky, we will also see efforts to identify more critical libraries maintained by a single person, in order to increase security.
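To get a feel for what an SBOM contains, you can list what is installed in a single Python environment. The sketch below is nowhere near a real SBOM in a standard format such as CycloneDX or SPDX, and python_inventory is just a name I made up. It also illustrates the problem nicely: it only sees Python packages, so a system library like liblzma would not show up at all.

```python
from importlib.metadata import distributions


def python_inventory() -> dict:
    """Return {package name: version} for the current Python environment."""
    inventory = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            inventory[name.lower()] = dist.version
    return dict(sorted(inventory.items()))


if __name__ == "__main__":
    for name, version in python_inventory().items():
        print(f"{name}=={version}")
```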

In the meantime, back in your day job, it is safe to assume that the software you run has unknown security vulnerabilities. Since you don’t know about them you can’t fix them, but what you can do is make sure all your machines have as little access as possible. No internet access except for approved domains. No access to other machines in the VNet unless a machine needs it. Use managed offerings such as Amazon RDS rather than standing up your own Postgres database. And avoid “jack of all trades” VMs that do a plethora of very different things.
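The "approved domains" rule is easy to state and easy to get subtly wrong, so here is a small sketch of the matching logic. The real enforcement belongs in a firewall, security group or proxy. Something like this is only useful for auditing, for example checking destinations pulled from flow logs. The domain list and the function names are assumptions for illustration.

```python
def is_approved(hostname: str, approved_domains) -> bool:
    """True if hostname equals an approved domain or is a subdomain of one."""
    host = hostname.lower().rstrip(".")
    for domain in approved_domains:
        domain = domain.lower().rstrip(".")
        # Require a dot boundary so "evilpypi.org" does not match "pypi.org".
        if host == domain or host.endswith("." + domain):
            return True
    return False


def unapproved(destinations, approved_domains):
    """Return the destinations that fall outside the approved list."""
    return [d for d in destinations if not is_approved(d, approved_domains)]


if __name__ == "__main__":
    # Hypothetical allowlist and observed destinations.
    allowlist = ["pypi.org", "files.pythonhosted.org"]
    seen = ["pypi.org", "files.pythonhosted.org", "evilpypi.org", "pastebin.example"]
    for host in unapproved(seen, allowlist):
        print("Unexpected outbound destination:", host)
```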