Thinking About PoCs

Dec 4, 2024 · 982 words · 5 minute read

Yes, that PoC: https://www.reddit.com/r/dataengineering/comments/1h2t8op/dbt_poc_in_our_company_ended_in_a_disaster/

Reddit headline: DBT POC in our company ended in disaster, security breaches and immediate forced uninstall

Go read it if you haven’t. The comments too, although they are pretty much all saying the same thing.

In brief, the post describes the following:

Analytics team pushes through a dbt PoC
Security team and everyone else sit back and watch
Analytics team delivers some dashboards in record time
People discover the numbers don’t match
Analytics team discovers the entire codebase is spaghetti
People discover there is basically no access control on the new tables

And the comments basically go:

Don’t PoC with sensitive data
Don’t make a mess

Good advice. I want to expand on both points. And maybe add a few.

The knowledge gap 🔗

One of the promises of dbt is to expand the pool of people who can create and maintain dashboards etc. It isn’t an empty promise. I have seen this happen, and it is awesome. One-man-show business processes living in Excel can turn into an orderly process with code, data in a database, code review, regular updates, etc. And be exported to Excel, of course. But it is a process improvement.

Onboarding business people/analysts to dbt requires teaching them SQL, dbt, the command line, git, pull requests, CI, and, unless you are on dbt Cloud, a basic knowledge of Python environments. We can teach all that, no problem.

But those are just the technical requirements needed to make a model - any model - land in production. Security and not making a mess aren’t part of it.

Teaching not to make a mess 🔗

Someone who has worked in data for a while will hopefully have a bit of an intuition for what needs to be common models, how to create models that others want to use, and know what a pain it is to maintain duplicated logic. Analysts who are jumping into SQL don’t. They can either learn because we teach them, or they can learn because they get burnt.

I’m all in favor of teaching SQL hygiene, but sometimes experience is the best teacher. There will be messes. You need to figure out how to deal with them. All worthwhile tools allow users to shoot themselves in the foot. I will happily hand people this particular shotgun.

There is one lingering question though: Even the most efficient analyst needs time to make a mess. Why did the PoC last that long? And if business people got upset by wrong numbers in the dashboard, the PoC was in production. As PoCs are wont to do.

Teaching security 🔗

I have lamented the lack of relevant security training before, but in this case I’d argue don’t teach security. This is not a shotgun I want to give to anyone. Don’t leave security as something others can fuck up. But how can we create a system where dbt developers can’t fuck up access? We probably can’t in all cases, but we can come close.

Snowflake has the unfortunate feature that if you create something you own it, and if you own it you can grant privileges on it. But Snowflake also has a redeeming feature: you can create tables in databases you don’t own. Analysts can grant select on the tables they created, but those grants won’t do much, because the analysts can’t grant usage on the database where the tables live. So make sure analysts don’t own databases, and that they don’t have accountadmin.

The second issue is to create sensible role hierarchies and boundaries. Databases are good boundaries. The easiest solution is probably to say 1 team = 1 access policy = 1 database. This isn’t always workable, but it is a great start. So the sales team has access to the sales database. Analytics engineers have write, while users have read. To the entire database. And nothing else. In short, make sure analytics engineers in the sales team can’t read from databases that aren’t available to users in the sales team.
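To make the "1 team = 1 database" idea a bit more concrete, here is a minimal sketch that generates the grant statements an admin might run for one team. The role names (`<TEAM>_READ`, `<TEAM>_WRITE`) and the `team_grants` helper are my own assumptions for illustration, not something from the Reddit post or from dbt. The statements follow Snowflake's grant syntax as I understand it, so check them against the documentation before running anything.

```python
import re


def team_grants(team: str) -> list[str]:
    """Return Snowflake statements for one team = one database.

    Hypothetical naming convention: database SALES, roles SALES_READ
    and SALES_WRITE. These statements are meant to be run by an admin
    role that owns the database, never by the analysts themselves.
    """
    # Only allow plain identifiers so the name can't smuggle in extra SQL.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", team):
        raise ValueError(f"not a plain identifier: {team!r}")

    db = team.upper()
    read_role = f"{db}_READ"
    write_role = f"{db}_WRITE"

    return [
        f"CREATE DATABASE IF NOT EXISTS {db};",
        f"CREATE ROLE IF NOT EXISTS {read_role};",
        f"CREATE ROLE IF NOT EXISTS {write_role};",
        # Readers get the whole database, and nothing else.
        f"GRANT USAGE ON DATABASE {db} TO ROLE {read_role};",
        f"GRANT USAGE ON FUTURE SCHEMAS IN DATABASE {db} TO ROLE {read_role};",
        f"GRANT SELECT ON FUTURE TABLES IN DATABASE {db} TO ROLE {read_role};",
        f"GRANT SELECT ON FUTURE VIEWS IN DATABASE {db} TO ROLE {read_role};",
        # Writers can create schemas (and the objects in them) but do not
        # own the database, so they cannot hand out usage on it.
        f"GRANT CREATE SCHEMA ON DATABASE {db} TO ROLE {write_role};",
        # Writers can read everything readers can.
        f"GRANT ROLE {read_role} TO ROLE {write_role};",
    ]


if __name__ == "__main__":
    print("\n".join(team_grants("sales")))
```

The point of generating the statements is that developers never write a grant themselves: access follows from which database a model lands in.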

This breaks down when an analytics engineer works on two teams. Although the two teams use different roles, Snowflake lets users use secondary roles for reads, and the engineer can read from the HR team database and write to the Sales team database without even a warning.
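Since Snowflake won't warn you, one option is to look for the risky combinations yourself. The sketch below assumes you already have a mapping from users to the roles they hold, and that roles follow a hypothetical `<TEAM>_READ` / `<TEAM>_WRITE` naming convention. It does not talk to Snowflake at all; it only flags users who could read from one team's database while writing to another's.

```python
from __future__ import annotations


def split_role(role: str) -> tuple[str, str] | None:
    """Split 'SALES_WRITE' into ('SALES', 'WRITE'); None if it doesn't fit."""
    team, sep, access = role.rpartition("_")
    if sep and team and access in ("READ", "WRITE"):
        return team, access
    return None


def cross_team_risks(
    user_roles: dict[str, set[str]],
) -> dict[str, list[tuple[str, str]]]:
    """Map each user to (read_team, write_team) pairs that cross a boundary."""
    risks: dict[str, list[tuple[str, str]]] = {}
    for user, roles in user_roles.items():
        readable: set[str] = set()
        writable: set[str] = set()
        for role in roles:
            parsed = split_role(role)
            if parsed is None:
                continue  # role outside the assumed naming convention
            team, access = parsed
            readable.add(team)  # write implies read in this layout
            if access == "WRITE":
                writable.add(team)
        pairs = sorted(
            (src, dst) for src in readable for dst in writable if src != dst
        )
        if pairs:
            risks[user] = pairs
    return risks


if __name__ == "__main__":
    example = {
        "alice": {"HR_READ", "SALES_WRITE"},
        "bob": {"SALES_READ", "SALES_WRITE"},
    }
    print(cross_team_risks(example))
```

With the example input, only alice is flagged: she can read HR data and write it into the Sales database. What you do about it is a policy question. Snowflake users have a `DEFAULT_SECONDARY_ROLES` property that may be worth reviewing for the flagged users.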

If you do it right, it isn’t a PoC 🔗

Some of the advice above is hard-earned, and it would be unreasonable to expect someone starting out with a PoC to know all this. Especially when the security team isn’t providing input. Doing a PoC the right way would be time-consuming and require a lot of investment before you could even evaluate dbt.

The other option is a version of the one that got Snowflake in hot water this summer: YOLO-create a Snowflake account, but only load publicly available data so that the leaked data, when it comes, doesn’t really matter. Your account will still be on a list of compromised accounts, but the data for sale on the dark web will just be a copy of the NYC Payroll dataset from Kaggle.

The issue with these types of PoCs is that they are boring. It takes a lot of imagination to go from 2-3 random tables that hopefully share a join key to a dashboard anyone in leadership can get excited about.

Conclusion 🔗

One might imagine a different post-mortem after the fiasco:

If we are to go forward, we would need to design an access framework that removes the question of access from developers.
If we are to go forward, the new developers need training in common data warehouse design principles.
The PoC lasted too long, and entered production without us noticing.

Someone has surely made most of the same mistakes but kept a cooler head when doing the post-mortem, and I’m pretty sure there have been similar PoCs where this has been the conclusion. PoCs are great for learning.

In the end, I guess there are 3 types of PoCs:

Dull
Difficult
Dangerous

Pick your poison, I guess.