How to Stop Worrying and Love Medallion

1596 words · 8 minute read

The inherent uselessness of Medallion

Let’s get this out of the way first: I don’t like Medallion.

I can formulate three reasons why:

  • It implies something about data quality. That data gets refined and quality somehow objectively improves on its way from Bronze to Gold, due to the magic power of IT people who have no idea what they are doing.
  • It is inherently meaningless. Bronze/Silver/Gold are not pre-existing concepts in the data domain. Data is not metal, there are not even attempts at creating analogies other than the data quality fiction. I suspect Medallion architecture is popular exactly because of its foundational meaninglessness. It can be anything you want, and years have been spent debating what it actually means and nobody seems to agree. This careful interpretation of ancient texts (usually random, abandoned Medium articles) is practically elevated to a field of study - an endless time-sink for no apparent purpose.
  • It is admitting you have no clue what your data is or how it is used. Any specific domain will be able to come up with a better architecture more suited to their purpose. Are you doing data de-ingestion? Maybe that is a useful layer. Are you serving a lot of data to a wide audience using Streamlit apps? OK, maybe a layer. Are your users defining their own tables, kind of a scratchpad or sandbox? Cool, might be a good layer for your use case. But this requires knowledge of your users and what they need. IT people don’t like their users. They like other IT people and meaningless arguments about the platonic idea of “Silver”.

But this meaninglessness of the architecture is also why I want to ignore it. To quote Exodus (or Fight Club): Let that which doesn’t matter, truly slide. I got a job to do. Nobody pays us for made-up religious fights.

Data architectures in general

Medallion is one of many data architectures. The idea of data architectures in general is to define “data states”. From Falling Leftwards:

Data states are the labels that are attached to a given data object at some point throughout its lifecycle. These labels provide information on what can be expected from the data, both in terms of characteristics and use.

This definition states that data has a lifecycle, which probably makes sense to most people - although this lifecycle might vary. It also talks about expectations of the data, characteristics, and use. These are broad terms - to ask for the characteristics of a dataset is only slightly less absurd than to ask for the characteristics of your child. It is not only a hopelessly open question; the answer also depends on your perspective - your relation.

Since lifecycle, characteristics and use are all impossible questions to answer in general, it seems the answer that won was the one that didn’t answer any of them.

Who owns an architecture?

I have only heard this question alluded to once. But if the IT department bought Medallion, it stands to reason that the IT department owns it. While certain business users may be consulted, IT people are the ones who decide the definitions and categorize datasets. But are they also the primary audience?

It has been argued that not only is architecture an IT thing, but perhaps users shouldn’t even be exposed to data architectures and Medallion terms like Bronze/Silver/Gold at all - that these terms are by and for IT.

Making it real

The question of ownership and audience is also vital in the further operationalization of Medallion.

Bronze/Silver/Gold categories aren’t worth much if they can’t be written down somewhere, and they are frequently used as schema or database names. But if architectures - and Medallion - are by and for the IT department and shouldn’t be surfaced to users (in any meaningful way at least), we will have to find somewhere other than database and schema names to put them. Options that won’t be meaningfully visible to users include using object tags in the database, folders in dbt projects, or simply documenting it on Confluence.

The latter option allows the Bronze/Silver/Gold data quality references to exist within the scope of the IT department’s job.

What needs does Medallion meet?

Other than the source of its popularity - the ability to mean anything you want it to mean - Medallion architecture does have some features:

  • Medallion has a fairly crisp definition of Bronze, so after you have decided what area(s) of the data platform should contain Bronze, you have a place to put data that lands.
  • The gold layer is the home for anything that serves BI tools - typically star schemas. So if you have star schemas, you probably know where to put them.
  • Most other stuff probably goes in the silver layer.

Coping mechanisms

With all that out of the way, how do you best handle the Medallion paradigm and just get on with your day, once the direction has been set? This question is not technical. It is emotional hygiene.

Your inner Babelfish

One partial solution is to follow David Jayatillake’s suggestion and just mentally map the Bronze/Silver/Gold terminology to “landing/staging/mart” or “raw/staging/mart”. Nobody is deluding themselves about data quality, and the names actually have some inherent meaning.

This is a good start, and almost free other than the slight cognitive overhead of constantly having to translate the terms.
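As a minimal sketch of that mental mapping, here is the translation written down as a lookup table. The function name `translate` is my own invention for illustration, and the mapping uses the “landing/staging/mart” variant; swap in “raw” if that is the one you prefer.

```python
import re

# Hypothetical lookup table: Medallion terms mapped to names with inherent meaning.
MEDALLION_TO_PLAIN = {
    "bronze": "landing",
    "silver": "staging",
    "gold": "mart",
}

# Match the three layer names as whole words, ignoring case.
_LAYER_PATTERN = re.compile(r"\b(bronze|silver|gold)\b", re.IGNORECASE)


def translate(text: str) -> str:
    """Replace Medallion layer names in free text with their plain equivalents."""
    return _LAYER_PATTERN.sub(
        lambda match: MEDALLION_TO_PLAIN[match.group(1).lower()], text
    )


print(translate("Move the orders table from Bronze to Silver"))
# prints: Move the orders table from landing to staging
```

The point is only that the mapping is one-to-one and mechanical, which is why it costs little more than the habit of applying it.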

What it doesn’t solve:

  • You still might get dragged into endless discussions about what table should live where based on the metaphysical meaning of multiplying a column by 1.5.
  • You are still deprived of anything useful.

Treat your job as a patient

Caring about your job is a good thing, but you probably shouldn’t develop a parasocial relationship - not even an adversarial one - with an architecture. Instead, think of your job and your boss as if you were their doctor. You can advise them to stop smoking, you can even sincerely want them to stop smoking. But smoking is their choice, and their responsibility. Their demise will be the cautionary tale that inspires your next patient to quit.

Embrace the good parts

Of the different architectures that can be chosen, Medallion probably has some advantages for your use case too. It is wonderfully unprescriptive, which means a lot of decisions are left to the implementation. How can data be joined? Up to you. Who can have access? Up to you. Of course, non-prescriptive can also imply there will be fights about it, but these questions can be decided based on needs rather than reading of ancient texts.

Mitigations and compromises

Once the Denial, Anger, Bargaining and Depression have given way to Acceptance, there are new battles to be fought. We have previewed one already: How do you operationalize Medallion? We will come back to that one.

Identify your actual need

One thing IT people love doing, which I partially support, is to wince when business people suggest a technical solution. It is much better to try to understand the actual business need and dictate the technical solution based on that attempt at understanding business than to let business dictate a solution based on their attempt at understanding IT. So let’s understand what needs a more thoughtful architecture would solve:

  1. A more logical unit for access control - the system user that the Streamlit app uses can get access to its own schema rather than some arbitrary set of tables.
  2. A way to indicate purpose or topic (you decide which) instead of indicating an imagined data quality aspect.
  3. Less time wasted debating the concept of a substr() and how that maps to the arbitrary Bronze/Silver/Gold terminology.
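To make need 1 concrete, here is a rough sketch that generates the grants for one app role on one schema. The database, schema and role names are hypothetical, and the statements follow Snowflake-style GRANT syntax as I understand it; check them against your own platform before running anything.

```python
from __future__ import annotations


def grant_statements(database: str, schema: str, role: str) -> list[str]:
    """Build read-only grants that scope a role to a single schema."""
    qualified_schema = f"{database}.{schema}"
    return [
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role};",
        f"GRANT USAGE ON SCHEMA {qualified_schema} TO ROLE {role};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {qualified_schema} TO ROLE {role};",
        # Also cover tables created later, so the grant does not go stale.
        f"GRANT SELECT ON FUTURE TABLES IN SCHEMA {qualified_schema} TO ROLE {role};",
    ]


# Hypothetical names: one schema per Streamlit app, one role per app.
for statement in grant_statements(
    "ANALYTICS", "STREAMLIT_SALES_APP", "STREAMLIT_SALES_APP_ROLE"
):
    print(statement)
```

When the schema is the unit of purpose, access control is four statements per audience instead of a hand-maintained list of tables.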

Find the right place for it

If IT wants to use Medallion, that might be fine. But don’t let Medallion materialize in database/schema names without careful consideration. Does it work better as a tag? Can you have a Confluence page that categorizes each database as either Bronze, Silver or Gold? Can it be your top-level model folders in dbt? All of these options not only limit Medallion’s blast radius, they also give it scope. The terms and definitions belong to IT.
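The tag option could look something like the sketch below: the Medallion layer lives in a small mapping owned by IT, and the schema names stay free for something useful. The schema names and the tag name `GOVERNANCE.TAGS.MEDALLION_LAYER` are made up for illustration, and the generated `ALTER SCHEMA ... SET TAG` statements assume Snowflake-style object tagging with a tag that already exists.

```python
from __future__ import annotations

ALLOWED_LAYERS = ("bronze", "silver", "gold")

# Hypothetical schemas, named for purpose rather than for Medallion layer.
LAYER_BY_SCHEMA = {
    "RAW.SALESFORCE": "bronze",
    "ANALYTICS.CUSTOMER_360": "silver",
    "ANALYTICS.FINANCE_REPORTING": "gold",
}


def tag_statements(
    layer_by_schema: dict[str, str],
    tag_name: str = "GOVERNANCE.TAGS.MEDALLION_LAYER",  # assumed, pre-created tag
) -> list[str]:
    """Build one tagging statement per schema, in alphabetical schema order."""
    statements = []
    for schema, layer in sorted(layer_by_schema.items()):
        if layer not in ALLOWED_LAYERS:
            raise ValueError(f"{schema}: unknown layer {layer!r}")
        statements.append(f"ALTER SCHEMA {schema} SET TAG {tag_name} = '{layer}';")
    return statements


for statement in tag_statements(LAYER_BY_SCHEMA):
    print(statement)
```

Users browsing the catalog see `FINANCE_REPORTING`; anyone in IT who needs to know it counts as Gold can query the tag.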

Adding purpose and topic

In modern data platforms (Snowflake/Databricks), we have two hierarchy levels above the table/view: Schema and Database (or Database and Catalog if you want to follow Databricks’ nomenclature, which I won’t do here). Even if one of these levels has been reserved for Medallion, there is still a place for purpose or audience or topic or whatever makes sense to your actual need. This would address need #2 and/or #1 above, depending on how you want to organize things.
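A small sketch of that split, assuming the database level has been reserved for Medallion and the schema level is free for purpose. All table, schema and class names here are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TableLocation:
    layer: str    # database level, reserved for Medallion in this sketch
    purpose: str  # schema level, named after the actual need or audience
    table: str

    @property
    def qualified_name(self) -> str:
        return f"{self.layer}.{self.purpose}.{self.table}".upper()


# Hypothetical tables, for illustration only.
TABLES = [
    TableLocation("gold", "streamlit_sales_app", "daily_revenue"),
    TableLocation("gold", "finance_reporting", "fct_invoices"),
    TableLocation("silver", "user_sandbox", "scratch_churn_model"),
]


def group_by_purpose(tables: list[TableLocation]) -> dict[str, list[str]]:
    """Index fully qualified table names by the purpose they serve."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for location in tables:
        grouped[location.purpose].append(location.qualified_name)
    return dict(grouped)


for purpose, names in group_by_purpose(TABLES).items():
    print(purpose, names)
```

The layer is still in the name for those who want it, but the part a user navigates by is the purpose.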

Avoid wasting time

There are dozens of articles that try to interpret and operationalize the ancient texts. Pick one. A short but clear one. And stick with it. Don’t worry about what ends up where. Nobody cares, especially if you are able to use schemas to denote purpose/topic.

Medallion for domains

The claim of data quality improvement that is inherent in Medallion can be better defended when looking at specific domains. As each domain has its own purpose, the same dataset can be silver in one domain and gold in another - because quality is better defined as “fitness for purpose”. This makes the most sense when data engineers are embedded in the domain team and have a non-trivial level of domain knowledge.

Yet, in that scenario, data engineers will likely come up with a better architecture than Medallion because they understand the use cases and particulars of the domain. Medallion is defensible exactly in the situations where you least need it.