Data Products Once Again


A little while ago there was a small thread on Mastodon about data products, and David Jayatillake ended up writing a Substack post explaining it. David’s posts usually land somewhere on the spectrum between “interesting” and “not my wheelhouse” for me, but this one seemed a little strange.

This isn’t the first time I have seen “what is a data product” discussions, and two very different answers are tempting.

One is to say that the data product, as popularized in recent years, is a concept within data mesh, which was coined by Zhamak Dehghani, so a data product is whatever Zhamak says it is. In public, this has often been the gist of my answer. There are many books about data mesh, so just go read one of them.

I also, sometimes, ask if there are thousands of Walmart and Aldi employees around the world standing in deep thought in front of rows of frozen pizzas, wondering what a product really is. Most likely, there aren’t. Most likely, that would be a dumb question and a product is more or less anything someone is willing to pay money for and someone else is willing to sell. So if we know what a product is, maybe the other half of the question is “what is data”? Despite the profoundness of the question, this might not be a fruitful avenue of inquiry.

The philosophical grocery store employee

I am generally a fan of the data mesh concept, which is where the “data product” term emerged in recent years (before that, Data-as-a-Service was all the rage). I like to think about data teams in different parts of a company sharing data directly without having some grand data warehouse as an intermediary - and without the accompanying six-month response time to any request and the near certainty that nobody working with the data warehouse has any idea what the data actually means. Data products are good, and it is great to have teams that are knowledgeable about both the data and the technology.

From the data mesh literature, there are some high-level criteria for data products:

  • It is some sort of information.
  • It is addressable, which is a fancy way of saying it shouldn’t reside on your C:/ drive.
  • It should have some metadata, so that people have a clue about what it is.
  • It should have one or more “interfaces”, or, in normal words, ways to access it.

A sufficiently liberal interpretation of these criteria could let us define a cat as a possible data product, which makes me less surprised that data products to some people include dashboards, ML models and maybe ChatGPT.

A cat with a name tag
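
To make the point concrete, here is a minimal sketch of the criteria above as a naive checklist. Everything in it is an assumption made for illustration: the `DataProduct` class, its field names, and the `unmet_criteria` helper are hypothetical and not taken from any data mesh book or tool. The first criterion ("some sort of information") is not checked at all, because I don't know how to check it.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor; the field names are illustrative only."""
    name: str
    address: str                                     # where consumers can find it
    metadata: dict = field(default_factory=dict)     # what it is, who owns it
    interfaces: list = field(default_factory=list)   # ways to access it


def unmet_criteria(product: DataProduct) -> list:
    """Return the names of the high-level criteria the descriptor fails."""
    problems = []
    # "Addressable" here just means: not a path on somebody's local drive.
    if not product.address or product.address.lower().startswith(("c:", "file:")):
        problems.append("addressable")
    if not product.metadata:
        problems.append("metadata")
    if not product.interfaces:
        problems.append("interface")
    return problems


orders = DataProduct(
    name="orders",
    address="s3://example-bucket/orders/",  # made-up location
    metadata={"owner": "sales-analytics", "description": "One row per order"},
    interfaces=["parquet", "sql"],
)
local_export = DataProduct(name="export", address="C:/Users/me/export.csv")
cat = DataProduct(
    name="cat",
    address="the sofa, living room",
    metadata={"name_tag": "Whiskers"},
    interfaces=["petting"],
)

for candidate in (orders, local_export, cat):
    print(candidate.name, unmet_criteria(candidate))
```

The file on the C:/ drive fails every check, while the cat passes all of them just as easily as the orders table. A checklist this loose cannot tell them apart, which is roughly the problem.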

Several Mesh-ish implementations simply say that products must reside in their common database. This is probably heresy to the true Mesh enthusiasts, but it has the advantage of being doable. To simplify even further, some have said that as a starting point a data product is a single table. Others are comfortable with serving a data model up as a data product, or with saying that their dataset resides in a database for those who like that, or in a CSV/Parquet file in a bucket somewhere for those who prefer that.

Common to these implementations is that a data product is the information, combined with an “interface” which is some combination of the API and the data format.

Interfaces come in many forms though. The “I” in both GUI and API is for “interface”, so the technical definition of a data product doesn’t rule out a dashboard as a data product interface. But is data that only has a dashboard as an interface a data product? The problem with answering “yes” to that question is that it goes against the original problem data mesh intended to solve: sharing data between analytical teams. Implicitly, this means joining data and generally letting the analytics teams analyze the data in whatever fashion they prefer. Feel free to add a dashboard to your data product, but your consumers probably want programmatic access.

We think of data products as something that companies share internally, and it seems that few take inspiration from the many companies that actually buy or sell data. Yet I suspect most data people live in organizations that buy data. And if they buy it, it kind of has to be a product. So far, the evidence suggests that it doesn’t take much for something to be a data product. Some companies even sell “We OCR’ed these freely available government PDFs for you” as a data product.

The requirements of these commercial data products are usually:

  • A producer that enjoys high trust in the market.
  • Unique data or data that is processed in a way that takes resources/expertise.
  • Some data format, usually CSV files but sometimes fun stuff like Excel or XML (no, I haven’t seen Parquet yet).
  • Some exchange format, usually SFTP but increasingly Snowflake Marketplace or similar solutions. And yes, REST APIs too.
  • Some form of documentation of what the data actually is. Unlike what “enterprise metadata architects” dictate in their PowerPoints, this documentation is often… imperfect.
  • Ideally, some notification system when there are changes to the data structure, such as added or removed columns.
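
The last item on that list is the easiest one to sketch. The snippet below is a minimal illustration, assuming the product is delivered as CSV and that comparing header rows between two deliveries is enough to spot structural changes. The function names, the `daily_prices` product and its columns are all made up for the example, and how the notice is actually sent (email, chat, a changelog file) is left out.

```python
import csv
import io


def csv_header(text):
    """Return the column names from the first row of a CSV document."""
    return next(csv.reader(io.StringIO(text)))


def schema_changes(old_columns, new_columns):
    """Compare two column lists and report which columns were added or removed."""
    old, new = set(old_columns), set(new_columns)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def change_notice(product_name, old_columns, new_columns):
    """Build a human-readable notice, or return None if nothing changed."""
    changes = schema_changes(old_columns, new_columns)
    parts = []
    if changes["added"]:
        parts.append("added: " + ", ".join(changes["added"]))
    if changes["removed"]:
        parts.append("removed: " + ", ".join(changes["removed"]))
    if not parts:
        return None
    return f"Schema change in {product_name}: " + "; ".join(parts)


# Two hypothetical deliveries of the same product.
yesterday = "isin,price,currency\nXX0000000001,101.5,EUR\n"
today = "isin,price,price_date\nXX0000000001,101.7,2024-01-02\n"

notice = change_notice("daily_prices", csv_header(yesterday), csv_header(today))
print(notice)
```

This only catches added and removed columns. A renamed column shows up as one of each, and a column that silently changes meaning or unit shows up as nothing at all, which is where the imperfect documentation comes back in.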

To end with David’s example of Bloomberg as a data vendor: they don’t just offer the Bloomberg Terminal but also loads of other data, through REST APIs, SFTP, and even isolated, integrated notebook environments with access to data through their own query language. Some of these are clearly data products, some of them might be hybrids, but if the Bloomberg Terminal is a data product, I have trouble coming up with an operationalizable definition that won’t include their news division as well.

For all my reliance on data mesh to explain data products, I am actually happy that there are many different definitions. The data mesh community has had a tendency to be a little religious, reading Zhamak and a few others like scripture. While it can be annoying to read many different definitions of the same concept, I suspect the alternative is worse.