I have written (ranted?) about data products before, in part triggered by David Jayatillake's interesting article on credit scores as data products: https://davidsj.substack.com/p/risky-data. Here I want to structure my thoughts about the data products I have been making for years of my career: Official Statistics.
So similar, so different
Analogies to companies that sell data (credit agencies, ESG data providers, financial data providers, etc.) have not been used as inspiration, or even as a point of reference, by the data product and data mesh crowd.
My own attempts at drawing parallels have been greeted with shrugs. Maybe there is a good reason for this: for businesses that sell data, data is their business. Data quality, production pipelines, prediction models and data catalogs are likely topics for the CEO to grapple with. For most companies, though, data is a support function. A means to an end.
Data products everywhere
Zhamak Dehghani’s 2019 post on Data Mesh (https://martinfowler.com/articles/data-monolith-to-mesh.html) felt strangely familiar to me. I had been working at Statistics Norway (SSB) for over 8 years at the time, and what Zhamak described was very similar to how SSB was organized internally and to SSB’s relationships to consumers.
Different statistics were produced by different teams. Data collection happened in a myriad of ways (surveys, register data, administrative data, etc.). Data was processed differently depending on the source and purpose, and the teams maintained very specific competence in the subject matter of their statistics. And so the teams themselves decided what data to produce and how to produce it, and they owned the data.
There was no central data warehouse. The common denominator from an infrastructure perspective was a filesystem, some databases, some servers and some shared software. Teams could choose for themselves how they worked. SAS, SQL and R were available, and data could be stored either on disk or in a database. For significant new projects there was a lot of politics around what system to use, so a team might not have a real choice. But whoever made the choice, it was not dictated by what the other teams did.
Instead of a central data warehouse, data was loaned out autonomously between teams. For instance, education data was relevant context for pretty much every team that had data about people. Lots of teams needed the business register. Use of data across teams was agreed directly between the parties. Once in a blue moon some conflict would arise, which would get escalated to a higher manager.
Data product wasn’t a term, but there were products, and they often grew organically: someone would ask for some data, and we might have to construct something to fit their purpose. If more people asked for roughly the same data, it made sense to maintain the script (pipeline) that constructed the dataset and run it as soon as new data became available. So the answer to “what is a data product” was simple: it was whatever enough people needed. Some of that data was hardly documented, sometimes because it was practically self-explanatory, sometimes because people are lazy. Other data products had over 1000 pages of documentation, and some even had courses to teach people how to use them.
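To make that concrete, here is a minimal sketch of what such an organically grown product might look like: a small script that rebuilds a requested dataset and is simply re-run whenever a new delivery lands. The file names, columns and filter are hypothetical, not SSB's actual layout.

```python
# Hypothetical sketch of an organically grown data product: a script that
# rebuilds a requested dataset whenever new source data arrives.
# File and column names are illustrative only.
import pandas as pd

def build_employment_extract(source_path: str, output_path: str) -> None:
    raw = pd.read_csv(source_path)  # latest source delivery
    extract = (
        raw.loc[raw["year"] >= 2010, ["person_id", "year", "industry", "wage"]]
           .sort_values(["person_id", "year"])
    )
    extract.to_csv(output_path, index=False)  # the "product" that others read

if __name__ == "__main__":
    # Re-run manually, or on a schedule, as soon as a new delivery lands.
    build_employment_extract("employment_raw.csv", "employment_extract.csv")
```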
Perhaps because the data was the business, there wasn’t a focus on self-service. The data mesh community focuses a lot on self-service, but if we were going to use data from another team in our work, it was well worth sitting down with someone from the team that owned that data to make sure we understood it. Maybe that was only possible because the data was the business, but I don’t see why self-service should be an absolute requirement. Documentation can be outdated, and there are often nuances that haven’t been covered, or whose consequences are hard to grasp, that are vital for the specific use case. Talking to people isn’t a failure. It is interesting, and everybody learns from it.
With only a few exceptions, data products were simple tables. The “customers” were usually analysts with limited knowledge of data modeling, so simple data was more useful than artisanal data models.
Public data
While one might debate the details of data products inside SSB, it is clearer that the data SSB publishes can be called data products.
The data that is published is pretty important. I have seen many news headlines, and at least one marketing campaign, based on data I published. We once crashed our web server because it couldn’t handle the traffic from news sites. We were that interesting. Lots of contracts are adjusted based on our price and cost indices. Currencies and stock exchanges regularly react when the consumer price index is published. Sometimes I am amazed that this was just a mediocre, entry-level government job and all this was totally normal.
There is a wide range of users, just as there is a wide array of data products (statistics). For businesses, price indices are perhaps the most important. These users were uninterested in anything other than the latest number, and valued fast publication. To them, data was a commodity and aged like fish. Another big group was academics and other socially inquisitive members of the public. To them, getting the latest data quickly wasn’t as important, but access to long time series was valuable. Generally, I believe SSB underestimates the value of these long time series. It was too easy to discontinue one data product (time series) and create a new one whenever a semi-significant change in the data happened. This leaves academics with the task of assembling longer time series from shorter fragments, a job SSB is much better positioned to do.
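What "assembling longer time series from shorter fragments" means in practice is splicing: rescaling a discontinued index fragment so it lines up with its successor in an overlap period. A minimal sketch, with made-up numbers and the assumption that exactly one period (2015) is shared:

```python
# Sketch: splicing a discontinued index series onto its successor.
# The two fragments are assumed to share one overlap year (2015); the old
# fragment is rescaled so the combined series is continuous. Values are made up.
import pandas as pd

old = pd.Series({2012: 95.0, 2013: 97.2, 2014: 99.1, 2015: 100.8})  # old base
new = pd.Series({2015: 100.0, 2016: 101.8, 2017: 103.5})            # new base

link_factor = new[2015] / old[2015]  # rescale old levels onto the new base
spliced = pd.concat([old.drop(2015) * link_factor, new]).sort_index()
print(spliced)
```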
Assigning value to data is difficult. The data we published was free, of course, but in deciding what to spend time on we somehow had to assess what was most valuable, and media attention was often used as a proxy. Apart from the CPI, publishing price indices was often a thankless task, even if they were the basis for adjusting multi-million dollar contracts. What we saw was all there was. And some back-office employee making a quarterly adjustment to a contract on cement service equipment leases does not generate headlines. The same was true for longer time series.
The data SSB publishes has a simple structure. There isn’t any kind of data warehouse for public data, and if you want to link the data you have to do it yourself. What we published was for all practical purposes simple CSV files.
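"Do it yourself" is not complicated, but it is real work that every user repeats. A sketch of what that linking typically looks like, with hypothetical file names standing in for two published tables that are assumed to share region and year columns:

```python
# Sketch: the user links published tables themselves.
# Both CSVs are hypothetical stand-ins for published statistics, assumed to
# share 'region' and 'year' columns.
import pandas as pd

unemployment = pd.read_csv("unemployment_by_region.csv")  # region, year, rate
income = pd.read_csv("median_income_by_region.csv")       # region, year, income

linked = unemployment.merge(income, on=["region", "year"], how="inner")
linked.to_csv("unemployment_vs_income.csv", index=False)
```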
The ability to link data has been the white whale for some. For them, the dream is to let users seamlessly connect data about unemployment, income distribution, defaults and more. Perhaps wisely, that hasn’t been a priority. It would likely be incredibly costly and a waste of taxpayer money. From a data product perspective, would such a feature be a new product? A new interface? A newfound, horribly implemented, instant-legacy data warehouse over HTTP?
Takeaways
One of the most frustrating things I did was to try to communicate complex data to other teams. Simple terms like “uniqueness”, “foreign key”, “normalised”, “SCD2” were missing. After experiencing the real world, I am hopeful that this lack of a common vocabulary, and lack of understanding of data modeling, was simply a quirk of SSB.
Secondly, because of privacy, a lot of data products had to be tailored. If someone needed data on employment from 2005 to 2015, we couldn’t hand them a dataset that contained data on employment from 1985 to 2020. And sometimes the product was even more custom, like “full education history for people who got certified as a mechanic between 1994 and 1998”. We had full education history as a data product, but there is no practical way to create arbitrary self-service data products like that.
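A sketch of what such a tailored, privacy-constrained extract might look like, assuming hypothetical table and column names: the cohort is defined first, then the full history is restricted to it.

```python
# Sketch of a tailored extract: full education history, but only for people
# certified as mechanics between 1994 and 1998. Names are hypothetical.
import pandas as pd

education = pd.read_csv("education_history.csv")  # person_id, year, course, ...
certs = pd.read_csv("certifications.csv")         # person_id, trade, cert_year

cohort = certs.loc[
    (certs["trade"] == "mechanic") & certs["cert_year"].between(1994, 1998),
    "person_id",
]
extract = education[education["person_id"].isin(cohort)]
extract.to_csv("education_mechanics_1994_1998.csv", index=False)
```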
I also keep coming back to the need to talk to people. To make an analogy to real products: how many people buy a car without talking to a salesperson? I know some do, but most of us have questions and aren’t content simply browsing the website. Cars are a significant investment, after all. Data products are too. They are building blocks for the things you create. You want to make sure they can handle the load.
Because data is loaned out between teams again and again, some of the data products were themselves amalgamations of several different data products. For employment data, having some basic details on the employer is important. So the employment stats teams borrowed from the business register team, and loaned that joined data out to a third team. In data mesh parlance, this is called an intermediate data product. And the vast majority of products were intermediate products in some way or another. Somehow intermediate data products have become the middle child of data mesh. So many of my answers when talking about data mesh or data products, especially about the gotchas, are “make an intermediate data product”.
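A minimal sketch of that employment example, with illustrative file and column names: the employment team enriches its own data with a few fields borrowed from the business register, and that joined dataset is what gets loaned out to the next team.

```python
# Sketch of an intermediate data product: employment data enriched with a few
# fields from the business register. File and column names are illustrative.
import pandas as pd

employment = pd.read_csv("employment.csv")                 # person_id, org_id, wage, ...
business_register = pd.read_csv("business_register.csv")   # org_id, industry, size

intermediate = employment.merge(
    business_register[["org_id", "industry", "size"]],
    on="org_id",
    how="left",  # keep every employment row, even without a register match
)
intermediate.to_csv("employment_with_employer.csv", index=False)
```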
The Gap
While Statistics Norway looks more like a data mesh than a data warehouse, there are some gaps that would need to be filled in before we could call it a data mesh.
The lowest hanging fruit is a data catalog. It is probably telling that the data catalog, for all practical purposes, was the org chart. In an organization where the data is the business, this isn’t as bad as it would be in a normal company. You can pick up the phone and call any number, and the person who picks up thinks about data from 9 to 5. Possibly longer. But a searchable data catalog would be better.
A bigger gap is the aforementioned lack of common knowledge and common terms. Talking about data across teams was hard work, data structures varied, and a lot of data could have been moderately restructured to create a much more cohesive system and lower the overhead of using data products.
This lack of knowledge was probably related to a lack of agreement about the divide between IT and the statistical departments. Some employees and even managers would not accept that they should ever have to see a single line of code. They were BI analysts, but without the word “BI” in their vocabulary. So the IT department got support tickets about altering a WHERE statement. Other teams wouldn’t let anyone in the IT department even touch their pipelines. To them, the role of IT was to keep the servers humming and pay the SAS license. I have often wondered where these different attitudes come from. If your mental model is that SSB is like any other business, and you see the statistical teams as business teams, it makes sense to say that IT should handle the pipelines. But in a data mesh the teams must be cross-functional: producing data involves data processing and requires data engineers and data scientists. I wonder how this is seen in other companies where data is the business.
But perhaps the biggest obstacle was that nobody accepted that what we had was accidentally similar to a data mesh. Nobody wanted a data mesh, nobody wanted a data warehouse, nobody wanted a data lake. Some argued for a weird graph-lake-semantic-overlay system that would have to be created from scratch and was intended to abstract away even the concept of a dataset.