The Other Platform Question


The hype has subsided now, but you can still see it: the stack fixation. Data teams comparing their data stacks, as if some magical combination of open-source and SaaS tools would solve all the problems. Fortunately, few really believed a SaaS would save the world, but it could seem like it at times. Because tools are easy to talk about. The second easiest thing to talk about is how we shouldn’t talk about tools. But if not tools, then… what?

After seeing a few data platforms, one general question sticks out: Where do you set the bar? In part, this is the age-old question of code vs no-code. Since we all use dbt nowadays, we have at least in part answered that question: SQL is a minimum requirement on most platforms. But beyond that, questions abound.

But code isn’t the same as complexity. Ask anyone who has used Azure Data Factory and tried to implement incremental loads or simply add a loaded_at field. Compare that to Fivetran, Meltano, Airbyte or whatever, where stuff like incremental loads, schema drift and metadata fields come for free as part of the abstraction.
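To make "incremental load with a loaded_at field" concrete, here is a minimal sketch in plain Python. It is not how any of the tools above implement it; the function name, the `updated_at` cursor column and the list-of-dicts "tables" are assumptions made up for illustration.

```python
from datetime import datetime, timezone


def incremental_load(source_rows, target_rows, watermark, cursor_field="updated_at"):
    """Append rows newer than the watermark to target_rows and return the new watermark.

    Hypothetical helper: rows are plain dicts, and cursor_field is assumed to hold
    values that sort chronologically (ISO date strings in the example below).
    """
    # One load timestamp per batch, stamped on every row as a metadata field.
    loaded_at = datetime.now(timezone.utc).isoformat()

    # Only pick up rows that changed after the previous run.
    new_rows = [row for row in source_rows if row[cursor_field] > watermark]
    for row in new_rows:
        target_rows.append({**row, "loaded_at": loaded_at})

    # If nothing new arrived, the watermark stays where it was.
    return max((row[cursor_field] for row in new_rows), default=watermark)


source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-03"},
]
target = []
watermark = incremental_load(source, target, watermark="2024-01-02")
```

Only the row with id 2 is loaded here, and running the same load again with the returned watermark adds nothing. The point is less the code than who has to write and maintain it: with the managed tools, the watermark and the metadata field are part of the abstraction.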

Another part of that question is the division of responsibility between the platform team and domain teams. In cloud-land, there is a very classic illustration of the shared responsibility model, delineating what the cloud vendor is responsible for and what the customer is responsible for. There are dozens of illustrations of this. Here is Microsoft’s:

[Figure: Microsoft shared responsibility model]

What might this look like in a data context? Some topics to consider:

  • Access control. Are data teams responsible for granting necessary data access?
  • CI pipelines. Are teams responsible for setting up a CI/CD system they like?
  • Data loss prevention. Is the data team responsible for setting retention (failsafe) times and preventing accidental deletion?
  • Developer best-practices. How much guidance should the platform team be expected to give? Can the platform team dictate ways of working?
  • Development environments. Does the platform provide a development environment? Is it mandatory or just an offer?
  • Platform functionality. What does the platform offer? If the platform doesn’t offer it, can the data teams provide it themselves?
    • Transformation?
    • Data ingest?
    • Metadata catalog?
    • Reverse ETL?
    • Orchestration?

A second aspect is the self-service vs white-glove approach. Self-service sounds good, but it is difficult to create a good self-service platform that doesn’t require a fairly competent data team. The platforms that aim to empower the one- or two-person teams around the business need to provide very smooth onboarding, which is hard to accomplish without handholding.

Perhaps it is possible to define the equivalents of IaaS, PaaS and SaaS for data platforms. By that I don’t mean direct equivalents, but similar levels of support.

| Functionality | Data-on-a-platter | Data-on-a-service | Data-on-a-platform |
|---|---|---|---|
| Business logic (SQL) | Data team | Data team | Data team |
| Ways of working | Platform team | Data team | Data team |
| CI/CD system | Platform team | Shared | Data team |
| Access management | Platform team | Platform team | Data team |
| Data retention | Platform team | Platform team | Data team |
| Development environment | Platform team | Platform team | Data team |
| Data catalog | Platform team | Platform team | Platform team |
| Database | Platform team | Platform team | Platform team |
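
If you want to argue about the matrix in a pull request rather than in a meeting, it can also be written down as data. This is only a sketch: the cell values are copied from the table above, while the names `MODELS`, `RESPONSIBILITY`, `owner` and `data_team_duties` are invented for the example.

```python
# Column order of the responsibility table.
MODELS = ("Data-on-a-platter", "Data-on-a-service", "Data-on-a-platform")

# One entry per row of the table: functionality -> owner under each model.
RESPONSIBILITY = {
    "Business logic (SQL)": ("Data team", "Data team", "Data team"),
    "Ways of working": ("Platform team", "Data team", "Data team"),
    "CI/CD system": ("Platform team", "Shared", "Data team"),
    "Access management": ("Platform team", "Platform team", "Data team"),
    "Data retention": ("Platform team", "Platform team", "Data team"),
    "Development environment": ("Platform team", "Platform team", "Data team"),
    "Data catalog": ("Platform team", "Platform team", "Platform team"),
    "Database": ("Platform team", "Platform team", "Platform team"),
}


def owner(functionality, model):
    """Return who owns a functionality under the given platform model."""
    return RESPONSIBILITY[functionality][MODELS.index(model)]


def data_team_duties(model):
    """List everything a data team owns, alone or shared, under a model."""
    column = MODELS.index(model)
    return [
        functionality
        for functionality, owners in RESPONSIBILITY.items()
        if owners[column] in ("Data team", "Shared")
    ]
```

Calling `data_team_duties("Data-on-a-service")` returns business logic, ways of working and the CI/CD system, which is a fairly direct answer to what a platform of that kind demands of its users.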

You don’t have to agree with these. Feel free to add your own as well, both rows and columns. But it might be a starting point. A catalyst for debate. And for now, my markdown table is content with that.

Additional tables can be added, such as support level, platform capabilities, the personas it targets, etc. But my rambling is done. Instead of talking about what software we use, we can talk about what demands platforms make of their users.