In the shadow of LLMs

A few years ago, helped along by zero interest rates, the data space was buzzing. A lot of new companies, mostly SaaS, a lot of new features, frameworks and libraries. To borrow a phrase, it felt like running in front of a train.

Now, with higher interest rates and LLMs devouring most of the VC money, the data space is a lot quieter. There doesn’t seem to be a MAD data landscape this year, but if a new one does come out, we would for the first time see fewer entrants - particularly if we omit the LLM-specific stuff.

Metaplane has been bought by DataDog. Metaphor has been bought by KPMG. Stemma has been bought by Teradata. Data.world has been bought by ServiceNow. Exciting new companies are consumed by old boring enterprise’y ones.

And there are few new entrants. Our eyes are fixed on LLMs, agents, Copilot IDEs and vibe-coding.

What we are talking about

While there are few new entrants, newly popular entrants or major new features, there are some bright spots.

  • DuckDB is all the rage, helped along a little by MotherDuck, the duck-as-a-service provider. There is something comical about a bunch of database IT people fawning over a database designed for people who can’t stand databases. But DuckDB is cool, lightweight and fast.
  • Iceberg is perhaps even bigger nowadays. For whatever reason, everybody needs something that can walk and talk like a database table but doesn’t have a database attached and can be read and written to by any number of systems. So the database has been deconstructed into storage, structure, catalog and bring-your-own-compute.
  • Airflow 3 has been released, with some cool new features including data assets. Dagster made a bet on software-defined assets several years ago, and became a darling of many data engineers despite (or perhaps because of) the pile of abstractions and complexity it brings with it.
  • SQLMesh seems to be slowly gaining interest. In short, SQLMesh is an interesting option if your data volumes make it inadvisable for your developers to run dbt build uncritically. It is also interesting if you don’t feel like using Jinja. More generally, SQLMesh seems to have tapped into a wish for database-agnostic SQL through transpiling (translating) SQL queries.
  • dlt seems oddly popular as a Python-native extract/load tool. It writes its state to the target database, which clutters the target, but if you are OK with that, it might be a useful option for anyone who doesn’t like the GUI SaaS options but wants some minimum of convenience when moving data.
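
To make that dlt trade-off concrete, here is a minimal sketch of the general pattern of keeping loader state inside the target database. It is not dlt's API: it uses only Python's built-in sqlite3 as a stand-in target, and the `_loader_state` table, the `users` table, the `load_incremental` function and the sample rows are all hypothetical names made up for illustration.

```python
import sqlite3

# Hypothetical source rows; in practice these would come from an API or another database.
SOURCE_ROWS = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "carol"},
]

STATE_TABLE = "_loader_state"  # hypothetical name, not the table dlt actually creates


def load_incremental(target: sqlite3.Connection, rows: list[dict]) -> int:
    """Load rows whose id is above the stored cursor; return how many were loaded."""
    target.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
    # The bookkeeping table lives right next to the data: convenient, but this is the clutter.
    target.execute(
        f"CREATE TABLE IF NOT EXISTS {STATE_TABLE} (pipeline TEXT PRIMARY KEY, last_id INTEGER)"
    )
    found = target.execute(
        f"SELECT last_id FROM {STATE_TABLE} WHERE pipeline = ?", ("users",)
    ).fetchone()
    last_id = found[0] if found else 0

    new_rows = [r for r in rows if r["id"] > last_id]
    if not new_rows:
        return 0

    target.executemany("INSERT INTO users (id, name) VALUES (:id, :name)", new_rows)
    # Advance the cursor so the next run skips what was just loaded.
    target.execute(
        f"INSERT OR REPLACE INTO {STATE_TABLE} (pipeline, last_id) VALUES (?, ?)",
        ("users", max(r["id"] for r in new_rows)),
    )
    target.commit()
    return len(new_rows)


def list_tables(target: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in the target, sorted."""
    query = "SELECT name FROM sqlite_master WHERE type = 'table'"
    return sorted(name for (name,) in target.execute(query))


target = sqlite3.connect(":memory:")
first = load_incremental(target, SOURCE_ROWS[:2])   # initial load of two rows
second = load_incremental(target, SOURCE_ROWS)      # only the new row is loaded
third = load_incremental(target, SOURCE_ROWS)       # nothing new, nothing loaded
tables = list_tables(target)                        # data table plus the state table
```

The upside is that no separate state store is needed, since the target remembers where the last run stopped. The downside is visible in `tables`: anyone browsing the target sees the loader's bookkeeping table alongside the actual data.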

What we aren’t talking about

There are some topics that seem to have disappeared.

  • Data catalogs used to be a whole area of study. Metaphor, SelectStar, Stemma and Acryl were launched incredibly close to each other. Better metadata was by far the most mentioned answer on Tobias Macey’s Data Engineering Podcast when interviewees were asked what the biggest gap in tooling or technology was. But I haven’t heard talk about data catalogs in the wild for a long time.
  • Vector databases. Granted, this has very much to do with LLMs, but at the start of the LLM/RAG craze we were all looking for vector databases - and lots of companies started making them. It was a wild ride. And then Databricks, Postgres and Snowflake added vector type columns. And we all agreed that vector columns were all we needed. Keep it boring. Use pgvector.
  • Data observability. It used to be that I couldn’t go a week without seeing some hot take from Barr Moses, the founder of Monte Carlo Data. Metaplane and Anomalo were far behind, but still interesting entrants. But the hot takes are gone now, and the Monte Carlo blog is drowning in attempts at making data observability about LLMs.
  • The next best notebook. Jupyter’s success left a whole lot of feature requests in its wake. SaaS startups like Mode, Hex and Hyperquery were out there to offer their very own takes on an enterprise-ready notebook-as-a-service. Now, Mode has been bought by ThoughtSpot and Hyperquery by Deepnote. Hex is still out there, somewhere.

This is not the consolidation we ordered

Right from the start, we knew that the drawback of the “Modern data stack” was that everything was a service. Some of us were praying for consolidation. Maybe Dagster could buy Marquez and SelectStar to create some data catalog, observability and orchestration one-stop shop. Or maybe dbt could have bought Fivetran and Preset to offer a slim end-to-end DWH solution. Nothing like that happened. Instead we got Microsoft Fabric vying to be the next Informatica, and a whole lot of strange horizontal acquisitions. DataDog acquiring Metaplane makes sense in a weird way, but it isn’t exactly consolidating anything in the dataverse. I still have no real idea what ServiceNow does, but I wish them good luck with the knowledge graph geeks at data.world.

The ICE stack

ICE is short for Interoperable, Composable, Efficient. Basically a way to say Iceberg with half the syllables. I’d give even odds that this expression catches on. It is ironic that ICE is about decoupling storage formats from compute, which fragments the data platform even further, just as we got tired of the “Modern data stack” for being too fragmented.