Lakehouse series: Introduction

April 22, 2025

In January 2021, Databricks published a white paper [1] introducing a novel data platform architecture they called the Lakehouse. This marked a significant milestone in the evolution of data platforms built to serve OLAP use cases - such as powering Business Intelligence dashboards, training Machine Learning models, and performing a wide range of analytics on enterprise data.

Since then, the Lakehouse architecture has been actively developed and adopted across the tech community, with a growing ecosystem of tools evolving around it. These tools span a variety of categories: data processing and query engines, data catalogs, workflow orchestrators, and - crucially - table formats. Yet even today, the vision of a complete, end-to-end Lakehouse platform is still coming together. There is no universally accepted blueprint for building one easily and reliably within an organization.

In this blog series, I’ll aim to describe what a modern Lakehouse system looks like today and explore the key tools and frameworks needed to build one in practice. I’ll begin with a brief look at its history, purpose, and high-level architecture. Then, I’ll dive into each of its core layers and components, examining the open-source frameworks commonly used to implement them. Finally, I’ll bring everything together to show how these pieces fit into a complete, end-to-end system.

Even though the concept of the Lakehouse data platform has been around for several years - and there’s no shortage of blog posts and books on the topic - I believe there’s still a gap: a holistic, practical explanation that can serve as a solid starting point for anyone looking to build their own Lakehouse system.

A little about me. I’ve been working with databases, data platforms, and data processing since 2009. I wrote my thesis for a Specialist degree on data management systems, and my Master’s thesis on the Lambda architecture. Since 2018, I’ve been building Lakehouse systems and other data platforms and pipelines.

I helped build my first Lakehouse platform at Zalando - back when the term Lakehouse didn’t exist yet and the Delta format had just been released. Later, I supported the development of another one at HelloFresh. Over the past few years, I’ve been deeply involved in building a Lakehouse platform on top of Iceberg at Datadog.

References

[1] Michael Armbrust, Ali Ghodsi, Reynold Xin, Matei Zaharia. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. CIDR 2021.