June 30, 2025
Introduction
In the previous post of this series, I described the evolution of analytical data platforms. As a result, we arrived at a system called the Lakehouse data platform – a centralized, open-source-driven, cloud-native, scalable, and extensible data platform. In this post, I will explain how such a system is organized, which layers and components it consists of, and how it enables a variety of data-intensive use cases.
Overall architecture
Before diving into the structure and components of a Lakehouse data platform, I would like to outline the big picture by first presenting it as a black box in the context of its input and output. The central idea is to create a unified platform for all company data, collecting and organizing this data into convenient physical and logical formats, and making it available for a variety of data-intensive use cases.
First, we should mention the data systems that companies commonly operate and that contain data useful for analytics and other use cases. In the context of a data platform, we call them data sources. They differ in a variety of ways, most notably in what kind of data they store and how that data is stored and exposed technically. Let's briefly cover the most common sources:
- The first and most common source is an operational database (aka OLTP database) used by a particular production system, such as a microservice.
- Additionally, most companies maintain a central messaging or event system (aka message bus).
- Often, there are custom or legacy systems that output and store data simply in files, typically in common file formats such as CSV, JSON, or Parquet.
- Another source is an API service that can be accessed through specific endpoints to obtain data in batches or as a stream.
We already discussed the use cases in the previous post when examining the evolution of analytical data platforms. Let's briefly list them again here for completeness:
- Business Intelligence and analytics.
- Data Science, Machine Learning, Artificial Intelligence.
- Batch and streaming data extraction.
- External access to the data.
Clearly, we cannot build a Lakehouse data platform in a vacuum – we need some basic infrastructure in place first, such as compute resources, storage systems, and other foundational components. I'll briefly outline these requirements in the following section.
We can now see the Lakehouse data platform from a bird's-eye view, with its inputs, outputs, and clear purpose. You can see it in the diagram below. We will extend it with more details as we cover additional aspects of the platform. At this point, we can move on to the overall structure, layers, and components of the Lakehouse data platform.
General infrastructure
Before we start designing and building any data platform, we need the most basic tools and services to be set up and ready. We can collectively call these the general infrastructure. Let's briefly list the main components of this infrastructure:
- As the foundation for running all applications and jobs of the data platform, we need compute resources, e.g., AWS EC2, GCP GCE, or Azure VMs.
- We need an object storage service that can store files for various applications and serve as the central storage layer for the platform's data. This would typically be AWS S3, Google Cloud Storage, or Azure Blob Storage.
- On top of the compute resources, we need a container orchestration layer that lets us run all types of applications in containers, with resource management provided out of the box. The most common choice for this is Kubernetes.
- We need a workflow orchestration tool to schedule and run data processing jobs and other auxiliary applications. Airflow, Prefect, and Dagster are the most popular choices in this space (see the DAG sketch after this list).
- We need observability over all kinds of processes. The most basic approach would be to set up a range of open source tools like Prometheus, Loki, and Grafana, but many companies use SaaS solutions, such as Datadog, that provide a comprehensive suite of tools for observability, infrastructure monitoring, and beyond.
- We need a general mechanism for identity management, such as LDAP, SAML, OIDC, or OAuth2. A variety of popular open source and paid tools implement these protocols, including Keycloak and Okta.
- We need a CI/CD platform to build artifacts and deploy applications. Popular options today include GitLab and GitHub Actions.
- Finally, we need a tool to store and access secrets, such as database credentials. A common choice for this purpose is Vault from HashiCorp.
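To make the orchestration component a bit more concrete, here is a minimal sketch of an Airflow DAG that schedules a nightly ingestion job. The DAG name, schedule, and spark-submit command are illustrative placeholders, and the sketch assumes Airflow 2.4+ (where the schedule parameter accepts a cron string).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical nightly ingestion pipeline: a single task that submits a Spark job.
with DAG(
    dag_id="nightly_orders_ingestion",   # placeholder name
    start_date=datetime(2025, 1, 1),
    schedule="0 2 * * *",                # run every night at 02:00
    catchup=False,
) as dag:
    ingest_orders = BashOperator(
        task_id="ingest_orders",
        bash_command="spark-submit jobs/ingest_orders.py",  # placeholder job
    )
```

In practice, such DAGs are stored in a Git repository and deployed through the CI/CD platform mentioned above, so that pipelines are versioned like any other code.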
Now we can extend the diagram to include components of the general infrastructure layer:
Data layer
Now, we finally come to the fundamental layer of any data platform – its data layer. This is a collection of mechanisms and tools that store and manage data, as well as the actual data itself. When we break it down into components, we end up with the following:
- Before we can store the data in the object storage, we need to serialize it. For this we can use one of the open file formats with favorable properties for analytical use cases. Formats such as Parquet and ORC are mature, well-supported options.
- As we briefly discussed in the previous post, the core of the Lakehouse data platform is a table format. There are three popular open source table formats currently available: Hudi, Delta Lake, and Iceberg. Table formats offer many useful features, but their most important and foundational capability is providing ACID guarantees when readers and writers interact with them.
- I will use the Iceberg format as a reference throughout this series of blog posts – partly because it has become quite popular, but also because its REST Catalog specification has become a solid standard for the metadata service. This metadata service plays a central role in managing tables and mediating the interaction of readers and writers with the data. It also handles and provides access to various useful metadata that not only helps in query execution but also offers valuable information about the state of the data, files, partitions, etc. You can implement the Iceberg REST Catalog API yourself or use one of the open source solutions such as Polaris, Lakekeeper, Gravitino, or Unity Catalog (see the sketch after this list).
- Finally, with all the above components in place, we need a set of table maintenance processes to keep data and metadata in good condition. These include management of table snapshot history, optimization of data and metadata (e.g., compaction, sorting), enforcement of data retention policies, removal of data for compliance reasons (e.g., GDPR), and more.
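To make these components more tangible, here is a minimal sketch that connects to an Iceberg REST Catalog with PyIceberg, creates a table, and appends a batch of Parquet data. The catalog URI, warehouse path, table name, and input file are placeholders, and the sketch assumes the target namespace already exists.

```python
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType, TimestampType

# Connect to the metadata service (an Iceberg REST Catalog such as Polaris or Lakekeeper).
catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "http://rest-catalog:8181",                # placeholder endpoint
        "warehouse": "s3://company-lakehouse/warehouse",  # placeholder bucket
    },
)

# Declare the table schema; Parquet serves as the underlying file format.
schema = Schema(
    NestedField(1, "event_id", LongType(), required=True),
    NestedField(2, "event_type", StringType(), required=False),
    NestedField(3, "event_ts", TimestampType(), required=False),
)
table = catalog.create_table("raw.events", schema=schema)  # assumes the "raw" namespace exists

# Append a batch of records; the catalog commits a new snapshot,
# giving readers an atomic, consistent view of the table.
table.append(pq.read_table("events_batch.parquet"))
```

Table maintenance tasks, such as snapshot expiration and file compaction, are typically scheduled as separate jobs, for example via the Spark procedures that Iceberg ships (expire_snapshots, rewrite_data_files).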
Let's now focus on the "Lakehouse data platform" section of the diagram and begin with the data layer we've just discussed:
Ingestion layer
Now that we have mechanisms to store and manage data, we can begin ingesting it from the external sources discussed above. This introduces our next important set of tools and processes that we collectively call the ingestion layer. Since data in different sources is stored in various ways and has different growth and update patterns, we need specialized mechanisms to extract it, potentially transform it on the fly, and load it into our data layer. Let's examine some of the most common approaches:
- Automated data ingestion using batch data integration tools like Airbyte or dltHub. This is a simple and convenient method to configure and run batch ELT when the data volumes are of moderate size. These tools have source connectors for most popular storage systems and can be a good starting point to load data into the Lakehouse.
- Streaming data integration from sources like the central message bus, which is usually based on Kafka. In this case, we can use Kafka Connect directly or build a custom application on Spark or Flink.
- Often, we need more than just a complete or incremental copy of all new records from a table in an external database – we need to sync all modifications and replicate the exact state of that table. For this purpose, we typically use a mechanism called Change Data Capture (CDC), combined with streaming data integration and a special "upsert" ingestion pattern (see the sketch after this list). One of the most popular tools for CDC is Debezium.
- Finally, we can implement completely custom batch and streaming ETL using Spark or Flink. This approach allows us to ingest data of virtually any volume while having full control to extract data from any source – in either complete or incremental mode – clean, transform, and prepare it as needed, and load it into the Lakehouse.
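As an illustration of the custom streaming path, here is a minimal PySpark Structured Streaming sketch that reads CDC events from a Kafka topic and applies them to an Iceberg table as upserts. The catalog configuration, topic, event schema, and table names are all assumptions made for the sake of the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType, TimestampType

spark = (
    SparkSession.builder.appName("orders-cdc-ingestion")
    # Register the Iceberg REST Catalog under the name "lakehouse" (placeholder config).
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "http://rest-catalog:8181")
    .getOrCreate()
)

# Hypothetical payload of a CDC event for the "orders" table.
payload = StructType([
    StructField("order_id", LongType()),
    StructField("country", StringType()),
    StructField("status", StringType()),
    StructField("amount", DoubleType()),
    StructField("updated_at", TimestampType()),
])

changes = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder brokers
    .option("subscribe", "orders.cdc")                # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), payload).alias("c"))
    .select("c.*")
)

def upsert_batch(batch_df, batch_id):
    # Apply each micro-batch as an upsert (MERGE INTO) against the Iceberg table.
    batch_df.createOrReplaceTempView("updates")
    batch_df.sparkSession.sql("""
        MERGE INTO lakehouse.db.orders AS t
        USING updates AS s
        ON t.order_id = s.order_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

(
    changes.writeStream.foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3://company-lakehouse/checkpoints/orders")  # placeholder path
    .start()
)
```

In a production setup, the checkpoint location, error handling, and schema evolution would of course need more care; the point here is only the overall shape of the CDC upsert pattern.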
Let's add the ingestion layer to our diagram:
Serving layer
We need the last principal piece – to serve the data that we now have in the platform. The set of mechanisms that address this can collectively be called the serving layer. The specifics of various ways to serve the data depend on the different use cases that need to be powered. Let's have a look:
- The most fundamental mechanism is to give access to the data via SQL, for example using the popular distributed query engine Trino (see the sketch after this list). This is a solid foundation for BI dashboards and reports, as well as for ad hoc data analysis.
- To serve Data Science, ML, and AI, we need to provide tools and frameworks that can process large volumes of data for model training. Distributed model training frameworks such as Spark ML, RayDP, XGBoost, and many others can facilitate this process.
- Beyond model training, data scientists and engineers often need to perform "in-process analytics" – analyzing a subset of data that fits into the memory of a regular Python application. For this reason, it's important to provide integration with embedded analytical tools such as DuckDB or Polars.
- Often companies have an existing DWH system based on cloud platforms like Snowflake, BigQuery, etc., and there are valid use cases for accessing data from the Lakehouse and joining it with the DWH's internal tables. For this reason, it's important to provide integration with external DWH systems. At this point, most cloud-based DWH systems offer such integration via so-called "external tables."
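As a small illustration of the SQL serving path and of in-process analytics, here is a sketch that queries the Lakehouse through the Trino Python client and then analyzes a local Parquet extract with DuckDB. The host, catalog, schema, table, and file paths are placeholders.

```python
import duckdb
import trino

# Query the Lakehouse via Trino (placeholder host, catalog, and schema).
conn = trino.dbapi.connect(
    host="trino.internal.example.com",
    port=443,
    user="analyst",
    http_scheme="https",
    catalog="lakehouse",
    schema="analytics",
)
cur = conn.cursor()
cur.execute(
    """
    SELECT order_date, country, revenue
    FROM daily_revenue
    WHERE order_date >= current_date - INTERVAL '30' DAY
    ORDER BY order_date
    """
)
rows = cur.fetchall()

# In-process analytics: crunch a modest Parquet extract entirely in memory with DuckDB.
duckdb.sql(
    """
    SELECT country, sum(revenue) AS revenue
    FROM 'exports/daily_revenue/*.parquet'
    GROUP BY country
    ORDER BY revenue DESC
    """
).show()
```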
Now, adding the serving layer to our diagram completes the cycle, showing that we already have a basic Lakehouse data platform architecture:
Transform layer
The ingestion and serving layers, though they play different roles in the platform, have something in common: both process data. Moreover, beyond simply ingesting data and serving it as is, we typically want to transform it (aggregate, expand, join), often multiple times. For this reason, the data layer usually has its own layered architecture, either based on DWH practices (Kimball, Inmon, Data Vault, Anchor modeling) or consisting of three loose sublayers (medallion). To address these transformation needs, we can group all data transformation tools and mechanisms into the transform layer. Let's briefly look at the components that form it:
- First of all, batch and streaming processing with Spark and Flink can be extended beyond ingestion into a general data processing and transformation component and moved into the transform layer. This is because we can use these technologies not only to ingest data into the platform but also to transform data within it and serve it for downstream use cases (see the sketch after this list).
- Distributed queries with Trino can also be moved into the transform layer, as it can be used for transformations in the data layer and even for ingestion from sources in certain situations – thanks to its rich set of connectors.
- Last but not least, we can use tools for SQL-based data transformations that help standardize this process. Currently, the most popular tool is dbt, which can work on top of Spark or Trino. This powerful mechanism makes data transformations more DWH-like and introduces standard engineering practices into SQL development.
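For illustration, here is a minimal sketch of an in-platform transformation expressed as Spark SQL, of the kind a dbt model running on Spark or Trino would typically generate and manage. The catalog, table names, and the aggregation itself are assumptions, and the full rebuild is chosen only for simplicity (incremental strategies are possible but omitted here).

```python
from pyspark.sql import SparkSession

# Same Iceberg catalog configuration as in the ingestion sketch above (placeholders).
spark = (
    SparkSession.builder.appName("daily-revenue-transform")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "http://rest-catalog:8181")
    .getOrCreate()
)

# A "raw -> derived" style transformation: rebuild a daily revenue table
# from the ingested orders table (hypothetical table names).
spark.sql(
    """
    INSERT OVERWRITE lakehouse.analytics.daily_revenue
    SELECT
        date(updated_at) AS order_date,
        country,
        count(*)         AS orders,
        sum(amount)      AS revenue
    FROM lakehouse.db.orders
    GROUP BY date(updated_at), country
    """
)
```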
Now, with a slight rearrangement of the architecture, our diagram will look like this:
Data governance layer
So far we have covered all layers and components of the Lakehouse data platform that constitute its backbone and provide us with tools and processes to ingest, store, transform, and serve data. However, there is one crucial aspect we haven't yet addressed: data governance. This is essential because beyond ensuring data flows efficiently through the system, we must also properly govern it – making it discoverable and observable, enforcing schema and quality guarantees, controlling access, classifying sensitive attributes, and more. Let's examine the main components of the data governance processes needed for the Lakehouse:
- First of all, we need to centralize all metadata to make data discoverable. For this purpose, there are frameworks called metadata platforms. The most popular open-source examples include DataHub, Amundsen, and OpenMetadata. These frameworks not only crawl and catalog your data but also provide a broad range of other functionalities for metadata management and data governance.
- Another crucial aspect is data observability. This is a broad area that includes various aspects such as data quality, lineage, freshness, monitoring of characteristics (e.g., volume, counts), and alerting for anomalies. The metadata platforms mentioned above provide functionalities to support most of these observability aspects. It's worth noting that there are specialized tools for data quality, such as deequ, Great Expectations, and Soda, as well as tools for data lineage, like OpenLineage.
- We have already covered a metadata service (the Iceberg REST Catalog) in the data layer of the platform, but there is an extension that we might want to add to it. We can call it a schema registry, similar to the Schema Registry in Confluent's Kafka platform. Its main purpose would be maintaining table definitions in a declarative format (e.g., Avro Schema, Protobuf, JSON Schema, or simply YAML) and helping to automatically provision the actual tables (see the sketch after this list). Additionally, the schema registry could manage schema validation and versioning, control change compatibility, generate schema-as-code in languages like Python or Scala for use in Spark, and potentially even store definitions of data contracts. To my knowledge, there are currently no open source solutions for this component.
- One more important aspect of data governance is access control. The most standard approach is role-based access control (RBAC), which essentially allows granting access to resources (tables, views) to principals (users, roles). All open source implementations of the Iceberg REST Catalog provide centralized access control mechanisms. They all support RBAC, though they employ slightly different approaches – for example, RBAC in Polaris, OpenFGA (RBAC/ABAC) in Lakekeeper, and RBAC/DAC in Gravitino. As for fine-grained access control (FGAC) – which limits the rows accessible to a particular user or role during a query – the catalog cannot enforce it, because row-level filtering can only be implemented by the query engine at runtime. To my knowledge, there is currently no out-of-the-box solution for this aspect.
- We also need to classify and mark sensitive data, such as financial information or personally identifiable information (PII), so that access to it can be restricted. We can call this component data classification. Some of the metadata platforms mentioned above support automated classification; however, this process typically requires additional curation.
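Since no off-the-shelf tool exists for the schema registry component described above, here is a purely hypothetical sketch of the idea: table definitions live as version-controlled YAML files and are provisioned through the Iceberg REST Catalog with PyIceberg. The YAML layout, the type mapping, and the catalog endpoint are all assumptions.

```python
import yaml
from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import NoSuchTableError
from pyiceberg.schema import Schema
from pyiceberg.types import DateType, DoubleType, LongType, NestedField, StringType, TimestampType

# Minimal mapping from declarative YAML types to Iceberg types (illustrative only).
TYPE_MAP = {
    "long": LongType(),
    "double": DoubleType(),
    "string": StringType(),
    "date": DateType(),
    "timestamp": TimestampType(),
}


def provision_table(yaml_path: str) -> None:
    """Create the Iceberg table described by a YAML definition if it does not exist yet."""
    # Expected layout (hypothetical):
    #   table: analytics.daily_revenue
    #   columns:
    #     - {name: order_date, type: date, required: true}
    #     - {name: country,    type: string}
    #     - {name: revenue,    type: double}
    with open(yaml_path) as f:
        spec = yaml.safe_load(f)

    fields = [
        NestedField(i + 1, c["name"], TYPE_MAP[c["type"]], required=c.get("required", False))
        for i, c in enumerate(spec["columns"])
    ]

    catalog = load_catalog("lakehouse", uri="http://rest-catalog:8181")  # placeholder endpoint
    try:
        catalog.load_table(spec["table"])
    except NoSuchTableError:
        catalog.create_table(spec["table"], schema=Schema(*fields))
```

In a real setup, such a script would run in CI whenever a definition changes and would additionally validate schema evolution and compatibility rules before applying them.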
At this point, we have completed all layers of our Lakehouse data platform:
Complete architecture
Now that we've explored all layers and components, we can assemble everything to present the complete architecture of the Lakehouse data platform. This architecture encompasses data sources, supported use cases, and the underlying infrastructure. Here is the final diagram:
Conclusion
In this post, we've examined the architecture of the Lakehouse data platform. We've explored the use cases it enables, the common data sources it must support, and the general infrastructure it requires. Most importantly, we've outlined what the platform actually is – how it's organized into layers and components, and what their purpose and functionality are. This represents a high-level blueprint of a typical Lakehouse system, with each component requiring individual attention. In future posts, I'll address some of these components separately and demonstrate how they can be implemented in practice.