Lakehouse series: Evolution of analytical data platforms

May 12, 2025

Introduction

In recent years, the term "Lakehouse" has become quite a buzzword, and many companies have been building one internally. Before we discuss what it is and how it is organized, it's worth taking a brief look at the history and evolution of analytical data platforms. This will help us understand why things developed in certain ways and reveal the major challenges the data community had to overcome while developing a general-purpose data platform that could serve a variety of use cases.

The evolution of analytical data platforms can be marked by three major milestones. The first was the emergence of the Data Warehouse – a database architecture enabling BI use cases and providing businesses with oversight of critical processes and key metrics. The second was the rise of the Data Lake – a Hadoop-based and later cloud-based platform that could store all types of data while scaling out both storage and processing. The third was the advent of the Lakehouse – a multi-functional, general-purpose data platform that unified the benefits and advantages of both Data Warehouse and Data Lake architectures.

As I explain the history of analytical data platforms, I will walk you through many open-source tools, frameworks, and vendors that were crucial to this evolution. While the numerous names and references may seem overwhelming, it's essential to examine the key technologies that the community developed to address limitations in data platform architecture over time. I will strive to remain objective and avoid promoting any particular framework or vendor, though I may occasionally show slight preferences.

Traditional data warehouse

Relational database management systems (RDBMS) have existed since the 1970s – early examples include Oracle and Microsoft SQL Server, followed later by MySQL and PostgreSQL. While they served operational workloads well, they were not well suited for analytics. The main problems were:

  • Data models poorly suited for analytical queries
  • Inability to perform comprehensive analysis due to data spread across multiple databases
  • Risk of overloading production databases with resource-intensive analytical queries
  • Poor performance for analytical queries due to row-based storage and lack of analytical optimizations

Due to these limitations, analytical databases began emerging in the 1980s under the name Data Warehouse (DWH). Data from various operational systems would be consolidated into a central DWH, where analysts and BI engineers could query and join it. Two major architectural approaches emerged, pioneered by Bill Inmon and Ralph Kimball. Both approaches focused on ingesting operational data into the central DWH and transforming data from various sources into structures optimized for analytical queries. Some of the most well-known vendors that provided DWH solutions at that time were Oracle, Teradata, IBM, and Informix.

To summarize, here are the main properties of traditional data warehouse systems:

  • They primarily stored structured data. While they supported binary objects, the system was not optimized for them in terms of performance or cost.
  • They used proprietary storage formats, and users could not control or customize how data was physically organized.
  • They provided SQL with ACID transaction guarantees as their only API. While SQL was excellent for analytics and BI, it did not allow for custom processing, such as ML training.
  • Data models followed one of three approaches:
    • Ad hoc: Data is stored in multiple tables without a systematic approach to relationships or normalization levels. This typically includes just a staging layer for raw data and a layer for derived tables.
    • Inmon layered architecture: A three-layer system consisting of staging, normalized (3NF) business model, and data marts. The data marts use dimensional modeling (star/snowflake schema) to optimize data for specific business use cases.
    • Kimball layered architecture: A two-layer system consisting of staging and dimensional data marts, with the marts built directly from staging data without an intermediate normalized layer.
  • On-premises setup: Storage, compute, metadata, governance, and orchestration were tightly coupled within a single cluster. Scaling required manual intervention – adding servers, upgrading hardware, creating new shards, and rebalancing tables.
  • Data ingestion relied on internal mechanisms like INSERT or bulk COPY/LOAD commands, with limited or no support for streaming ingestion.

Traditional data warehouses primarily supported two main use cases: ad hoc analytics and BI dashboards. SQL was – and still remains – the primary interface for processing and extracting data. However, as data volumes grew and machine learning use cases emerged, new technologies became necessary to address the limitations of traditional data warehouses.
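
Before moving on, a minimal sketch may help make the dimensional modeling mentioned above concrete: a tiny star-schema data mart and the kind of aggregate query a BI tool would issue against it. All table names, columns, and values are purely illustrative, and SQLite stands in for a real warehouse engine.

```python
import sqlite3

# Illustrative star schema: one fact table and two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, amount REAL);

    INSERT INTO dim_date    VALUES (20250101, 2025, 1), (20250201, 2025, 2);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales  VALUES (20250101, 1, 30.0), (20250101, 2, 50.0), (20250201, 1, 20.0);
""")

# A typical dimensional query: join the fact table to its dimensions and aggregate.
for row in con.execute("""
    SELECT d.year, d.month, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, d.month, p.category
    ORDER BY d.year, d.month, p.category
"""):
    print(row)
```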

Data lake

Let's recap the key limitations of traditional data warehouses:

  • They were designed for structured data only, handling other data types poorly
  • They offered SQL as the sole interface for data access and manipulation
  • They had limited scalability and extensibility

To address these limitations, a new wave of technologies emerged. Between 2003 and 2006, Google introduced several groundbreaking ideas: the Google File System [1], MapReduce [2], and Bigtable [3]. This marked a revolutionary shift as data management became distributed and parallel. In 2006, Hadoop was released as an open source project, and Yahoo began running it in production. The following year, Yahoo published their paper on HDFS [4, 4a]. This period sparked a rapid expansion of tools in the Hadoop ecosystem.

Hadoop represented a major breakthrough that gave rise to the term "Big Data." It fundamentally shifted data processing from a few powerful servers with limited storage and processing capacity to virtually unlimited scalability through clusters of commodity machines.

A significant milestone was Facebook's release of Hive in 2008. Hive added SQL capabilities to big data processing and introduced the Hive Metastore (HMS) – a metadata service that tracked table schemas, partitions, and statistics. This innovation made Hadoop accessible to non-engineers and represented the first convergence between big data tools and traditional data warehouses.

In the same year, the NoSQL storage system HBase (inspired by Google's Bigtable) was started as a Hadoop subproject; it became a top-level Apache project in 2010. HBase expanded Hadoop's capabilities from analytical to operational workloads while maintaining horizontal scalability. This marked another significant development in the Hadoop ecosystem.

During this period, several Hadoop vendors emerged in the market. The main players were Cloudera (2008), MapR (2009), and Hortonworks (2011). Eventually, Hortonworks merged with Cloudera, which continues to adapt to the evolving ecosystem, while the other Hadoop vendors have largely disappeared.

In 2011, Cloudera released Hue, a tool that provided a UI for running and navigating queries in the Hadoop ecosystem. It became the de facto standard interface in the years that followed.

In 2012, Yahoo released Hadoop's first job orchestrator, Oozie [5], followed by Spotify's release of Luigi around the same time. Three years later, in 2015, Airbnb introduced their scheduler, Airflow. These tools marked a significant advancement in the big data ecosystem, enabling the construction of data ingestion and transformation pipelines.

2013 was a pivotal year when accumulated experience with Hadoop's challenges led to significant breakthroughs. The YARN paper [6] was published and Hadoop 2.0 was released, introducing a decoupled and more flexible resource management architecture. Spark [7], which began in 2010 as a way to simplify MapReduce-style algorithms, became an Apache project and emerged as a versatile data transformation framework for ETL workflows. Cloudera released Impala [8] to overcome Hive's limitations by executing queries without relying on Hadoop's MapReduce. Facebook introduced Presto [9] as a faster alternative to Hive for ad hoc SQL queries, with a connector architecture that decoupled query execution from any particular storage system.

Additionally, in the same year, two new open file formats designed for analytical data processing were released as open source projects. Twitter and Cloudera introduced Parquet, while Facebook and Hortonworks introduced ORC. These similar columnar file formats offered key features for analytical workloads, including column projection, compression, splittability, complex types, schema evolution, and predicate pushdown.
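
As a hedged illustration of two of these features, the sketch below writes a tiny Parquet file with PyArrow and reads it back using column projection and a pushed-down predicate; the file name and columns are made up.

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Write a small columnar file (in practice produced by Spark, Hive, etc.).
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["DE", "US", "DE", "FR"],
    "amount":  [10.0, 25.5, 7.0, 3.2],
})
pq.write_table(table, "events.parquet")

# Column projection: only user_id and amount are decoded from disk.
# Predicate pushdown: the filter is checked against row-group statistics,
# so row groups that cannot match are skipped without being read.
dataset = ds.dataset("events.parquet", format="parquet")
result = dataset.to_table(
    columns=["user_id", "amount"],
    filter=ds.field("amount") > 5.0,
)
print(result.to_pydict())
```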

This marked the period when the notion of Data Lake gained popularity. As the Hadoop ecosystem grew and matured, more companies began building data lakes to overcome the limitations of traditional data warehouses. The development of frameworks for data processing and querying, job orchestration, and columnar file formats made the data lake a common approach for advancing companies' analytical infrastructure.

Alongside the emergence and maturation of big data technologies, another crucial development was taking place – the rise of cloud platforms. AWS launched in 2006, with Azure and GCP following in the years after. The significance of these platforms lay in their two core services: object storage and compute instances (virtual machines). Object storage provided a foundation for decoupling data from the cluster into separate infrastructure, making it inherently durable and scalable. Meanwhile, compute instances separated data processing into an independent layer of virtual machines that could scale horizontally to handle workloads of any size.

The combination of new data processing and storage frameworks with elastic cloud platform capabilities enabled management of virtually unlimited data volumes and on-demand processing through decoupled, horizontally scalable compute infrastructure. Object storage wasn't restricted to structured data – it could accommodate any file type, from Parquet, ORC, and Avro to CSV, JSON, text, images, and even video.

The architecture of data lakes started to become predominantly cloud-driven. While companies maintained their traditional data warehouses and moved some of their workloads into Hadoop, they began using object storage alongside them. They then built data pipelines using Spark, Presto, or Hive to process this data not only for analytics but also for other purposes, such as machine learning model training.
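
A minimal sketch of such a pipeline, assuming PySpark with S3 connectivity (the hadoop-aws package and credentials) already configured, might look as follows; the bucket paths and field names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-events-to-parquet").getOrCreate()

# Schema-on-read: the raw JSON files carry no enforced schema;
# Spark infers one (or applies an explicit one) at read time.
raw = spark.read.json("s3a://example-raw-bucket/events/2025/05/")  # hypothetical path

daily = (
    raw.where(F.col("event_type") == "purchase")   # hypothetical fields
       .groupBy("event_date", "country")
       .agg(F.sum("amount").alias("revenue"))
)

# Persist the derived dataset back to object storage as columnar Parquet files.
daily.write.mode("overwrite").parquet("s3a://example-curated-bucket/daily_revenue/")
```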

Here are the key characteristics of a cloud-based Data Lake system:

  1. Ability to store structured, semi-structured, and unstructured data
  2. Support for open storage formats (Parquet, ORC, Avro, CSV, JSON, text, images, etc.)
  3. Distributed processing engines for data ingestion, processing, and querying (Hadoop MapReduce, Hive, Spark, Impala, Presto)
  4. Orchestration frameworks (Oozie, Luigi, Airflow)
  5. Cloud-based architecture
    1. Complete decoupling of storage and compute
    2. Cost-effective, virtually unlimited storage
    3. Horizontal scalability with elastic compute resources
  6. Limited or non-existent metadata management, with basic table definitions in the Hive Metastore
  7. Lack of formal data modeling – datasets exist independently with schema-on-read approach
  8. No support for ACID transactions
  9. Absence of governance features for managing data quality, lineage, completeness, and freshness

Data Lake use cases emerged as data volumes kept growing. This growth stemmed from two main factors. First, data warehouses were reaching their capacity limits as they needed to store longer historical records and more business dimensions. Second, organizations began collecting more diverse types of data – machine-generated data became a valuable information source, while ML applications for text, image, and video data became increasingly common.

The "3 Vs of big data" (volume, velocity, variety) was a popular buzzword during this period. The Data Lake, initially built on Hadoop and later on cloud infrastructure, represented a crucial step in bridging the gap between traditional data warehouses and the growing need to extract value from expanding volumes and types of data. While Data Lakes solved the limitations of data warehouses regarding variability and scalability, they introduced their own constraints – lacking the robust features of data warehouses such as proper metadata management, CRUD operations, ACID transactions, and data governance. Addressing these limitations and combining the strengths of both systems required the emergence of next-generation technologies.

Lakehouse data platform

Let's recap the key limitations of the Data Lake platforms:

  • In Hadoop-based systems, scalability was restricted to cluster capacity, lacking true elasticity
  • Limited metadata capabilities (even with Hive Metastore), relying on schema-on-read with no formal data modeling
  • Absence of ACID transaction guarantees
  • Minimal data governance features requiring extensive manual intervention
  • Relatively immature processing engines, file formats, and supporting frameworks

Development continued and new technologies emerged to address these limitations. The most crucial missing piece was a mechanism for reading and writing data in an atomic way, along with the robust metadata management capabilities found in traditional data warehouse systems. The time had come to move beyond files and introduce table formats.

In 2015, Uber initiated project Hudi, creating the first table format. Hudi was open-sourced in 2017 and became an Apache project in 2020. Databricks developed Delta internally in 2017 and later open-sourced it as Delta Lake [10] under the Linux Foundation in 2019. Also in 2017, Netflix launched project Iceberg, which they open-sourced in 2018 and it was promoted to Apache in 2020. This period marked a major leap forward in bringing metadata management and ACID guarantees to the big data world.
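
To give a feel for what a table format adds on top of plain files, here is a hedged sketch using Apache Spark with Apache Iceberg: a table is created in a local filesystem catalog and then upserted atomically with MERGE INTO. The package version, catalog name, warehouse path, and table names are assumptions; Delta Lake and Hudi offer analogous APIs.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-acid-demo")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")  # assumed version
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")  # local stand-in for object storage
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS demo.db.customers (id BIGINT, email STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.customers VALUES (1, 'a@example.com'), (2, 'b@example.com')")

# An atomic upsert: the whole MERGE either commits as a new table snapshot or not at all.
spark.sql("""
    MERGE INTO demo.db.customers t
    USING (SELECT 2 AS id, 'b2@example.com' AS email
           UNION ALL SELECT 3, 'c@example.com') s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.email = s.email
    WHEN NOT MATCHED THEN INSERT *
""")

spark.sql("SELECT * FROM demo.db.customers ORDER BY id").show()
```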

Apache Spark served as the primary reference engine during this period and had connectors implemented early on. However, many other frameworks – including Trino (formerly PrestoSQL), Apache Kafka [11], Apache Flink [12], and Pandas [13] – took more time to develop their integrations. Today, this challenge has largely been resolved, with connectors becoming stable and mature across frameworks. Additionally, newer tools like DataHub and Airbyte have recently developed their own integrations.

One crucial aspect of the Iceberg project deserves special attention. Beyond its role as a table format, it includes an API specification for a metadata service called the Iceberg REST Catalog. This API can be implemented by any party and serve as a central metadata service: instead of engines working directly with object storage, they communicate through this service. It effectively replaces the Hive Metastore, but at a qualitatively higher level, with native support for the Iceberg table format. This metadata service delivers several critical functions (a minimal engine-side configuration sketch follows the list):

  • Industry-standard REST API with simple client-side access
  • Serves as an enhanced replacement for HMS, managing namespace and table metadata
  • Handles conflict resolution for concurrent table writes
  • Provides centralized credential management and role-based access control for all engines
  • Has built-in TLS and token-based authentication
  • Potential to support multi-table transactions and other advanced functionalities in the future
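
Below is a minimal, hedged sketch of what this looks like from an engine's perspective: a Spark session pointed at an Iceberg REST catalog endpoint. The URI, warehouse name, and token are placeholders, and the exact properties depend on the catalog implementation.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("rest-catalog-demo")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")  # assumed version
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com/iceberg")  # placeholder endpoint
    .config("spark.sql.catalog.lake.warehouse", "analytics")                      # placeholder warehouse
    .config("spark.sql.catalog.lake.token", "<access-token>")                     # placeholder credential
    .getOrCreate()
)

# Every engine that talks to the same REST catalog sees the same
# namespaces, tables, and snapshots.
spark.sql("SHOW NAMESPACES IN lake").show()
```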

In 2024, several open-source implementations emerged. Snowflake released Polaris, while two independent implementations – Lakekeeper and Gravitino – were also launched. Databricks open-sourced Unity Catalog, which aims to be a polyglot catalog supporting all table formats. While these implementations are still in their early stages and their future evolution remains uncertain, the rapid development in this space is both promising and exciting.

However, a modern data platform required more than just a table format with a metadata service – it needed a range of tools and frameworks to form a comprehensive ecosystem. Let's explore the key technologies that have been developed in recent years.

Perhaps the most foundational one is Kubernetes [14] (or K8s for short), which Google announced and open-sourced in 2014 as a new container orchestration and management system. Following its 1.0 release in 2015, it quickly became the dominant platform for containerizing cluster resources and running distributed applications. Various data frameworks began adopting K8s as their underlying operational layer; for example, Apache Spark introduced K8s as a new scheduler backend in 2018. Today, virtually all data systems and frameworks have been containerized, significantly simplifying the use of compute resources in the cloud.
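
As a rough illustration (not a production recipe), the sketch below configures a Spark session to run its executors on a Kubernetes cluster. The API server URL, namespace, and container image are placeholders; in practice this is usually driven through spark-submit or a Spark operator.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-etl")
    .master("k8s://https://kubernetes.example.internal:6443")               # placeholder API server
    .config("spark.kubernetes.namespace", "data-platform")                  # placeholder namespace
    .config("spark.kubernetes.container.image", "registry.example.com/spark:3.5.1")  # placeholder image
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

# From here on, DataFrame jobs run on executor pods scheduled by Kubernetes.
```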

Several modern data ingestion tools emerged during this period. In 2014, NiFi, a no-code UI-based data ingestion framework, was open-sourced and quickly became an Apache project. In late 2015, Debezium launched as an open source project and established itself as the de facto standard for Change Data Capture (CDC) across databases and other systems. In 2016, dbt (Data build tool) was introduced as an open source project to enhance SQL with software engineering best practices, including modularity, reusability, dependency management, versioning, and testing. More recently, in 2020, Airbyte launched as an open source project and rapidly became the preferred solution for data movement across diverse systems.

To address Airflow's limitations, a new open source orchestration framework, Dagster, launched in 2018, offering a data-oriented, declarative, and developer-friendly approach. In the same year, another orchestration tool, Prefect, emerged as an open source project, introducing a more Python-native, developer-friendly execution model that reduced the need for explicit DAG definitions. Meanwhile, Airflow itself has continued to evolve and mature through versions 2.0 and 3.0, adding numerous features and resolving many of its original limitations.
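
To show what a data-oriented, declarative style looks like in practice, here is a small hedged sketch using Dagster's asset API; the asset names and logic are made up, and the dependency between the two assets is derived from the function signature.

```python
from dagster import Definitions, asset, materialize

@asset
def raw_orders():
    # In a real pipeline this would ingest from an operational source.
    return [{"order_id": 1, "amount": 30.0}, {"order_id": 2, "amount": 50.0}]

@asset
def daily_revenue(raw_orders):
    # Downstream asset: Dagster infers the dependency from the parameter name.
    return sum(order["amount"] for order in raw_orders)

defs = Definitions(assets=[raw_orders, daily_revenue])

if __name__ == "__main__":
    # Materialize both assets locally; normally a schedule or sensor triggers this.
    materialize([raw_orders, daily_revenue])
```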

Several data visualization and BI tools emerged and matured during this period. Apache Superset, launched as an open source project in 2016, became a top-level Apache project in 2021 and offers powerful, flexible analytics for data analysts and engineers. Metabase, which began as an open source project in 2015, provides a streamlined BI experience tailored for business users. Today, both tools have reached maturity and are widely adopted for production BI use cases across many companies.

Numerous data governance tools and frameworks have emerged in recent years. Notable examples include data quality and contract definition tools like deequ (2018) [15], Great Expectations (2018), and Soda (2022). The OpenLineage project (2021) works to standardize job and dataset lineage tracking. For data cataloging and discovery, platforms such as DataHub (2019), Amundsen (2019), and OpenMetadata (2021) have been developed. The ecosystem also includes the general-purpose Open Policy Agent (2016) and the authorization framework OpenFGA (2022). Additionally, metadata services implementing the Iceberg REST Catalog typically provide role-based access control (RBAC).

A wide range of ML frameworks supporting distributed learning has emerged. Notable examples include Spark MLlib, XGBoost, PyTorch (DDP), TensorFlow, Horovod, Ray, Accelerate, Dask, Daft, and Petastorm. With AI/ML being such a hot topic today, it's not surprising that so many tools are emerging and evolving rapidly. These tools quickly integrate with existing systems and frameworks in the data ecosystem.

Cloud services have matured significantly over the past decade. AWS S3, arguably the most mature object storage service, has introduced several major improvements: higher request-rate limits per prefix, removal of the need for key-prefix randomization to avoid hot spots, the Intelligent-Tiering storage class, S3 Glacier Instant Retrieval, and strong read-after-write and list-after-write consistency. These enhancements, among many others, have established S3 as a stable and robust foundation for cloud-based data platforms.

These developments have created a rich ecosystem of well-integrated tools that provide a solid foundation for building modern data platforms. The main limitations of both traditional data warehouses and data lakes have been effectively resolved, and the best aspects of both worlds have converged into what we now call the Lakehouse data platform [16]. This evolution will continue, and the coming years will bring new milestones toward a fully decomposed, open-source-based data platform. Let's examine the main properties and capabilities of the Lakehouse concept, considering the current state of its essential tools:

  1. Support for structured, semi-structured, and unstructured data
    1. Rich type system for tabular data
    2. Native support for semi-structured data with the variant type
    3. Advanced data organization through partitioning, bucketing, and sorting to enable efficient partition pruning and data skipping
  2. Open file and table formats serving as the foundation for data and metadata management
  3. Centralized metadata service providing ACID guarantees and serializable snapshot isolation
  4. Support for diverse APIs: SQL, DataFrames, and programming language SDKs
  5. Flexible data ingestion through Apache Spark, Apache Flink, Trino, dbt, Airbyte, Debezium, and other tools
  6. Cloud-native architecture with decoupled storage and compute, offering full elasticity and horizontal scalability
  7. Containerized compute infrastructure
  8. Data model
    1. Hybrid approach using layers (staging, cleaned, aggregated)
    2. Limited adoption of traditional relational or dimensional modeling
  9. Emerging data governance with tools for access control, data quality, lineage, completeness, and freshness
  10. Modern orchestration frameworks that are data-oriented, event-driven, well-integrated with processing engines, and run on Kubernetes
  11. Comprehensive support for data retrieval and deletion to comply with GDPR/CCPA and similar regulations
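
To illustrate the partitioning and GDPR-deletion points above, here is a hedged sketch that reuses the Spark session with the Iceberg catalog from the earlier example: it creates a table partitioned by day and bucketed by user id, then services an erasure request with a plain row-level DELETE. Table and column names are made up.

```python
# A table partitioned by day and bucketed by user id, so both time-range
# queries and per-user lookups can prune files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        user_id  BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts), bucket(16, user_id))
""")

# A GDPR/CCPA-style erasure request: a row-level delete committed as an
# atomic snapshot, with no manual file rewriting.
spark.sql("DELETE FROM demo.db.events WHERE user_id = 4242")
```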

The Lakehouse data platform supports various use cases. First, it enables traditional analytics for BI and ad hoc data analysis. Second, it facilitates ML training on large-scale data through its infrastructure and distributed learning frameworks. Third, near-real-time stream processing – previously more common in the operational domain with tools like Apache Kafka and Apache Flink – is becoming a first-class citizen of the Lakehouse platform. Finally, it supports customer-facing, low-latency, data-driven applications. The latter two use cases blur the line between the OLTP and OLAP worlds, signaling an era in which a central data platform can serve as the foundational element of a company's entire infrastructure.

Modern SaaS Data Warehouse and Lakehouse platforms

I would also like to describe an alternative approach that has emerged in recent years: cloud-native Data Warehouse and Lakehouse platforms based on the SaaS model. Several vendors now provide such offerings, including Snowflake, Databricks, Dremio, Starburst, ClickHouse Cloud, Firebolt, Google BigQuery, Amazon Redshift, and Azure Synapse. Their approach contrasts with the open-source-based Lakehouse platform: they usually offer a hybrid solution that employs some of the open-source frameworks mentioned above while keeping most of the functionality proprietary. Many of them are effectively SQL-oriented data warehouse solutions that have overcome the core limitation of traditional data warehouses – the coupling of storage and compute that prevents elastic horizontal scaling. Loosely, the main properties and capabilities of these systems can be described as follows:

  1. Out-of-the-box data warehouse or lakehouse solution with simple, rapid onboarding
  2. Cloud-native architecture with elastic, horizontal scalability
  3. Decoupled storage and compute layers
  4. Proprietary internal components (depending on the vendor): storage format, query engine, job orchestration, data governance tools
  5. Integration capabilities with external object storage and table formats like AWS S3 and Iceberg
  6. Support for streaming ingestion and dynamic tables
  7. A comparatively high cost structure
  8. Significant vendor lock-in with dependence on vendor-driven feature development

These systems compete with the open-source Lakehouse approach by providing platforms that are easy to use and readily address BI as well as other use cases, depending on the capabilities of the platform. This can be a clear advantage, particularly for companies without strong engineering capacity and know-how that need a stable data warehouse or lakehouse solution. However, these systems are limited in the use cases they support and create vendor dependency through their proprietary functionality. Both approaches therefore serve slightly different niches and have valid reasons to exist.

Conclusion

In this post, we have explored the history of analytical data platforms. We identified three major milestones: the traditional Data Warehouse, the Hadoop-based (and later cloud-based) Data Lake, and finally, the Lakehouse data platform. We examined how various tools and frameworks shaped this evolution, and how the data community addressed ecosystem limitations by continuously improving existing tools and creating new ones.

I have omitted several technologies from detailed discussion, including such Apache projects as Pig, Avro, Arrow, Drill, Druid, Calcite, DataFusion, Kafka, Flink, as well as general tools like Vault and Terraform, and many others. While aiming to be comprehensive, my goal was not simply to list every technology in the big data ecosystem. Instead, I focused on painting a holistic and cohesive picture of this evolution, highlighting the key tools that marked significant milestones along the way.

In conclusion, we now have most of the necessary ingredients to build a centralized, open-source driven, cloud-native, scalable, and extensible Lakehouse data platform. This platform can support all use cases that rely on large data volumes – from BI dashboards and ad hoc analysis to ML/AI model training, stream processing, and low-latency customer-facing applications. Clearly, the development and evolution of tools needed to build such platforms will continue in the near future.

In the next post, I will explore the actual architecture of the Lakehouse. I will examine its overall structure, core building blocks, and how these components relate to each other. Later, I plan to provide detailed descriptions of these components and their interactions, demonstrating their implementation and operation through practical examples.

References