Explaining Data Lakehouse as Cloud-native DW

In this article I focus on how the data lakehouse architecture compares with the classic data warehouse architecture. I imagine the data lakehouse architecture as an attempt to implement some of the core requirements of data warehouse architecture in a modern, cloud-native design. I will explore the advantages of cloud-native design, including the ability to dynamically provision resources in response to specific events, predetermined patterns, and other triggers. I also explore data lakehouse architecture as its own unique approach to addressing new or different types of practices, use cases, and consumers.

In an important sense, data lakehouse architecture is an effort to adapt the data warehouse and its architecture to the cloud, while also addressing a larger set of novel use cases, practices, and consumers. This claim is not as counterintuitive or daunting as it may seem. We can think of data warehouse architecture as a technical specification that enumerates and describes the set of requirements (features and capabilities) that the ideal data warehouse system must address, but does not specify how to design or implement the data warehouse. Designers are free to engineer their own novel implementations of the warehouse, such as what Joydeep Sen Sarma and Ashish Thusoo attempted with Apache Hive, a SQL interpreter for Hadoop, or what Google did with BigQuery, its NoSQL query-as-a-service offering.

The data lakehouse is a similar example. If a data lakehouse implementation addresses the set of requirements specified by data warehouse architecture, it can be considered a data warehouse.

In the What is Data Lakehouse? – Unstructured Data Quick Tips (unstructureddatatips.com), we saw that data lakehouse architecture differs from the monolithic design of classic data warehouse implementations and the more tightly coupled designs of big data-era platforms like Hadoop+Hive or PaaS warehouses like Snowflake.

So, how is data lakehouse architecture different and why?

Adapting Data Warehouse Architecture to Cloud

The classic implementation of data warehouse architecture is based on outdated expectations, especially regarding how the warehouse’s functions and resources are instantiated, connected, and accessed. For example, early implementers of data warehouse architecture expected the warehouse to be physically implemented as an RDBMS and for its components to connect to each other using a low-latency, high-throughput bus. They also expected SQL to be the only way to access and manipulate data in the warehouse.

Another expectation was that the data warehouse would be online and available all the time, and its functions would be tightly coupled to each other. This was a feature of its implementation in an RDBMS, but it made it impractical (and impossible) to scale the warehouse’s resources independently.

None of these expectations are true in the cloud. We are familiar with the cloud as a metaphor for virtualization, which is the use of software to abstract and define various virtual resources, and for the scale-up/scale-down elasticity that is a defining characteristic of the cloud.

However, we may not spend as much time thinking about the cloud as a metaphor for event-driven provisioning of virtualized hardware, and the ability to provision software in response to events.

This on-demand dimension is arguably the most important practical benefit of the cloud’s elasticity and a significant difference between the data lakehouse and the classic data warehouse.

The Data Lakehouse as Cloud-native Data Warehouse

Event-driven design on this scale requires a different set of hardware and software requirements, which cloud-native software engineering concepts, technologies, and methods address. Instead of monolithic applications that run on always-on, always-available, physically implemented hardware resources, cloud-native design allows developers to instantiate discrete software functions as loosely coupled services in response to specific events. These loosely coupled services correspond to the functions of an application, and applications are composed of these loosely coupled services, like the data lakehouse and its layered architecture.

What makes the data lakehouse cloud-native? It is cloud-native when it decomposes most, if not all, of the software functions implemented in data warehouse architecture. These functions include:

        • One or more functions that can store, retrieve, and modify data;
        • one or more functions that can perform various operations (such as joins) on data;
        • one or more functions that expose interfaces for users and jobs to store, retrieve, modify data and specify different types of operations to perform on data;
        • one or more functions that manage and enforce data access and integrity safeguards;
        • one or more functions that generate or manage technical and business metadata;
        • one or more functions that manage and enforce data consistency safeguards when two or more users/jobs try to modify the same data simultaneously or when a new user and job tries to update data currently being accessed by prior users/jobs.

Using this as a guideline, we can say that a “pure” or “ideal” implementation of data lakehouse architecture would include:

      • The lakehouse service itself, which in addition to SQL query provides metadata management, data federation, and data cataloging capabilities. It also serves as a semantic layer by creating, maintaining, and versioning modeling logic, such as denormalized views applied to data in the lake.
      • The data lake, which at minimum provides schema enforcement and the ability to store, retrieve, modify, and schedule operations on objects/blobs in object storage. It also usually provides data profiling and discovery, metadata management, data cataloging, data engineering, and optionally data federation capabilities. It enforces access and data integrity safeguards across its zones and ideally generates and manages technical metadata for the data in these zones.
      • An object storage service that provides a scalable, cost-effective storage substrate and handles the work of storing, retrieving, and modifying data stored in file objects.

There are different ways to implement the data lakehouse. One option is to combine all these functions into a single omnibus platform, a data lake with its own data lakehouse, like what Databricks, Dremio, and others have done with their data lakehouse implementations.

Why Does Cloud-native Design Matter?

This raises some obvious questions. Why do this? What are the advantages of a loosely coupled architecture compared to the tightly integrated architecture of the classic data warehouse? As mentioned, one benefit of loose coupling is the ability to scale resources independently of each other, such as allocating more compute without adding storage or network resources. It also eliminates some dependencies that can cause software to break, so a change in one service will not necessarily impact or break other services, and the failure of a service will not necessarily cause other services to fail or lose data. Cloud-native design also uses mechanisms like service orchestration to manage and address service failures.

Another benefit of loose coupling is the potential to eliminate dependencies from reliance on a specific vendor’s or provider’s software. If services communicate and exchange data with each other solely through publicly documented APIs, it should be possible to replace a service that provides a set of functions (like SQL query) with an equivalent service. This is the premise of pure or ideal data lakehouse architecture, where each component is effectively commoditized (with equivalent services available from major cloud infrastructure providers, third-party SaaS and/or PaaS providers, and as open-source offerings) and reduces the risk of provider-specific lock-in.

The Data Lakehouse as Event-driven Data Warehouse

Cloud-native software design also expects the provisioning and deprovisioning of the hardware and software resources for loosely coupled cloud-native services to happen automatically. In other words, provisioning a cloud-native service means provisioning its enabling resources, and terminating a cloud-native service means to deprovision these resources. In a way, cloud-native design wants to make hardware and to some extent software disappear as variables in deploying, managing, maintaining, and especially scaling business services.

From the perspective of consumers and expert users, there are only services – tools that do things.

For example, if an ML engineer designs a pipeline to extract and transform data from 100 GBs of log files, a cloud-native compute engine will dynamically provision compute instances to process the workload. Once the engineer’s workload finishes, the engine will automatically terminate these instances.

Ideally, neither the engineer nor the usual IT support people (DBAs, systems and network administrators, so forth) need to do anything to provision these compute instances or the software and hardware resources they depend on. Instead, this all happens automatically – for example, in response to an API call initiated by the engineer. The classic on-premises data warehouse was not designed with this kind of cloud-native, event-driven computing paradigm in mind.

The Data Lakehouse as Its Own Thing

The data lakehouse is supposed to be its own thing, providing the six functions listed above. However, it depends on other services – specifically, an object storage service and optionally a data lake service – to provide basic data storage and core data management functions. In addition, data lakehouse architecture implements novel software functions that have no obvious parallel in classic data warehouse architecture and are unique to the data lakehouse. These functions include:

      • One or more functions that can access, store, retrieve, modify, and perform operations (like joins) on data stored in object storage and/or third-party services. The lakehouse simplifies access to data in Amazon S3, AWS Lake Formation, Amazon Redshift, so forth
      • One or more functions that can discover, profile, catalog, and/or facilitate access to distributed data stored in object storage and/or third-party services. For example, a modeler creates denormalized views that combine data stored in the data lakehouse and in the staging zone of an AWS Lake Formation (a data lake), and designs advanced models incorporating data from an Amazon Redshift sales data mart.

However, in this respect, the lakehouse is not different from a PaaS data warehouse service, which we will explore in depth in future articles.

What is Data Lakehouse?

What is Data Lakehouse?

This article on the data lakehouse will aim to introduce the data lakehouse and describe what is new and different about it.

The Data Lakehouse Explained

The term “lakehouse” is derived from the two foundational technologies the data lake and the data warehouse. Lakehouse is a concept or data paradigm that can be built using different set of technologies to fulfill the objectives.

At a high level, the data lakehouse consists of the following components:

      • Data lakehouse
      • Data lake
      • Object storage

The data lakehouse describes a data warehouse-like service that runs against a data lake, which sits on top of an object storage. These services are distributed in the sense that they are not consolidated into a single, monolithic application, as with a relational database. They are independent in the sense that they are loosely coupled or decoupled — that is, they expose well-documented interfaces that permit them to communicate and exchange data with one another. Loose coupling is a foundational concept in distributed software architecture and a defining characteristic of cloud services and cloud-native design.

How Does the Data Lakehouse Work? 

From the top to the bottom of the data lakehouse stack, each constituent service is more specialized than the service that sits “underneath” it.

      • Data lakehouse: The data lakehouse is a highly specialized abstraction layer or a semantic layer. That exposes data in the lake for operational reporting, ad hoc query, historical analysis, planning and forecasting, and other data warehousing workloads.
      • Data lake: The data lake is a less specialized abstraction layer. That schematizes and manages the objects contained in an underlying object storage service, and schedules operations to be performed on them. The data lake can efficiently ingest and store data of every type. Like structured relational data (which it persists in a columnar object format), semi structured (text, logs, documents), and un or multi structured (files of any type) data.
      • Object storage: As the foundation of the lakehouse stack, object storage consists of an even more basic abstraction layer: A performant and cost-effective means of provisioning and scaling storage, on-demand storage.

Again, for data lakehouse to work, the architecture must be loosely coupled. For example, several public cloud SQL query services, when combined with cloud data lake and object storage services, can be used to create the data lakehouse. This solution is the “ideal” data lakehouse in the sense that it is a rigorous implementation of a formal, loosely coupled architectural design. The SQL query service runs against the data lake service, which sits on top of an object storage service. Subscribers instantiate prebuilt queries, views, and data modeling logic in the SQL query service, which functions like a semantic layer. And this whole solution is the data lakehouse.

This implementation is distinct from the data lakehouse services that Databricks, Dremio, and others market. These implementations are coupled to a specific data lake implementation, with the result that deploying the lakehouse means, in effect, deploying each vendor’s data lake service, too.

The formal rigor of an ideal data lakehouse implementation has one obvious benefit: It is notionally easier to replace one type of service (for example, a SQL query) with an equivalent commercial or open-source service.

What Is New and Different About the Data Lakehouse?

It all starts with the data lake. Again, the data lakehouse is a higher-level abstraction superimposed over the data in the lake. The lake usually consists of several zones, the names, and purposes of which vary according to implementation. At a minimum, Lakehouse consist of the following:

      • one or more ingest or landing zones for data.
      • one or more staging zones, in which experts work with and engineer data; and
      • one or more “curated” zones, in which prepared and engineered data is made available for access.

Usually, the data lake is home to all an organization’s useful data. This data is already there. So, the data lakehouse begins with query against this data where it lives.

It is in the curated (GOLD) zone of the data lake that the data lakehouse itself lives. Although it is also able to access and query against data that is stored in the lake’s other zones. In this way the data lakehouse can support not only traditional data warehousing use cases, but also innovative use cases such as data science and machine learning and artificial intelligence engineering.

The following are the advantages of the data lakehouse.

  1. More agile and less fragile than the data warehouse

Querying against data in the lake eliminates the multistep process involved in moving the data, engineering it and moving it again before loading it into the warehouse. (In extract, load, transform [ELT], data is engineered in the warehouse itself. This removes a second data movement operation.) This process is closely associated with the use of extract, transform, load (ETL) software. With the data lakehouse, instead of modeling data twice — first, during the ETL phase, and, second, to design denormalized views for a semantic layer, or to instantiate data modeling and data engineering logic in code — experts need only perform this second modeling step.

The result is less complicated (and less costly) ETL, and a less fragile data lakehouse.

  1. Query against data in place in the data lake

Querying against the data lakehouse makes sense because all an organization’s business-critical data is already there — that is, in the data lake. Data gets stored into the lake from sensors and other sources, from workload, business apps and services, from online transaction processing systems, from subscription feeds, and so on.

The strong claim is that the extra ability to query against data in the whole of the lake — that is, its staging and non-curated zones — can accelerate data delivery for time-sensitive use cases. A related claim is that it is useful to query against data in the lakehouse, even if an organization already has a data warehouse, at least for some time-sensitive use cases or practices.

The weak claim is that the lakehouse is a suitable replacement for the data warehouse.

  1. Query against relational, semi-structured, and multi-structured data

The data lakehouse sits atop the data lake, which ingests, stores and manages data of every type. Moreover, the lake’s curated zone need not be restricted solely to relational data: Organizations can store and model time series, graph, document, and other types of data there. Even though this is possible with a data warehouse, it is not cost-effective.

  1. More rapidly provision data for time-sensitive use cases

Expert users — say, scientists working on a clinical trial — can access raw trial results in the data lake’s non-curated ingest zone, or in a special zone created for this purpose. This data is not provisioned for access by all users; only expert users who understand the clinical data are permitted to access and work with it. Again, this and similar scenarios are possible because the lake functions as a central hub for data collection, access, and governance. The necessary data is already there, in the data lake’s raw or staging zones, “outside” the data lakehouse’s strictly governed zone. The organization is just giving a certain class of privileged experts early access to it.

  1. Better support for DevOps and software engineering

Unlike the classic data warehouse, the lake and the lakehouse expose various access APIs, in addition to a SQL query interface.

For example, instead of relying on ODBC/JDBC interfaces and ORM techniques to acquire and transform data from the lakehouse — or using ETL software that mandates the use of its own tool-specific programming language and IDE design facility — a software engineer can use preferred dev tools and cloud services, so long as these are also supported by team’s DevOps toolchain. The data lake/lakehouse, with its diversity of data exchange methods, its abundance of co-local compute services, and, not least, the access it affords to raw data, is arguably a better “player” in the DevOps universe than is the data warehouse. In theory, it supports a larger variety of use cases, practices, and consumers — especially expert users.

True, most RDBMSs, especially cloud PaaS RDBMSs, now support access using RESTful APIs and language-specific SDKs. This does not change the fact that some experts, particularly software engineers, are not — at all — charmed of the RDBMS.

Another consideration is that the data warehouse, especially, is a strictly governed repository. The data lakehouse imposes its own governance strictures, but the lake’s other zones can be less strictly governed. This makes the combination of the data lake + data lakehouse suitable for practices and use cases that require time-sensitive, raw, lightly prepared, so on, data (such as ML engineering).

  1. Support more and different types of analytic practices.

For expert users, the data lakehouse simplifies the task of accessing and working with raw or semi-/multi-structured data.

Data scientists, ML, and AI engineers, and, not least, data engineers can put data into the lake, acquire data from it, and take advantage of its co-locality with an assortment of intra-cloud compute services to engineer data. Experts need not use SQL; rather, they can work with their preferred languages, libraries, services and tools (notebooks, editors, and favorite CLI shells). They can also use their preferred conceptual vocabularies. So, for example, experts can build and work with data pipelines, as distinct to designing ETL jobs. In place of an ETL tool, they can use a tool such as Apache Airflow to schedule, orchestrate, and monitor workflows.

Summary

It is impossible to untie the value and usefulness of the data lakehouse from that of the data lake. In theory, the combination of the two — that is, the data lakehouse layered atop the data lake — outperforms the usefulness, flexibility, and capabilities of the data warehouse. The discussion above sometimes refers separately to the data lake and to the data lakehouse. What is usually, however, is the co-locality of the data lakehouse with the data lake — the “data lake/house,” if you like.