Everything you really need to know about a data lakehouse
16 August 2022
Over the past few years, we’ve seen substantial changes in how data is collected, organized, stored, and accessed. Not long ago, data was commonly stored in a data warehouse. Then, data lakes gained popularity with the rise of cloud computing. Today, data lakehouses are the talk of the town when it comes to data architecture. But why is that? And why is it happening right now? Let's take a refreshing dive into the history of data warehouses, data lakes, and data lakehouses.
Data warehouses are just perfect for BI reporting
A data warehouse is a central repository where data is stored in one or more relational databases. The key takeaway is that a data warehouse holds structured data that has been processed for a specific purpose through ETL (Extract, Transform & Load) processes. These processes deliver a solid basis for faster, more performant data analysis because the data is stored in a way that perfectly meets the needs of business end-users. Data warehouses therefore have a long history in BI reporting applications. For more than seven years, Datashift BI professionals have helped companies develop and execute strategies to build a future-proof data warehouse.
But while the data in a data warehouse is easy to access and use, the downside is that ETL processes can be time-consuming and only prepare the data for its original purpose. Moreover, only a small fraction of all data is available in structured form. Over the past years, we have seen an explosion of semi-structured and unstructured data such as video, images, JSON files, geospatial data, and sensor data from IoT devices, to give just a few examples.
Data lakes created new opportunities for exploiting ever-growing amounts of data
Data lakes overcome those drawbacks. They are best described as vast pools of raw data for which the purpose has not yet been defined. They follow a schema-on-read approach: structure is applied only when the data is read.
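To make schema-on-read concrete, here is a minimal PySpark sketch. The landing path and field names are hypothetical; the point is that the raw JSON files carry no predefined schema, and structure is imposed only at the moment the data is read.

```python
# Schema-on-read: raw JSON lands in the lake as-is; we declare the
# structure we need only when we read the data.
# Path and field names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Declare only the fields this analysis needs; anything else in the
# raw files simply stays untouched on disk.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("occurred_at", TimestampType()),
])

events = spark.read.schema(event_schema).json("s3://my-lake/raw/events/")
events.show()
```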
Especially with the availability of low-cost cloud storage, data lakes have come into vogue over the past years. Another factor that has contributed to their growing popularity is that additional components can be plugged in with modest effort to develop new data science, machine learning, or streaming analytics use cases. That even makes it possible, for example, to build a centralized data hub acting as a 360-degree layer surrounding your data sources and helping you step up customer engagement.
A data lakehouse combines the best of both worlds
Compared to data warehouses, data lakes lack some features, though. In particular, they tend to struggle with data quality, transaction support (to ensure data consistency when multiple parties read or write data concurrently), data governance, and query performance. Therefore, data teams sometimes knit both systems together, at the added cost of duplicated data, increased operational overhead, and higher infrastructure costs.
This is where a data lakehouse steps in. A data lakehouse combines the kind of data structures and data management features that you’ll typically find in a data warehouse with the low-cost storage of a data lake. Just like a data lake, it provides a single system to store structured, semi-structured, and unstructured data. In addition, it eliminates data duplication and ensures access to up-to-date data for BI reporting, data science, and machine learning applications.
Cloud computing has been a critical enabler for data lakehouses at every stage: ingesting, transforming, and serving data.
Data ingestion
Ingesting data is cheap (storing 2 TB costs only about 40 EUR per month), especially when data is stored in the compressed Apache Parquet file format. Thanks to their columnar structure, Apache Parquet files also support significantly faster processing.
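As an illustration, here is a hedged PySpark sketch of writing compressed Parquet and then reading back a single column; the paths and column names are made up for the example.

```python
# Writing snappy-compressed Parquet and reading back one column;
# the columnar layout means only that column is scanned from storage.
# Paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

raw = spark.read.json("s3://my-lake/landing/sensor-readings/")
raw.write.mode("overwrite") \
   .option("compression", "snappy") \
   .parquet("s3://my-lake/bronze/sensor-readings/")

# A columnar read: Spark can skip every column except "temperature".
readings = spark.read.parquet("s3://my-lake/bronze/sensor-readings/")
readings.select("temperature").summary("min", "max", "mean").show()
```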
In addition to reducing storage needs and speeding up processing, Apache Parquet has the benefit of being open-source, eliminating the risk of vendor lock-in. Its sole disadvantage is that it doesn’t allow for file updates. Delta Lake, an open-source data storage layer developed by Databricks, is an interesting alternative as it stores data in the Parquet file format with an additional layer providing update, delete, and merge capabilities.
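To show what those capabilities look like in practice, here is a sketch of an upsert using the open-source delta-spark package. The table path and matching key are hypothetical, and the Spark session must be configured with the Delta extensions.

```python
# Upserting into a Delta table: Parquet files underneath, plus a
# transaction log that enables update/delete/merge. Requires the
# delta-spark package; paths and the key column are illustrative.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-upsert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)

updates = spark.read.parquet("s3://my-lake/staging/customers/")
target = DeltaTable.forPath(spark, "s3://my-lake/silver/customers/")

# Update existing customers, insert new ones, in a single transaction.
(target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```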
Data lake design patterns
Using proven data lake design patterns, we can organize the data lakehouse in layers, where each layer adds business value on top of the previous one. The data, stored in either Apache Parquet or Delta Lake, can be cleaned, transformed into dimensions and facts, and finally converted into business data sets within the data lake.
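One widely used convention, often called the medallion architecture, names these layers bronze, silver, and gold. The PySpark sketch below assumes that naming; paths, columns, and business rules are illustrative only.

```python
# A common layering convention (bronze/silver/gold): each layer
# refines the previous one. Paths and logic are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-layers").getOrCreate()

# Bronze: raw data exactly as ingested.
raw = spark.read.parquet("s3://my-lake/bronze/orders/")

# Silver: cleaned and conformed (deduplicated, typed, invalid rows dropped).
clean = (raw.dropDuplicates(["order_id"])
            .filter(F.col("amount") > 0)
            .withColumn("order_date", F.to_date("order_ts")))
clean.write.mode("overwrite").parquet("s3://my-lake/silver/orders/")

# Gold: business-ready data sets, e.g. revenue per customer per day.
revenue = (clean.groupBy("customer_id", "order_date")
                .agg(F.sum("amount").alias("revenue")))
revenue.write.mode("overwrite").parquet("s3://my-lake/gold/daily_revenue/")
```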
Transforming data
Transforming data can typically be achieved with a serverless pool of Apache Spark applications. Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big data analytics applications. This open-source framework, which works very well with Parquet files, supports the preparation and processing of large volumes of data in multiple languages, including Python, Scala, .NET, and SQL. Importantly, Apache Spark pools scale quickly, so they can grow as your data sets get larger over time. Once again, you only pay for what you use. To give an idea: if you were running three four-core Spark clusters for 2 hours every night, you would pay about 80 EUR per month.
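To illustrate the in-memory aspect, here is a small PySpark sketch that caches a data set once and reuses it across several aggregations; the path and column names are again hypothetical.

```python
# Spark's in-memory processing: cache a data set once, then run
# several aggregations against it without re-reading from storage.
# Path and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform").getOrCreate()

orders = spark.read.parquet("s3://my-lake/silver/orders/").cache()

# Both queries below reuse the cached in-memory copy.
orders.groupBy("country").agg(F.sum("amount").alias("revenue")).show()
orders.groupBy("product_id") \
      .agg(F.countDistinct("customer_id").alias("unique_buyers")).show()
```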
Serving data
Serving data is really straightforward because all data can be accessed where it lives (on the data lake) and queried using a serverless SQL pool. And it’s cheap: if you were querying 2 TB a month, the cost would amount to only about 5 EUR per month.
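Since we touch on Azure Synapse below, here is one hedged sketch of what querying data in place could look like from Python with pyodbc against a Synapse serverless SQL endpoint. The server name, database, credentials, and storage URL are all hypothetical placeholders.

```python
# Querying Parquet files where they live, through a serverless SQL
# endpoint. The OPENROWSET syntax is Azure Synapse serverless SQL;
# the endpoint, database, credentials, and URL are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=lakehouse;UID=user;PWD=secret"
)

query = """
SELECT TOP 10 customer_id, SUM(revenue) AS total_revenue
FROM OPENROWSET(
    BULK 'https://mylake.dfs.core.windows.net/gold/daily_revenue/*.parquet',
    FORMAT = 'PARQUET'
) AS rows
GROUP BY customer_id
ORDER BY total_revenue DESC;
"""

for row in conn.cursor().execute(query):
    print(row.customer_id, row.total_revenue)
```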
A scalable system that you can build on any cloud platform
Another benefit of data lakehouses is that they decouple storage from compute. Both can be managed and scaled independently, depending on whether you need to serve more concurrent users or process larger data volumes.
You can build a data lakehouse on any cloud platform, whether Microsoft Azure, AWS, or Google Cloud. That being said, you may be interested in reading more about our experiences with Azure Synapse. Or why not get in touch with us directly to discuss how a data lakehouse could power your data platform?