Navigating Data Management: From Warehouses to Lakes and the Delta Lake Advantage
In the ever-evolving landscape of data management, organizations are increasingly confronted with the challenge of efficiently storing, processing, and analyzing vast amounts of data. Two popular solutions have emerged to address these needs: data warehouses and data lakes.
What is a Data Warehouse?
Data Warehouses came into existence in the 1980s, when businesses wanted to make decisions based on all of the data available across an organization in one place, rather than looking at data in individual departments. A Data Warehouse mainly consisted of the operational data available within the organization.
Some large data warehouses also gathered external data to make intelligent decisions. The data received in a data warehouse is mainly structured, such as SQL tables or CSV files, or semi-structured, such as JSON or XML files. They didn’t have the ability to process unstructured data.
The data received then goes through an ETL (Extract, Transform, Load) process and is loaded into a Data Warehouse or Data Marts. The data loaded into the Data Warehouse/Data Marts was cleaned, validated and augmented with business meaning.
Also, the data is generally highly aggregated to provide meaningful business value, such as Key Performance Indicators (KPIs).
This data is then consumed by analysts and business managers via BI reports.
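To make that flow concrete, here is a minimal PySpark sketch of a classic extract-transform-load pipeline that builds an aggregated KPI table and loads it into a warehouse over JDBC. The file path, JDBC URL, table names and columns are hypothetical placeholders, and modern tooling is used purely for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("warehouse-etl").getOrCreate()

# Extract: read operational data exported as CSV (hypothetical path)
orders = spark.read.option("header", True).csv("/staging/orders.csv")

# Transform: basic validation, type casting and a KPI-style aggregation
daily_revenue = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .withColumn("amount", F.col("amount").cast("double"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("daily_revenue"))
)

# Load: write the aggregated result into the warehouse over JDBC
(daily_revenue.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5432/dw")
    .option("dbtable", "marts.daily_revenue")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("append")
    .save())
```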
They were very valuable for making business decisions and most large companies had at least one Data Warehouse by the early 2000s.
With the rise of the Internet, the volume of data started to increase significantly, and the variety of that data started to change as well.
We started to see unstructured data such as videos, images and text files, and these were very valuable for making decisions.
But Data Warehouses lacked support for this unstructured data.
In a Data Warehouse, data was only loaded after its quality had been checked and it had been transformed.
This meant it took longer to develop a solution to get new data into the Data Warehouse.
Data Warehouses were built on traditional relational databases, which meant they used proprietary file formats and resulted in vendor lock-in.
Also, traditional on-premises data warehouses were very difficult, and at times impossible, to scale.
This meant large data migration projects were required in order to scale up those databases. Storage with the large vendors was expensive, and it wasn’t possible to scale storage independently of compute.
Data Warehouses also didn’t provide sufficient support for Data Science, Machine Learning or AI workloads.
How about Data Lakes?
On the other hand, Data Lakes offer a more flexible and scalable approach, accommodating both structured and unstructured data. They store raw data in its native format, enabling organizations to capture diverse data types — from text and images to logs and social media feeds. This flexibility supports advanced analytics and machine learning initiatives, allowing data scientists to explore and derive insights from a broader range of data sources.
The Data Lake architecture was aimed at solving the issues we just discussed with data warehouses. Data Lakes came into existence around 2011.
Data Lakes can handle not only structured and semi-structured data, but also unstructured data, which makes up roughly 90% of the data we see today.
Data is ingested into a Data Lake without any kind of cleansing or transformation, which resulted in quicker timescales to develop solutions as well as fast ingestion times.
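As an illustration, here is a minimal PySpark sketch of this "land it raw, worry about structure later" approach. The bucket paths are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Read the raw JSON export exactly as it arrived, with no cleansing
raw_events = spark.read.json("s3a://landing-zone/app-events/2024-06-01/")

# Append it straight into the lake's raw layer; transformation happens later, if at all
raw_events.write.mode("append").parquet("s3a://data-lake/raw/app_events/")
```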
HDFS, as well as cloud object stores such as Amazon S3 and Azure Data Lake Storage, were really cheap, so organizations could ingest all of their data without worrying too much about the cost, which is great.
Data Lakes were built on open-source file formats such as Parquet, ORC or Avro, which meant we could use a wide variety of software and libraries to process and analyze the data.
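Because the files are in an open format, the same data can be read without Spark at all. Here is a minimal sketch using PyArrow and pandas, chosen purely as examples of independent tooling; the file path is a hypothetical placeholder.

```python
import pyarrow.parquet as pq

# Read one of the lake's Parquet files with PyArrow, no cluster required
table = pq.read_table("/data-lake/raw/app_events/part-00000.parquet")

# Hand it to pandas for local exploration
df = table.to_pandas()
print(df.head())
```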
Data Science and Machine Learning workloads could use both the raw data and the transformed data in the Data Lake.
But there were two major problems: Data Lakes were too slow to serve interactive BI reports, and there was a lack of governance over the data.
So the industry moved towards copying a subset of the data from the Data Lake back into a Data Warehouse, again to support these BI reports.
In summary, Data Warehouses were very good at dealing with BI workloads, but they lacked support for streaming, Data Science and Machine Learning workloads.
Data Lakes, on the other hand, were mainly focused on Data Science and Machine Learning workloads, but fell short of satisfying traditional BI workloads. They supported streaming workloads, yet it was difficult to combine streaming and batch. On top of this, due to the lack of support for ACID transactions, we ended up with unreliable data swamps in our Data Lakes.
Delta Lakes: The silver bullet
The Lakehouse architecture aims to bring the best of both Data Warehouses and Data Lakes.
It has been designed to provide good BI support as well as Data Science and Machine Learning support.
Let’s have a look at how a Data Lakehouse Architecture looks.
Similar to Data Lakes, we can ingest operational and external data into a Delta Lake. A Delta Lake is essentially a Data Lake with ACID transaction controls. Thanks to ACID transactions, we can now combine streaming and batch workloads and eliminate the need for a Lambda architecture.
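As a concrete illustration, here is a minimal PySpark sketch of a streaming write and a batch read hitting the same Delta table. It assumes the delta-spark package is installed and on the classpath; the S3 paths and the event schema are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# SparkSession configured with the Delta Lake extensions
spark = (SparkSession.builder.appName("delta-streaming-batch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Hypothetical schema for the incoming JSON events
event_schema = StructType([
    StructField("event_type", StringType()),
    StructField("user_id", StringType()),
    StructField("ts", TimestampType()),
])

# Streaming write: continuously append new events into a Delta table
events = spark.readStream.schema(event_schema).json("s3a://landing-zone/events/")
(events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3a://data-lake/_checkpoints/events")
    .outputMode("append")
    .start("s3a://data-lake/delta/events"))

# Batch read: ACID transactions mean the same table can be queried
# consistently while the stream keeps writing to it
snapshot = spark.read.format("delta").load("s3a://data-lake/delta/events")
snapshot.groupBy("event_type").count().show()
```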
This data can then be transformed more efficiently, without the need to rewrite entire partitions when reprocessing data, or to rewrite whole tables when processing GDPR requests.
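For example, a GDPR "right to be forgotten" request becomes a record-level delete rather than a partition rewrite. Here is a minimal sketch using the delta-spark API; the table path and column values are hypothetical, and `spark` is the Delta-enabled session created in the sketch above.

```python
from delta.tables import DeltaTable

# `spark` is the Delta-enabled SparkSession from the earlier sketch
customers = DeltaTable.forPath(spark, "s3a://data-lake/delta/customers")

# Delete one customer's records; Delta rewrites only the affected files
customers.delete("customer_id = 'C-12345'")

# Record-level updates work the same way, which also simplifies reprocessing
customers.update(
    condition="country = 'UK'",
    set={"country": "'GB'"},
)
```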
Data from any of these Delta tables can then be used for Data Science and Machine Learning workloads. Delta Lake also provides connectors to BI tools such as Power BI and Tableau.
Also, we can have roles and access controls defined for data governance. This removes the need for copying the data to a Data Warehouse.
So this is the Lakehouse architecture in a nutshell, although some projects still run BI reports from a relational database rather than from the Delta Lake because of the performance gap.
Benefits of moving to a Lakehouse Architecture
Similar to a Data Lake, Delta Lakes handle all types of data and still run on cloud object stores such as S3 and ADLS, so we keep the cost benefits, and they use open-source file formats.
They support all types of workloads, such as BI, Data Science and Machine Learning and provide the ability to use BI tools directly on them.
Most importantly, they provide ACID support, history and versioning. This helps prevent the unreliable data swamps we saw being created in Data Lakes.
They provide better performance than Data Lakes and a much simpler architecture: we no longer need a Lambda architecture for streaming and batch workloads, and we could potentially remove the need to copy data from our Data Lake to the Data Warehouse.
How does Delta Lake provide the capabilities we just discussed?
Similar to a Data Lake, Delta Lake also stores the data as Parquet files, which is an open-source format.
The key difference is that, alongside those Parquet files, Delta Lake also maintains a transaction log. This is what provides history, versioning, ACID transaction support, time travel and so on, and it is what really distinguishes a Delta Lake from a Data Lake.
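Here is a minimal sketch of what the transaction log makes possible, again assuming the hypothetical events table and the Delta-enabled SparkSession from the earlier sketches.

```python
from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, "s3a://data-lake/delta/events")

# Every commit (write, delete, merge, ...) is recorded in the _delta_log folder
events.history().select("version", "timestamp", "operation").show()

# Time travel: read the table exactly as it looked at an earlier version
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("s3a://data-lake/delta/events"))
print(v0.count())
```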
The next layer is the Delta Engine, which is a Spark-compatible query engine optimized for performance on Delta Lakes. It also uses Spark to process the transaction logs, so that this work can be distributed as well.
We then have the Delta tables, which we can register with the Hive metastore to provide security via roles and access controls, governance, data integrity via constraints, as well as better optimizations.
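Here is a minimal sketch of registering an existing Delta table in the metastore and adding a data-integrity constraint. The database, table and column names are hypothetical, CHECK constraints need a reasonably recent Delta Lake release, and `spark` is the Delta-enabled session from earlier.

```python
# Register the existing Delta files as an external table in the metastore
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events
    USING DELTA
    LOCATION 's3a://data-lake/delta/events'
""")

# Data integrity: the constraint is enforced on every future write
spark.sql("ALTER TABLE analytics.events ADD CONSTRAINT valid_ts CHECK (ts IS NOT NULL)")

# Once registered, the table can be governed and queried by name
spark.sql("""
    SELECT event_type, count(*) AS events
    FROM analytics.events
    GROUP BY event_type
""").show()
```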
On the Delta tables, we can run Spark workloads such as SQL transformations, Machine Learning workloads and streaming workloads, just as we do on a Data Lake, and most importantly, BI workloads can now access the Delta Lake directly.
Conclusion
Ultimately, adopting a Lakehouse architecture like Delta Lake not only streamlines data management processes but also empowers organizations to harness their data more effectively. This positions them to make informed, data-driven decisions in an increasingly competitive environment, eliminating the pitfalls associated with traditional data management solutions.
Delta Lake has been around since 2018 and is rapidly evolving, but its performance still can’t match a Data Warehouse when it comes to reporting.
So if you’re looking for the level of performance you get from a Data Warehouse, you will still have to copy the data into one.
I think in a few years Delta Lakes will offer comparable performance and we won’t have to go to a Data Warehouse at all. That’s just my opinion.