Data warehouse? Data lake? How do you choose? In this episode Ori Soen and Dadi Atar discuss the differences between data warehouses and data lakes: [1] Defining data warehouses and data lakes [2] The tradeoffs organizations need to consider when choosing one or the other, and how to choose between them [3] 2025 trends in the warehouse/lakehouse space. Any further questions? Ask in the comments and the team will respond!
Transcript
Welcome to Data Unscripted, where we break down the complex world of data engineering and analytics. I'm Ori, and with me as always is Dadi. Hey Dadi. Hey, what's up? It's all good. Today we're diving into a topic that's causing quite a stir in the data world: data warehouses versus lakehouses. Dadi, why don't we start with the basics? What is the fundamental difference between data warehouses and lakehouses? Alright, so I'm going to explain it with a library analogy. OK, so a data warehouse is like a highly organized library where every book has its proper place and is stored in a very specific format. It's optimized for structured data and fast querying, traditionally using specialized storage formats and processing engines. Some of its features include what's called atomic transactions, or ACID, which is basically the ability to recover cleanly from a sequence of events and make sure the data is always consistent. Schema enforcement, so for example I can prevent inserting a number into a string column, and so on and so forth. OK, and what about a lakehouse? OK, so moving along with the library example, a lakehouse is like a modern library that can handle not only books but also videos, audio recordings and digital art, right? A lot of different files and formats, while maintaining the same level of organization. So it potentially combines the best features of a data warehouse, but it can store any type of data and still have warehouse-like management and performance features. Got it. Can you give us some concrete examples of what that means in practice? So let's return to our imaginary ecommerce business. In a traditional data warehouse, you would normally store your sales transactions, your customer information, your inventory data, all very structured data that fits nicely into tables and columns and schemas. But what about product images, or verbatim customer reviews, or chatbot logs?
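The schema enforcement Dadi describes, rejecting a number inserted into a string column, can be sketched minimally in Python. This is an illustrative toy, not any warehouse's actual API; the `SCHEMA` table and `enforce_schema` function are invented for the example.

```python
# Toy sketch of warehouse-style schema enforcement: reject any row whose
# values don't match the declared column types. Names are illustrative.

SCHEMA = {"order_id": int, "customer": str, "amount": float}

def enforce_schema(row: dict) -> dict:
    """Raise TypeError if any value violates the declared column type."""
    for column, expected in SCHEMA.items():
        value = row[column]
        if not isinstance(value, expected):
            raise TypeError(
                f"column {column!r} expects {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return row

# A well-typed row passes through unchanged...
enforce_schema({"order_id": 1, "customer": "Ada", "amount": 19.99})

# ...but inserting a number into a string column is rejected.
try:
    enforce_schema({"order_id": 2, "customer": 42, "amount": 5.0})
except TypeError as err:
    print(err)
```

A real engine does this check (and much more, like nullability and precision rules) at write time, which is what keeps downstream queries from hitting malformed data.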
This is where you traditionally need a separate data lake to store those various types of formats. OK, and I assume a lakehouse handles both of these in a good way? Yeah, exactly. So in the lakehouse, you can store and analyze all of that data in one place. Plus you get warehouse features like atomic transactions, schema enforcement and optimized performance, but with the flexibility to also run machine learning workloads, process streaming data and run various other jobs. OK, that sounds great in theory, but if I've learned anything in my lifetime, it's that there's always a trade-off somewhere. Yeah, well, it really comes down to organizations wanting the best of both worlds. Traditional data warehouses give you rock-solid performance for SQL analytics and beyond, right, but they weren't great with unstructured data or machine learning workloads. And data lakes solved that problem, but brought their own challenges around data quality and governance, and basically, how do you structure all that noisy, unstructured data? Yeah, I remember the early days of data lakes. They often turned into what's now called data swamps pretty quickly. Yeah, and that was a real problem. Like, the idea was to dump in all your data in various formats and have it ready for querying or doing other stuff on top of it. But that's not really how it played out. Without proper metadata management and governance, it becomes impossible to find anything useful in that data. It's just noise. Plus, you had no guarantee of data quality or data consistency, right? That's where lakehouses came in. Exactly. Lakehouses try to bridge this gap. They're built on open table formats like Apache Iceberg or Delta Lake, which add structure and warehouse-like features to the data lake. So ACID transactions like we talked about earlier, schema enforcement, data versioning, and it sits directly on top of cloud storage. You know, it's pretty clever, actually.
OK, let's dig into some of those open formats a little bit. What makes them so special? OK, so take Delta Lake for example. It introduces the concept of a transaction log: every action is logged, and that tracks all the changes to your data files. So this means you can do time travel, looking at your data as it existed at any point in the past. It's super powerful when you want to debug issues or do anything that has to do with compliance. Another example is Apache Iceberg, which is gaining a lot of traction recently. You get this amazing schema evolution capability: you can add or modify columns without having to rewrite all your data. Wow. I've also noticed that even traditional warehouse vendors are adapting to these trends. Yeah, because they had to, right? So take Snowflake for example. They now support unstructured data and machine learning workloads. They've added features like Snowpark for Python developers and direct support for running ML models. And meanwhile Databricks, which originally came from the data lake world, has really improved its SQL performance and warehouse-like functionality. They call their SQL engine Photon. So they're all trying to converge into a single offering that gives the best of both worlds. Yeah, it's like an arms race from two different directions converging into one. That's exactly what it is. And you know, cloud providers are in this game too. Look at AWS: they've got Redshift for traditional warehousing, but they've also launched Lake Formation for lakehouse architecture, and there's the recent announcement called S3 Tables, which basically offers a warehouse-like interface on top of S3 and Apache Iceberg. Google's got BigQuery, which now supports unstructured data, and they have this really cool feature called BigQuery ML for in-database machine learning, and the list goes on and on.
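The transaction-log idea behind Delta Lake's time travel can be sketched in a few lines of Python: every write is an appended, versioned entry, so the table can be reconstructed "as of" any earlier commit. This is a toy illustration of the concept only, not Delta Lake's actual log format, and the class and method names are invented.

```python
# Toy versioned table: an append-only log of commits enables "time travel",
# i.e. replaying the log up to any earlier version. Illustrative only.

class VersionedTable:
    def __init__(self):
        self._log = []  # ordered list of (operation, row) commit entries

    def insert(self, row: dict) -> int:
        self._log.append(("insert", row))
        return len(self._log)  # version number after this commit

    def as_of(self, version: int) -> list:
        """Time travel: rebuild table state by replaying `version` commits."""
        state = []
        for op, row in self._log[:version]:
            if op == "insert":
                state.append(row)
        return state

    def current(self) -> list:
        return self.as_of(len(self._log))

table = VersionedTable()
v1 = table.insert({"sku": "A", "qty": 3})
v2 = table.insert({"sku": "B", "qty": 7})
print(table.as_of(v1))   # the table as it existed after the first commit
print(table.current())   # the latest state, with both rows
```

Real formats add deletes, updates, compaction and snapshot metadata on top of this, but the debugging and compliance value Dadi mentions comes exactly from being able to replay the log to an older version.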
But the major trend here is convergence, offering the best of both worlds: unstructured data support from the data lake side, and governance and schema enforcement from the warehouse side. Got it, got it. So let's move on and talk about what's new in 2025. What other trends are we seeing? OK, so there are three big ones that stand out in my opinion. First, there's this massive push towards real-time analytics. Organizations don't want to wait for those nightly batch jobs anymore. They want insights as fast as the data comes in. Yeah, I'm seeing that too, especially with streaming data. Exactly. And it's not just about ingesting streaming data anymore. Organizations want to process and analyze that data in real time, often combining it with historical data. So we're seeing platforms add features like materialized views that automatically update as new data arrives, streaming SQL capabilities, real-time ML modeling, and so on and so forth. But the key point here is real time, or as near real time as possible. OK. Well, when you talk about real time, you have to also assume it puts a lot of pressure on infrastructure, because you've got to have a lot of compute to do that. It does, it does. So that's why we're seeing these platforms get smarter about resource management: they're developing sophisticated algorithms to predict expected workload patterns and automatically scale resources up and down according to the requirements. Some can even optimize query plans based on these historical patterns. But the key theme here is optimization and workload adjustment according to the job to be done. Right. OK. So we talked about a number of trends. What other trends are we seeing? The other big one is the democratization of data.
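The "materialized view that automatically updates as new data arrives" idea can be sketched minimally: instead of re-running a batch aggregation over all history, each incoming event incrementally updates a running per-key total. This is a hand-rolled illustration under invented names; real platforms do this inside the query engine.

```python
# Sketch of an incrementally maintained aggregate: a per-region revenue
# total that updates in O(1) per event, with no full batch recompute.

from collections import defaultdict

class RevenueByRegion:
    """A tiny incrementally maintained view: sum of amount per region."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, event: dict) -> None:
        # Called once per streaming event as it arrives.
        self.totals[event["region"]] += event["amount"]

view = RevenueByRegion()
for event in [
    {"region": "EU", "amount": 10.0},
    {"region": "US", "amount": 4.5},
    {"region": "EU", "amount": 2.5},
]:
    view.on_event(event)

print(dict(view.totals))  # {'EU': 12.5, 'US': 4.5}
```

The compute pressure Ori raises comes from the harder cases: aggregates like medians or joins can't always be updated this cheaply, which is why engines invest in smart incremental-maintenance and scaling strategies.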
We're seeing more serverless options, pay-as-you-go pricing and simplified management interfaces, where the goal is to make these powerful platforms accessible to smaller teams that don't have dedicated data engineers. So we have a lot of data consumers with various skill sets who don't have these engineering resources, and we want to enable them to work on these platforms as well. Right, and these trends obviously mean that these powerful data warehouse capabilities are going to be available to all sorts of organizations, not just the big enterprises. Exactly. Yeah. So we're seeing platforms that can automatically handle a lot of the complex optimization and maintenance tasks that used to require the kind of specialized expertise mostly enterprises have. So this means smaller teams can now leverage enterprise-grade data infrastructure without needing a large engineering team. Well, that's all fine and dandy, but with all of these choices and options, how does an organization go about making a selection? Yeah, that's a great question. I always tell teams to consider three things. First, what's your data variety like? If you're mostly dealing with structured data and SQL analytics, a traditional warehouse is probably your best bet. You don't need to overshoot here. If you need to support diverse workloads, say you have machine learning or data scientists, streaming data, SQL analytics and unstructured data, then a lakehouse architecture might make more sense in that context. Yeah, and cost has to be a factor as well. Yeah, it is. It's the second consideration. So you need to look at your query patterns, the volume of your data and how many concurrent users you need to support, because the pricing models can be quite different. Some platforms charge by storage, others by compute, others by query volume. You need to understand the analytical work that gets done and choose the right platform for that, right?
And I assume that as you do more and more work on these platforms, cost optimization is going to become a growing concern, something very important. How do you deal with that? Yeah, yeah. So you know what's interesting? Many organizations don't even have visibility into their cloud data costs until the bill arrives. And it gets very costly. Teams are running queries without knowing the cost of what they're doing. OK, I'm sure we're going to see a lot of innovation in that area, because the cost of actually analyzing data is skyrocketing as companies do more and more of it, and they'll have to figure it out. But let's move on to the third factor we were talking about. Yeah. So the third factor is skill set and ecosystem. What tools does your team know? What other systems do you need to integrate with? There's no point choosing a platform, even a best-of-breed point solution, that your team will struggle to maintain or that doesn't play nicely with your existing stack. That's very important and something a lot of people struggle with. OK. So maybe finally, any prediction on where all of this is going? Yeah, I think the general theme is convergence. The distinction between lakehouses and warehouses will continue to blur, and we'll probably see more specialization in terms of industry-specific solutions, like IoT or healthcare. But in broad strokes, convergence: both the warehouse and lakehouse offerings and the peripheral tooling are going to converge into one offering. OK, well, the million-dollar question here is, where does AI factor in? Yeah, of course. It's going to be even more central to how we manage and analyze data. We're already seeing the emergence of what some call AI-first data platforms. These systems use AI not just for analysis, but across the entire data life cycle, from ingestion to query optimization to governance, and the list goes on and on. OK. Maybe another angle that we haven't talked about yet is open source.
Is open source going to continue to drive this industry, to continue to be important? Yeah, I can't stress that enough. It's going to be crucial. The innovation we're seeing in projects like Iceberg, Delta Lake and Arrow, to name just a handful of examples, is driving a lot of the convergence we talked about. Plus, organizations want to avoid vendor lock-in and proprietary tech. They want the flexibility to run their data workloads wherever makes the most sense, without cumbersome migrations in case they want to move to another offering. That's always true, and great insights as always, Dadi. Before we wrap up, this was all a little bit complex. Any final advice for teams evaluating these platforms? Yeah, yeah, I'm just going to wrap up with this: start with the use cases, not with the technology, especially in a platform shift with all those AI buzzwords. Just try solving the problems you're currently facing. Understand the problem you're trying to solve and what your future needs will be. And don't forget about the human aspect: tools are only as good as the teams using them, as the old saying goes. Smart advice, Dadi. Thanks so much to our listeners. Thanks for tuning in to another episode of Data Unscripted. Remember to subscribe, leave us a review, and until next time, keep your data flowing and your pipelines running. Bye bye, talk to you later.