It has been a decade-long dilemma. Do you replace your data warehouse with a data lake? Do you build a lake as well as keeping your warehouse? Could your lake actually be a warehouse? Or vice versa? It’s all just too confusing. Never fear, help is at hand!
Dr. Barry Devlin, Founder and Principal, 9sight Consulting
Barry will be presenting a 3-day course on ‘Design and Build a Data Driven Digital Business—From BI to AI and Beyond’, 1-3 June 2020, London
Watch Barry’s webinar on the subject, ‘Forget Warehouse vs. Lake – You Need a Modern Data Architecture’
Last year was the thirtieth birthday of the data warehouse. By contrast, the data lake is still only nine years old. The image of the old prize-fighter and the upstart contender slugging it out is unavoidable. In the first bout, the data warehouse triumphed: the data lake was too poorly defined, the tools too immature. By the second, mid-decade contest, the data lake was a clear winner: improved technology, combined with burgeoning data volumes and varieties, left the warehouse sprawled on the canvas for many firms. Now, the third fight is at hand with warehouse technology making a comeback and the lake looking more like a swamp in many cases. Who will prevail?
A better question is, who cares? We’re missing the point. Framed in either/or terms, we imply that one construct can displace another. Such thinking is flawed, driven by outdated marketing messaging from the early part of this century. Our prize-fighters must call off the fight—now.
Data warehouses and lakes are complementary concepts that emerge from different business needs and technological possibilities. Three conclusions arise. First, we can—and should—have both. Second, function can be distributed and redistributed between the two environments based on best fit. Third, the warehouse, lake, and other data management constructs can—and should—be fully integrated into a modern data architecture to deliver better, faster, more agile and cost-effective information preparation, processing and delivery to meet rapidly growing and morphing business needs in the digital transformation era.
Warehouse and Lake—What’s the Difference?
Strangely, many people—especially software vendors—answer the question in technology terms. They say a warehouse is built on a traditional relational database system (RDBMS) and thus heavily structured and predefined. A lake, they contend, is built on (or maybe excavated from?) the extended Hadoop ecosystem (EHE)[i] and is thus highly flexible, elastic and cheaper.
I prefer to start from the business needs. A data warehouse responds to the business need to deliver consistent, reconciled answers to key business questions from diverse, inconsistent, and often internal data sources. A data lake supports the business need to access all available data from all conceivable sources as quickly as possible, in order to respond to or anticipate every relevant change in the external world.
A warehouse favours consistency and focus over timeliness and breadth. A lake flows in the opposite direction. Ask the business which is preferred, and no easy answer is forthcoming. The business clearly needs all four aspects at different times and in different circumstances. The conclusion, quite simply, is that you need both a warehouse and a lake.
The answer is a little more nuanced, however. The consistency goal of a warehouse cannot be completely ignored in the data that fills a lake. The timeliness goal of a lake seeps into the warehouse as even C-level decisions are accelerated by digital transformation. Deeper thought shows that the data sources of the warehouse and lake cannot be so cleanly separated. You need an integrated warehouse / lake ecosystem. You need a modern data architecture.
[i] By “extended Hadoop ecosystem”, I mean the multitude of components (such as Spark, Parquet, Kudu, Druid, etc.) that have been developed around the core Hadoop set, as well as the core components themselves.
Oh Dear, What Have We Done?
I speak to many businesses that have replaced their warehouse with a lake. Based on the above, have they made a costly mistake? Possibly, because their IT departments were using the technology-driven definitions of warehouse and lake. Moving warehouse function into the EHE may allow the old RDBMS warehouse to be phased out, but the requirements (consistency, reconciliation and focus) seep into the lake, there to be rebuilt on the burgeoning array of RDBMS components now available in the open-source world. Of course, the cost of migration and reimplementation of warehouse function in the EHE may well exceed the savings of turning off that traditional RDBMS.
A Modern Data Architecture
The solution to the above conundrums is relatively straightforward, and was outlined as far back as 2013 in my book, “Business unIntelligence.” It recognises that there exist (at least) three broad types of data and information. Process-mediated data is the traditional transactional and reference data created by the operational and informational processes of the business. Machine-generated data relates strongly to the Internet of Things today, but also covers data from internal machines and sensors. Human-sourced information refers to social media and other textual and multimedia sources. I say at least three because the emergence of artificial intelligence in the interim or specific business directions may require the introduction of additional types. At a conceptual/logical architecture level, these data types are represented by pillars, each with its own role and value, and linked through context-setting information, a superset of traditional metadata, as shown in the accompanying figure.
In physical terms, each pillar may be implemented in the most appropriate technology, and in cloud, on-premises, or hybrid environments. A modern data architecture is thus technology agnostic and naturally virtualised. If you haven’t already moved your warehouse functionality to the EHE, this architecture doesn’t suggest that you should. In fact, in my view, process-mediated data that already exists mainly in traditional RDBMS environments is probably best left there in many circumstances. Of course, data provisioning and preparation—as well as data access and virtualisation—processes must all support multiple data storage environments.
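To make the idea of technology-agnostic pillars concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a reference implementation: the class names (`Pillar`, `Dataset`, `ContextCatalog`), the dataset names and the store labels are all hypothetical. The catalog stands in, very loosely, for context-setting information: a single index that spans all pillars, whatever technology each dataset physically lives in.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Pillar(Enum):
    """The three broad data types named in the article."""
    PROCESS_MEDIATED = "process-mediated data"
    MACHINE_GENERATED = "machine-generated data"
    HUMAN_SOURCED = "human-sourced information"


@dataclass(frozen=True)
class Dataset:
    name: str       # hypothetical dataset name
    pillar: Pillar  # conceptual pillar the dataset belongs to
    store: str      # hypothetical physical store, chosen on best fit


@dataclass
class ContextCatalog:
    """Loose stand-in for context-setting information: one index over all pillars."""
    datasets: Dict[str, Dataset] = field(default_factory=dict)

    def register(self, dataset: Dataset) -> None:
        # Record the dataset under its name, regardless of where it is stored.
        self.datasets[dataset.name] = dataset

    def by_pillar(self, pillar: Pillar) -> List[str]:
        # List dataset names in one pillar, sorted for stable output.
        return sorted(d.name for d in self.datasets.values() if d.pillar is pillar)

    def locate(self, name: str) -> str:
        # Resolve a logical dataset name to its physical store.
        return self.datasets[name].store


# Illustrative entries only; none of these names come from the article.
catalog = ContextCatalog()
catalog.register(Dataset("customer_orders", Pillar.PROCESS_MEDIATED, "on-premises RDBMS"))
catalog.register(Dataset("sensor_readings", Pillar.MACHINE_GENERATED, "cloud object store"))
catalog.register(Dataset("support_emails", Pillar.HUMAN_SOURCED, "cloud object store"))
```

In this sketch, users ask for data by logical name or pillar, and the physical store is a lookup detail. That separation is what would let process-mediated data stay in an existing RDBMS while other pillars live elsewhere.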
As has often been the case in the IT industry, far too much up-front attention has been paid to technology in the decade-old data warehouse vs. data lake contest. Technology choices should be driven by business needs and (so-called) non-functional requirements such as data quality, consistency and timeliness. Both business and non-functional requirements change over time, as do the abilities of different technologies to address them.
As we approach a new decade, these requirements are most succinctly summarised in the concept of digital transformation. The scope and complexity of this revolution in business, technology, and indeed the world at large require us to call a quick halt to the warehouse vs. lake prize fight and adopt the precepts of a modern data architecture.
I will be presenting the following sessions at the Enterprise Data and BI & Analytics Conference Europe and Data Ed Week in London, 18-22 November.
- Essentials of Data Warehouses, Lakes and BI in Digital Business
- From Analytics to AI: Transforming Decisions in Digital Business
- Data-Driven AI: Opportunities and Threats
Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper in 1988. With over 30 years of IT experience, including 20 years with IBM as a Distinguished Engineer, he is a widely respected analyst, consultant, lecturer and author of the seminal book, “Data Warehouse—from Architecture to Implementation” and numerous White Papers. His 2013 book, “Business unIntelligence—Insight and Innovation beyond Analytics and Big Data” is available in both hardcopy and e-book formats. As founder and principal of 9sight Consulting (www.9sight.com), Barry provides strategic consulting and thought-leadership to buyers and vendors of BI solutions. He is continuously developing new architectural models for all aspects of decision-making and action-taking support. Now returned to Europe, Barry’s knowledge and expertise are in demand both locally and internationally.
Copyright Dr. Barry Devlin, Founder and Principal, 9sight Consulting