Diving into the data lake
The data warehouse is an architecture of its era. When it was designed, and until the early 2000s, by far the main source of its data was the operational systems that managed the business processes of the enterprise. Such process mediated data (Devlin, 2013) was (and continues to be) defined, structured, and managed within the enterprise. As a result, it is generally well-governed and limited in scope and size. The data warehouse architecture is optimized for data with these characteristics.
Dr. Barry Devlin, Founder and Principal, 9sight Consulting
The Internet changed the playing field, possibly forever. By the early 2000s, new types of data were blossoming in ever-increasing volumes on the Internet and at its interface to the enterprise. Business saw opportunities bloom and threats multiply. Collecting and using this data became an obsession, harvesting from clickstreams, social media, and, more recently, the Internet of Things (IoT). Relational databases could not handle data at such size or speed. Up-front modeling had to be replaced by schema-on-read. The data warehouse was obsolete. Enter the data lake.
In a 2010 blog, James Dixon, then CTO of Pentaho, declared (Dixon, 2010): “If you think of a data mart as a store of bottled water—cleansed and packaged and structured for easy consumption—the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
Since then, data lakes have garnered widespread mindshare. Analysts, consultants, and vendors alike promote the concept. Surveys reveal that enterprises in every industry are implementing them, often declaring them to be a replacement for their existing data warehouses.
But, what is a data lake?
Given the watery metaphor, it may be unsurprising that the definition of a data lake has remained fluid in the eight years since its inception. Gartner’s definition (Gartner, 2018) is a case in point: “A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse).”
This definition of a data lake—and many more like it—offers little of substance on which to base a solid reference architecture describing mandatory functions, components, interactions, and so on. Architectures thus range from the all-inclusive to the poetic.
At the comprehensive end of the spectrum, IBM defines (Chessell et al., 2015) an architecture for a data reservoir (a less popular name for a data lake that suggests more engineering) consisting of six major subsystems: data reservoir repositories, enterprise IT interaction, information integration & governance, raw data interaction, catalog interfaces, and view-based interaction. More than 30 components are documented within these subsystems, as shown in Figure 3. The result is a system of such broad scope that it even includes IBM’s information warehouse.
In a less daunting and more illustrative take on the data lake, Bill Inmon offers: “The data lake needs to be divided into several sections, called data ponds. There is the: raw data pond, analog data pond, application data pond, textual data pond, and archival data pond… [all of which] require conditioning in order to make the data accessible and useful.” (Inmon, 2016)
Defining the shoreline of a data lake
The origins of the data lake concept can be traced back to the data flowing from Internet-related sources—at first a stream that grew rapidly to a torrent—into enterprises with the advent of the web. From the earliest days, it was evident that a place was needed to store this raw data and analyze it at detail and summary levels in support of business needs. It was also clear that the characteristics—usually described via the 3 Vs of volume, velocity, and variety—of such data made it incompatible with existing data architectures and the common storage and analytic technologies of the time.
Open source technology, such as Hadoop and associated systems, that was emerging in the mid-2000s displayed several characteristics that made it an ideal candidate to meet these storage and analytic needs. Horizontal scaling on commodity hardware offered voluminous storage and scalable processing at low cost. Because the systems were largely file-based, rather than databases, the data could be stored and processed in its raw form without the need for upfront modeling and design work. The programmatic and procedural approach to data processing (as opposed to the declarative approach at the heart of relational database technology and long favored by data management professionals) was attractive to the early adopters of the technology, because of their strong engineering backgrounds.
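The contrast between upfront modeling and processing raw files can be illustrated with a minimal Python sketch. It is not taken from any of the systems named above: the sample events, the field names, and the functions read_with_schema and page_counts are all hypothetical, chosen only to show raw records being stored untouched, a schema being applied at read time, and an aggregation written procedurally rather than declaratively.

```python
import json

# Hypothetical raw clickstream events, kept exactly as they arrived.
# Nothing was modeled or validated when they were stored.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ts": 1}',
    '{"user": "u2", "page": "/cart", "ts": 2, "referrer": "ad-17"}',
    'not valid json at all',
    '{"user": "u1", "page": "/cart", "ts": 3}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time by projecting each record onto `fields`.

    Malformed lines are skipped at read time instead of being rejected
    at load time; fields missing from a record come back as None.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {name: record.get(name) for name in fields}


def page_counts(lines):
    """Procedural aggregation: count views per page in an explicit loop."""
    counts = {}
    for row in read_with_schema(lines, ["page"]):
        counts[row["page"]] = counts.get(row["page"], 0) + 1
    return counts


if __name__ == "__main__":
    print(page_counts(RAW_EVENTS))
```

Two readers can apply different field lists to the same stored lines, which is the flexibility that schema-on-read offers. The cost is that every reader must cope with malformed or incomplete records on its own.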
Taken together, these requirements and technology characteristics clearly indicated the need for a new system in the IT environment. That system, in 2010, came to be included within the concept of James Dixon’s data lake. However, as defined, the data lake was not limited to this new Internet-related “big data”; rather, it was applied to all data of interest to analytic users, including data traditionally processed and made available through data marts and, by implication, the data warehouse.
At first sight, this may appear reasonable. The analytics needed for Internet-related data is similar to that required for traditional internally sourced data. Many applications demand that both types of data be linked together for analysis. Unfortunately, in my opinion, this focus on user needs misses key differences between these data sources. Internally sourced process mediated data is central to business operations and is therefore relatively well governed, modeled, and intricately interlinked. The data warehouse/marts system through which it was made available to businesspeople already existed and was optimized for such data characteristics. Internet-related data is messier, poorly described, often ancillary to core business processes, and comes from diverse and unrelated sources. Furthermore, although many proponents want to store it indefinitely (just in case it might be of use some time), it is generally most useful in the short term.
In architectural terms, these differences in usage and characteristics strongly suggest that these two classes of data should be stored and processed separately according to their differing needs before being made available, either separately or jointly, to businesspeople for analysis and decision making. The system for internally sourced data has existed for thirty years; the data warehouse and data marts have proven successful simply through their longevity. In my opinion, the data lake should have been reserved exclusively for Internet-related data. For clarity, in the remainder of this series I will use the term data lough (the Irish word for lake, pronounced like the Scottish loch) to refer to a data lake used exclusively for Internet-related data.
The key drivers of such a data lough are:
- To provide cost-effective storage for raw Internet-sourced data in large volumes and at speed
- To enable high-speed processing and ad hoc analysis of such data
- To support appropriate management/governance of this data commensurate with its value
- To offer a facility for refining, modeling, and summarizing this data and the ability to link it with process mediated data in the operational and data warehouse environments
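The last driver, refining and summarizing Internet-sourced data and then linking it with process mediated data, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the sample records, the customer_id key, and the functions summarize_clicks and link_to_customers are hypothetical and do not describe any particular product.

```python
from collections import Counter

# Hypothetical Internet-sourced events held in the data lough.
CLICKS = [
    {"customer_id": "c1", "page": "/home"},
    {"customer_id": "c1", "page": "/cart"},
    {"customer_id": "c2", "page": "/home"},
    {"customer_id": None, "page": "/home"},  # anonymous visitor
]

# Hypothetical process mediated data, as it might be held in the warehouse.
CUSTOMERS = {
    "c1": {"name": "Acme Ltd", "segment": "enterprise"},
    "c2": {"name": "Bolt GmbH", "segment": "smb"},
}


def summarize_clicks(clicks):
    """Refine and summarize: drop anonymous events, count views per customer."""
    counts = Counter(
        event["customer_id"]
        for event in clicks
        if event["customer_id"] is not None
    )
    return dict(counts)


def link_to_customers(click_summary, customers):
    """Link the summary to warehouse records on the shared customer key."""
    linked = []
    for customer_id, views in sorted(click_summary.items()):
        master = customers.get(customer_id)
        if master is None:
            continue  # no matching warehouse record, so nothing to link
        linked.append(
            {
                "customer_id": customer_id,
                "segment": master["segment"],
                "page_views": views,
            }
        )
    return linked


if __name__ == "__main__":
    print(link_to_customers(summarize_clicks(CLICKS), CUSTOMERS))
```

The sketch keeps the two classes of data in separate structures and joins them only at the end, on a key that the well-governed side defines.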
Although the prime purpose is to support Internet-related data, it is recognized that these attributes may also benefit traditional data and systems. For example, offloading seldom-used, cold data from a data warehouse to a data lake can provide improved return on investment on both systems. Similarly, running data preparation tasks in the data lake may be beneficial.
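As a rough illustration of the offloading idea, the following Python sketch splits warehouse rows into hot and cold sets by last access date. The one-year threshold, the last_accessed field, and the function split_hot_cold are assumptions made for the example; a real policy would depend on the workload and on the systems involved.

```python
from datetime import date, timedelta


def split_hot_cold(rows, today, max_idle_days=365):
    """Partition rows into (hot, cold) by how recently they were accessed.

    A row is cold when its last access is strictly older than the cutoff.
    The 365-day default is an illustrative assumption, not a recommendation.
    """
    cutoff = today - timedelta(days=max_idle_days)
    hot, cold = [], []
    for row in rows:
        if row["last_accessed"] < cutoff:
            cold.append(row)  # candidate for offloading to the data lake
        else:
            hot.append(row)   # stays in the data warehouse
    return hot, cold


if __name__ == "__main__":
    # Hypothetical warehouse rows with an access timestamp.
    rows = [
        {"id": 1, "last_accessed": date(2019, 12, 1)},
        {"id": 2, "last_accessed": date(2018, 6, 1)},
    ]
    hot, cold = split_hot_cold(rows, today=date(2020, 1, 1))
    print([r["id"] for r in hot], [r["id"] for r in cold])
```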
In Part 3, we’ll look to the future.
Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing, having published the first architectural paper in 1988. With over 30 years of IT experience, including 20 years with IBM as a Distinguished Engineer, he is a widely respected analyst, consultant, lecturer and author of the seminal book, “Data Warehouse—from Architecture to Implementation” and numerous White Papers. His 2013 book, “Business unIntelligence—Insight and Innovation beyond Analytics and Big Data” is available in both hardcopy and e-book formats. As founder and principal of 9sight Consulting (www.9sight.com), Barry provides strategic consulting and thought-leadership to buyers and vendors of BI solutions. He is continuously developing new architectural models for all aspects of decision-making and action-taking support. Now returned to Europe, Barry’s knowledge and expertise are in demand both locally and internationally.
Copyright Dr. Barry Devlin, Founder and Principal, 9sight Consulting
Chessell, M., et al., “Designing and Operating a Data Reservoir”, IBM ITSO, 2015, https://www.redbooks.ibm.com/abstracts/sg248274.html
Devlin, B., “Business unIntelligence”, Technics Publications, Basking Ridge, NJ, 2013, http://bit.ly/BunI-TP2
Dixon, J., “Pentaho, Hadoop, and Data Lakes”, Blog, 2010, http://bit.ly/2BHwplU
Gartner IT Glossary, retrieved 2 February 2018, https://www.gartner.com/it-glossary/data-lake/
Inmon, B., “Data Lake Architecture: Designing the Data Lake and Avoiding the Garbage Dump”, Technics Publications, Basking Ridge, NJ, 2016