So far, the real value of the data lake concept is that it prompts IT to rethink an information architecture that is often centered on a data warehouse. I believe the other major benefits of a "data lake" will take a long time to arrive, if they arrive at all. Instead, IT should focus its data lake efforts on rethinking its information architecture to create what I call an "information reservoir."
There are, I believe, three basic definitions of "data lake" out there:
- A carefully limited definition by James Dixon, Pentaho's CTO, who apparently coined the data lake term, that seeks to expand analytics to data "semantics" ignored by data warehouse schemas, and sets up a single-data-source in-enterprise "sandbox" to allow experimentation with this type of analytics.
- An attempt by Hadoop vendors and in-enterprise users to find enterprise uses for Hadoop other than downloading Hadoop files from a cloud for preliminary mining before they are merged into a data warehouse.
- A vacuous marketing slogan. It's new! It's innovative! It's whatever you want it to be!
Let's consider each of these in turn.
Raw Data Analytics and Dixon's Data Lake
Here are the key characteristics of a data lake, as Dixon seems to outline them:
- The data comes from a single application or system (because most companies have only one data source that meets the other "data lake" criteria);
- The data items are usually not equivalent to a single "transaction," so analytics that treats them as such can miss a lot of the data "semantics";
- Answering as-yet-undefined questions (e.g., ad-hoc iterative analysis) is a key use case;
- The ETL steps that precede insertion in a data mart/warehouse "lose" important metadata such as the exact times when a buyer takes each step in an online buying process, so analytics via a data warehouse will preclude this type of ad-hoc analysis.
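To make the last point concrete, here is a minimal sketch of how a warehouse-style load can discard step timings that the raw events still hold. The event data, the step names and both function names are my own illustrative assumptions, not anything taken from Dixon or Pentaho.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw clickstream events: (buyer_id, step, ISO timestamp).
RAW_EVENTS = [
    ("b1", "view_product", "2015-01-05T10:00:00"),
    ("b1", "add_to_cart", "2015-01-05T10:02:30"),
    ("b1", "checkout", "2015-01-05T10:20:00"),
    ("b2", "view_product", "2015-01-05T11:00:00"),
    ("b2", "add_to_cart", "2015-01-05T11:01:00"),
]


def etl_to_warehouse_rows(events):
    """Warehouse-style load: one row per buyer, keeping only the purchase flag."""
    purchased = defaultdict(bool)
    for buyer, step, _timestamp in events:
        # The timestamp is dropped here; only the outcome survives.
        purchased[buyer] = purchased[buyer] or step == "checkout"
    return [{"buyer_id": b, "purchased": p} for b, p in sorted(purchased.items())]


def seconds_between_steps(events):
    """Ad-hoc question that needs raw events: time between consecutive steps."""
    by_buyer = defaultdict(list)
    for buyer, step, timestamp in events:
        by_buyer[buyer].append((datetime.fromisoformat(timestamp), step))
    gaps = {}
    for buyer, steps in by_buyer.items():
        steps.sort()  # order each buyer's steps by time
        for (t0, s0), (t1, s1) in zip(steps, steps[1:]):
            gaps[(buyer, s0, s1)] = (t1 - t0).total_seconds()
    return gaps


warehouse_rows = etl_to_warehouse_rows(RAW_EVENTS)
step_gaps = seconds_between_steps(RAW_EVENTS)
```

The warehouse rows can still say who bought, but a question such as "how long did buyer b1 hesitate between cart and checkout?" can only be answered from the raw events.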
Much of the original Pentaho user feedback that led to the data lake idea came from in-enterprise Hadoop users, so it is not surprising that many supporters automatically turn to Hadoop implementations somewhat similar to initial in-enterprise Hadoop experiments. Thus, in explanations of how to implement a data lake, a pundit may talk mainly about setting up a Hadoop data-access scheme for a given data source, treating things such as data governance as afterthoughts, easily handled by open-source software.
Strictly speaking, a relational database may indeed be used for data lake analytics, but there must be no "aggregation" of relational data items that removes information from the raw data (e.g., the logs that time-stamp the steps in the online buying process).
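As a rough illustration of that point, the sketch below keeps the raw, time-stamped log in a relational table with no aggregation at load time, and then answers a timing question with plain SQL. It uses Python's built-in sqlite3 module; the table name, column names and sample rows are hypothetical.

```python
import sqlite3

# In-memory relational store holding the raw, un-aggregated event log.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buying_events (buyer_id TEXT, step TEXT, event_time TEXT)"
)
conn.executemany(
    "INSERT INTO buying_events VALUES (?, ?, ?)",
    [
        ("b1", "add_to_cart", "2015-01-05 10:02:30"),
        ("b1", "checkout", "2015-01-05 10:20:00"),
        ("b2", "add_to_cart", "2015-01-05 11:01:00"),
        ("b2", "checkout", "2015-01-05 11:05:00"),
        ("b3", "add_to_cart", "2015-01-05 12:00:00"),  # never checks out
    ],
)


def seconds_from_cart_to_checkout(connection):
    """Self-join the raw log to measure cart-to-checkout time per buyer."""
    rows = connection.execute(
        """
        SELECT cart.buyer_id,
               CAST(strftime('%s', done.event_time) AS INTEGER)
             - CAST(strftime('%s', cart.event_time) AS INTEGER)
        FROM buying_events AS cart
        JOIN buying_events AS done ON done.buyer_id = cart.buyer_id
        WHERE cart.step = 'add_to_cart' AND done.step = 'checkout'
        ORDER BY cart.buyer_id
        """
    ).fetchall()
    return dict(rows)


cart_to_checkout = seconds_from_cart_to_checkout(conn)
```

Because every logged row is retained, the question did not have to be anticipated when the table was designed. A table that stored only one summary row per order could not answer it.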
Data Lake as Marketing Term
Then there’s the EMC product. It appears that EMC is labeling as a "data lake" a version of NAS (network-attached storage) with some support for particular approaches to running a data lake. To put it another way, if you implement only the EMC product, you have no data in the data lake and no software tool to do analytics on that data.
We have seen this type of marketing before, notably when companies of all stripes labeled as "agile" middleware that included no development tools or obvious support for an agile development process. It is important for users, in these cases, to understand that typically the product so mislabeled can indeed form a useful part of a larger solution, but it rarely plays a key role. The user must consider many other parts of the solution before the real benefits of that solution are seen.
Is Data Lake Water Drinkable?
The real problem I have with data lakes is that, according to Dixon's definition, the "raw data" is not "cleansed" -- meaning there are no checks for accuracy of data input or consistency with other data or other copies of the same data.
It may superficially seem to Hadoop implementers that the ERP data they are handling is high quality, and in some cases it may be. However, grown-like-Topsy enterprise apps often do not provide high-quality data that is consistent with, say, buyer data entered in the sales process app or the help desk app. Before you drink from (perform analytics on) a data lake, you had better make sure that it's drinkable water (data allowing for accurate analytics). The fact that the data lake does not treat cleansing as a high priority raises serious questions about the value of analytics insights from data lakes.
Yes, Hadoop in the cloud takes the risk of allowing analytics on lower-quality data -- but only because we need the answer right away and existing relational database cleansing and ACID properties do not scale well enough. No such justification exists for the data lake's single-application in-enterprise analytics.
And so, using Dixon’s original metaphor, we don't need a "data lake;" we need a "data reservoir" that ensures a "drinkable" data source by enforcing accuracy and consistency.
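What might "enforcing accuracy and consistency" look like in practice? The following is only a sketch under my own assumptions: two hypothetical in-memory extracts of buyer records, one from a sales app and one from a help desk app, compared field by field after trivial normalization. Real cleansing tools do far more than this.

```python
# Hypothetical buyer records from two enterprise apps, keyed by buyer id.
SALES_APP = {
    "b1": {"email": "ann@example.com", "country": "US"},
    "b2": {"email": "bob@example.com", "country": "DE"},
    "b3": {"email": "cy@example.com", "country": "FR"},
}
HELP_DESK_APP = {
    "b1": {"email": " Ann@Example.com", "country": "US"},
    "b2": {"email": "bob@example.org", "country": "DE"},
}


def normalize(value):
    """Ignore case and surrounding whitespace when comparing values."""
    return value.strip().lower()


def find_inconsistencies(primary, secondary):
    """List (buyer_id, field, primary_value, secondary_value) mismatches."""
    issues = []
    for buyer_id in sorted(primary.keys() & secondary.keys()):
        shared_fields = primary[buyer_id].keys() & secondary[buyer_id].keys()
        for field in sorted(shared_fields):
            left = primary[buyer_id][field]
            right = secondary[buyer_id][field]
            if normalize(left) != normalize(right):
                issues.append((buyer_id, field, left, right))
    return issues


def missing_from(primary, secondary):
    """Buyer ids present in the primary app but absent from the secondary one."""
    return sorted(primary.keys() - secondary.keys())


inconsistencies = find_inconsistencies(SALES_APP, HELP_DESK_APP)
unmatched_buyers = missing_from(SALES_APP, HELP_DESK_APP)
```

Even this toy check surfaces one conflicting email address and one buyer that the help desk app has never heard of, which is exactly the kind of problem that uncleansed "raw data" carries into analytics.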
Data Lake and Dynamic Metadata
In one respect, I find the data lake entirely admirable: It argues that data warehousing constrains our ability to do deeper analytics because it pre-defines what's important via past usage, rigid metadata and exclusion of some data as irrelevant. It then suggests that there should be a place, operating in parallel with data warehouse analytics, in which both data and metadata are much less rigidly defined and much quicker to change.
In fact, some of my past research suggests that perhaps one-fifth of the "cost" of bad data to the enterprise is precisely this kind of delay in picking up trends in the outside world and incorporating them in existing metadata and analytics tools. Part of this can be handled by data virtualization solutions that do metadata discovery -- but this simply retrofits new trends into an existing straitjacket. Rather, we need a solution that empowers data scientists to drive the identification of valuable new data and metadata as they perform open-ended analytics that enables deeper analysis.
Indeed, Dixon's example of how data warehousing falls short in sequencing the buying process is simply one instance of how today's analytics relies too heavily on statistics, without considering causality or models that explain the sequencing of the data. Say sales of radios go up as sales of refrigerators do. Which causes which? The answer is neither: the two are correlated, but neither causes the other.
Only analytics that approaches the data with an open mind, careful attention to time sequencing and "experimental" metadata can distinguish this type of spurious correlation from real causes driving buyer behavior.
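One simple way to see the radio and refrigerator point is to control for a third factor. In the sketch below, all numbers are made up for illustration, and the shared driver (a hypothetical "income index") is my assumption rather than anything in Dixon's example. The code computes the ordinary Pearson correlation and then the first-order partial correlation with the shared driver held fixed.

```python
from math import sqrt

# Made-up monthly figures. Both sales series are built from the same
# hypothetical driver (an income index) plus unrelated noise.
INCOME_INDEX = [1, 2, 3, 4, 5, 6, 7, 8]
RADIO_SALES = [3, 3, 5, 9, 11, 11, 13, 17]
FRIDGE_SALES = [4, 7, 8, 11, 14, 17, 22, 25]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


raw_r = pearson(RADIO_SALES, FRIDGE_SALES)
controlled_r = partial_correlation(RADIO_SALES, FRIDGE_SALES, INCOME_INDEX)
```

With this toy data the raw correlation is strong, while the partial correlation is essentially zero once the shared driver is accounted for. Real buyer data will rarely be this tidy, and finding the right third factor is precisely the kind of open-ended work that needs flexible data and metadata.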
My recommendation to users about data lakes, then, is to seek to set up a data reservoir instead. Treat the reservoir as an experimental start to a new in-parallel-with-the-data-warehouse dynamic-metadata system aimed at this kind of open-ended data research. Above all, let the data scientist be the driver of this part of the information architecture, not the Hadoop specialist.
There's a world of promise in this rethinking of the enterprise's information architecture. However, if we get stuck on the "data lake vs. no data lake" debate, we'll never get there.
Wayne Kernochan is the president of Infostructure Associates, an affiliate of Valley View Ventures that aims to identify ways for businesses to leverage information for innovation and competitive advantage. An IT industry analyst for 22 years, he has focused on analytics, databases, development tools and middleware, and ways to measure their effectiveness, such as TCO, ROI and agility measures. He has worked for respected firms such as Yankee Group, Aberdeen Group and Illuminata, and has helped craft marketing strategies based on competitive intelligence for vendors ranging from Progress Software to IBM.