Increasing numbers of IT shops are considering, and in some cases doing, "Hadoop in the enterprise." That is, they are more or less implementing the file systems and data-access mechanisms (Hadoop, MapReduce) of a cloud provider, and using them to do similar Big Data analytics -- but with the data processing done inside of IT, often in a private cloud.
However, it is not clear from the descriptions just what advantages these users gain from moving Hadoop-related analytics inside the enterprise. It is as if the word "Hadoop" has become a meaningless buzz word, meant to indicate that one is fully current on the latest trends in Big Data analytics.
To understand just what advantages one might gain from "Hadoop in the enterprise," it is important for IT users to understand just what Hadoop and its related software is -- and is not. Here is a somewhat brief version of the story.
Hadoop's Place in Computing Universe
The best way to understand the place of Hadoop in the computing universe is to view the history of data processing as a constant battle between parallelism and concurrency. Think of the database as a data store plus a protective layer of software that is constantly being bombarded by transactions -- and often, another transaction on a piece of data arrives before the first is finished.
To handle all the transactions, databases have two choices at each stage in computation: parallelism, in which two transactions are literally being processed at the same time, and concurrency, in which a processor switches between the two rapidly in the middle of the transaction. Pure parallelism is obviously faster. To avoid inconsistencies in the results of the transaction, though, you often need coordinating software, and that coordinating software is hard to operate in parallel because it involves frequent communication between parallel "threads" of the two transactions.
At a global level (like that of the Internet), the choice now translates into a choice between "distributed" and "scale-up" single-system processing. There are two key factors that are relevant here: data locality and number of connections used -- which means that you can get away with parallelism if, say, you can operate on a small chunk of the overall data store on each node, and if you don't have to coordinate too many nodes at one time.
Enter the problems of cost and scalability. The server farms that grew like Topsy during Web 1.0 had hundreds and thousands of PC-like servers that were set up to handle transactions in parallel. This had obvious cost advantages, since PCs were far cheaper; but data locality was a problem in trying to scale, since even when data was partitioned correctly in the beginning between clusters of PCs, over time data copies and data links proliferated, requiring more and more coordination.
Meanwhile, in the high performance computing (HPC) area, grids of PC-type small machines operating in parallel found that scaling required all sorts of caching and coordination "tricks," even when, by choosing the transaction type carefully, the user could minimize the need for coordination.
For certain problems, however, relational databases designed for "scale-up" systems and structured data did even less well. For indexing and serving massive amounts of rich-text (text plus graphics, audio and video) data like Facebook pages, for streaming media, and of course for HPC, a relational database would insist on careful consistency between data copies in a distributed configuration, and so could not squeeze the last ounce of parallelism out of these transaction streams.
So to squeeze costs to a minimum, and to maximize the parallelism of these types of transactions, Google, the open source movement and various others turned to Hadoop (for file-system access) on top of MapReduce (for more basic file-system access), and various other non-relational approaches.
These efforts combined open source software, typically related to Apache, large amounts of small or PC-type servers and a loosening of consistency constraints on the distributed transactions -- an approach called eventual consistency. The basic idea was to minimize coordination by identifying types of transactions where it didn't matter if some users got "old" rather than the latest data, or it didn't matter if some users got an answer but others didn't. The result continues to be much more frequent unexpected unavailability “interruptions."
The most popular spearhead of Big Data appears to be Hadoop. As noted, it provides a distributed file system "veneer" to MapReduce for data-intensive applications (including Hadoop Common that divides nodes into a master coordinator and slave task executors for file-data access, and Hadoop Distributed File System [HDFS] for clustering multiple machines), and therefore allows parallel scaling of transactions against rich-text data such as some social media data. It operates by dividing a task into sub-tasks that it hands out redundantly to back-end servers, which all operate in parallel (conceptually, at least) on a common data store.
As it turns out, there are limits even on Hadoop's eventual-consistency type of parallelism.
In particular, it now appears that the metadata that supports recombination of the results of sub-tasks must itself be federated across multiple nodes, for both availability and scalability purposes. And using multiple-core "scale-up" nodes for the sub-tasks improves performance compared to proliferating yet more distributed single-processor PC servers.
In other words, the most scalable system, even in Big Data territory, is one that combines strict and eventual consistency, parallelism and concurrency, distributed and scale-up single-system architectures, and NoSQL and relational technology.
What Do We Gain with Hadoop? What Do We Risk?
Fundamentally, then, Hadoop in the enterprise, as in the cloud, allows scaling to analytics on yet more amounts of (usually unstructured) data. That's it; that's what we gain. And note that compared to the multiple sources of Big Data out there now (multiple cloud providers), Hadoop in the enterprise offers only the ability to combine the data from these sources.
However, Hadoop in the enterprise comes with no automatic way to sync with existing analytics. It has no automatic way to buffer the enterprise from unexpected outages. It provides no way to distinguish between the data that should go into a Hadoop file system and the data that should go into a data warehouse, because cloud providers assume that their data should be in a Hadoop file system.
Some of that is provided by vendor solutions, but most of it will need to be done by IT itself.And that's a massive job, properly done. If not properly done, it exposes the organization to risks of unexpected and badly handled crashes, corruption of the data warehouse, misanalysis based on flawed "raw" data, and poorly coordinated parallel data-warehouse and Hadoop analytics. Implementer beware.
I conclude that you should not over-estimate, nor over-promise, what "Hadoop in the enterprise" can do. It may give you valuable training in analyzing cloud data, and implementing apps based on Hadoop-type file systems. It may allow the organization to run a "second scan" of cloud Big Data before it moves into the data warehouse, to allow some weeding out of less valid analyses. And, as noted, it may allow pre-combination of data from multiple cloud providers -- although data virtualization solutions give you that and more besides.
These are typically not major advantages, IMHO.
Unless you are very confident in your ability to handle the risks I have listed above, I would not over-promise what "Hadoop in the enterprise" can do. Let's face it: In many cases, it really doesn't give you much more than an effectively-designed use of Hadoop in the cloud.
Wayne Kernochan is the president of Infostructure Associates, an affiliate of Valley View Ventures that aims to identify ways for businesses to leverage information for innovation and competitive advantage. Wayne has been an IT industry analyst for 22 years. During that time, he has focused on analytics, databases, development tools and middleware, and ways to measure their effectiveness, such as TCO, ROI, and agility measures. He has worked for respected firms such as Yankee Group, Aberdeen Group and Illuminata, and has helped craft marketing strategies based on competitive intelligence for vendors ranging from Progress Software to IBM.