Big Data, MapReduce, Hadoop, NoSQL: The Relational Technology Behind the Curtain

by Wayne Kernochan

Hadoop has its limits for Big Data analytics – and relational databases may have more of a future than you think.

One of the more interesting features of vendors' recent marketing push to sell business intelligence and analytics is the emphasis on the notion of Big Data, often associated with NoSQL, Google MapReduce, and Apache Hadoop, but without a clear explanation of what these are, and where they are useful. It is as if we were back in the days of "checklist marketing," where the aim of a vendor like IBM or Oracle was to convince you that if competitors' products didn't support a long list of features, those competitors would not provide you with the cradle-to-grave support you needed to survive computing's fast-moving technology.

As it turned out, many of those features were unnecessary in the short run and a waste of money in the long run; remember rules-based AI? Or so-called standard Unix? The technology behind those features was later used quite effectively in other, more valuable pieces of software, but the value-add of the feature itself turned out to be illusory.

As it turns out, we are not back in those days, and Big Data via Hadoop and NoSQL does indeed have a part to play in scaling Web data. However, I find that IT buyer misunderstandings of these concepts may indeed lead to much wasted money, not to mention serious downtime. These misunderstandings stem from a common source: marketing's failure to explain how Big Data relates to the relational databases that have fueled almost all data analysis and data-management scaling for the last 25 years.

It resembles the scene in The Wizard of Oz in which a small man, trying to pass himself off as a powerful wizard by manipulating stage machinery from behind a curtain, becomes so wrapped up in the production that when someone notes "There's a man behind the curtain," he shouts, "Pay no attention to the man behind the curtain!" In this case, marketers are so busy shouting about the virtues of Big Data and its new data management tools and "NoSQL" that they fail to note the extent to which relational technology is complementary to, necessary to, or simply the basis of the new features.

So here is my understanding of the present state of the art in Big Data, and the ways in which IT buyers should and should not seek to use it as an extension of their present (relational) business intelligence and information management capabilities. As it turns out, when we understand both the relational technology behind the curtain and the ways it has been extended, we can do a much better job of applying Big Data to long-term IT tasks.


The best way to understand the place of Hadoop in the computing universe is to view the history of data processing as a constant battle between parallelism and concurrency. Think of the database as a data store plus a protective layer of software that is constantly being bombarded by transactions – and often, another transaction on a piece of data arrives before the first is finished. To handle all the transactions, databases have two choices at each stage in computation: parallelism, in which two transactions are literally being processed at the same time; and concurrency, in which a processor switches between the two rapidly in the middle of the transaction.

Pure parallelism is obviously faster, but to avoid inconsistencies in the results of the transaction, you often need coordinating software, and that coordinating software is hard to operate in parallel, because it involves frequent communication between the parallel "threads" of the two transactions. At a global level (like that of the Internet), the choice now translates into a choice between "distributed" and "scale-up" single-system processing.
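The trade-off above can be sketched in a few lines. This is a toy illustration (the names are hypothetical, not from any real database): two strategies for concurrent transactions on a shared counter, where coordination via a lock keeps the result consistent at the cost of serializing the critical section.

```python
# Toy contrast of coordinated vs. uncoordinated updates to shared state.
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

balance = 0
lock = Lock()

def transaction(amount, coordinated=True):
    global balance
    if coordinated:
        with lock:            # coordination: consistent, but the critical
            balance += amount # section runs one transaction at a time
    else:
        balance += amount     # "pure parallelism": no coordination cost,
                              # but interleaved updates can be lost

with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(1000):
        pool.submit(transaction, 1)

print(balance)  # with the lock, always 1000
```

With `coordinated=False` the program may finish faster but can silently drop updates, which is exactly why coordinating software becomes the bottleneck as parallelism grows.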

As it happens, back in graduate school I did a calculation of the relative performance merits of tree networks of microcomputers versus machines with a fixed number of parallel processors, which provided some general rules that are still applicable. There are two key factors that are relevant here: "data locality" and "number of connections used." This means that you can get away with parallelism if, say, you can operate on a small chunk of the overall data stored on each node, and if you don't have to coordinate too many nodes at one time.

Enter the problems of cost and scalability. The server farms that grew like Topsy during Web 1.0 had hundreds or thousands of scale-out servers that were set up to handle transactions in parallel. This had obvious cost advantages, since PCs were far cheaper. But data locality was a serious problem in trying to scale, since even when data was partitioned correctly in the beginning between clusters of PCs, over time data copies and data links proliferated, requiring more and more coordination. Meanwhile, in the high performance computing (HPC) area, grids of PC-type small machines operating in parallel found that scaling required all sorts of caching and coordination "tricks," even when, by choosing the transaction type carefully, the user could minimize the need for coordination.

In certain instances, however, relational databases designed for "scale-up" systems and structured data fared even worse. For indexing and serving massive amounts of "rich text" (text plus graphics, audio and video), for streaming media, and of course for HPC, a relational database would insist on careful consistency between data copies in a distributed configuration and so could not squeeze the last ounce of parallelism out of these transaction streams. And so, to minimize costs and to maximize the parallelism of these types of transactions, Google, the open source movement, and various others turned to MapReduce, Hadoop, and other non-relational approaches.

These efforts combined open-source software (typically related to Apache), large numbers of small, PC-type servers, and a loosening of consistency constraints on the distributed transactions (an approach called eventual consistency). The basic idea was to minimize coordination by identifying types of transactions where it didn't matter if some users got "old data" rather than the latest data, or if some users got an answer while others didn't.
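A toy model makes the "old data" point concrete. This is a sketch under simplifying assumptions (the replica and sync names are invented, not any real store's API): a write is acknowledged by one replica immediately and propagates to the other only when a deferred sync runs, so a reader hitting the lagging replica sees stale data until the replicas converge.

```python
# Minimal model of eventual consistency: acknowledged writes propagate
# to other replicas only when a deferred "anti-entropy" sync runs.
class Replica:
    def __init__(self):
        self.data = {}

primary, secondary = Replica(), Replica()

def write(key, value):
    primary.data[key] = value            # acknowledged before replication

def read(replica, key):
    return replica.data.get(key)         # no cross-replica coordination

def sync():
    secondary.data.update(primary.data)  # deferred propagation

write("user:1", "new profile")
stale = read(secondary, "user:1")        # None: not yet propagated
sync()
fresh = read(secondary, "user:1")        # "new profile" after convergence
```

Accepting that window of staleness is precisely what lets each replica serve reads in parallel without waiting on the others.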

How well does this approach work? A recent communication from Pervasive Software (about an upcoming conference) noted a study of one implementation which found 60 instances of unexpected unavailability "interruptions" in 500 days. This is certainly not up to the standards of the typical business-critical operational database, but is also not an overriding concern to today's Hadoop users.

The eventual consistency part of this overall effort has sometimes been called NoSQL. However, Wikipedia notes that in fact it might correctly be called NoREL, meaning "for situations where relational is not appropriate." In other words, Hadoop and the like by no means exclude all relational technology, and many of them concede that relational "scale-up" databases are more appropriate in some cases even within the broad overall category of Big Data (i.e., rich-text Web data and HPC data). Indeed, some implementations provide extended-SQL or SQL-like interfaces to these non-relational databases.
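Those SQL-like interfaces matter because they let analysts express distributed aggregations declaratively. As a hedged stand-in (using Python's built-in sqlite3 rather than any actual Hadoop-layer engine such as Hive), here is the kind of grouped count such an interface expresses in one statement:

```python
# sqlite3 stands in for a SQL-like layer over a non-relational store:
# the engine, not the programmer, decides how to partition the work.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("big",), ("data",), ("big",)])

rows = conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
).fetchall()
# rows == [('big', 2), ('data', 1)]
```

The declarative form is one reason relational-style access keeps reappearing on top of "NoSQL" stores: the query says what to compute, leaving the system free to parallelize how.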


Wayne Kernochan of Infostructure Associates has been an IT industry analyst focused on infrastructure software for more than 20 years.

Where Are the Boundaries of Big Data?

The most popular "spearhead" of Big Data right now appears to be Hadoop. As noted, it provides a distributed file system "veneer" over MapReduce for data-intensive applications – Hadoop Common supplies the shared utilities, while the Hadoop Distributed File System (HDFS) clusters multiple machines and divides nodes into a master coordinator and slave executors for file-data access – and therefore allows parallel scaling of transactions against rich-text data such as some social media data. Hadoop operates by dividing a "task" into "sub-tasks" that it hands out redundantly to back-end servers, which all operate in parallel (conceptually, at least) on a common data store.
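The task/sub-task split described above is the MapReduce pattern, which can be sketched in-process (a single-machine stand-in, with invented helper names, for what Hadoop distributes across master and worker nodes): map emits key-value pairs per input chunk, a shuffle groups pairs by key, and reduce combines each group.

```python
# In-process sketch of the MapReduce pattern: map, shuffle, reduce.
from collections import defaultdict

def map_phase(chunk):
    # Each "sub-task" emits (word, 1) for every word in its chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Group emitted pairs by key, as Hadoop's shuffle stage does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big", "data hadoop"]   # one chunk per (conceptual) node
mapped = [pair for c in chunks for pair in map_phase(c)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 2, 'hadoop': 1}
```

Because each map call touches only its own chunk, the map phase parallelizes freely; the coordination cost is concentrated in the shuffle – which is exactly where, as the next paragraph notes, the scalability limits appear.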

As it turns out, there are limits even to Hadoop's eventual-consistency type of parallelism. In particular, it now appears that the metadata which supports recombination of the results of "sub-tasks" must itself be "federated" across multiple nodes for both availability and scalability purposes. In fact, Pervasive Software notes that its own investigations show that using multiple-core "scale-up" nodes for the sub-tasks improves performance compared to proliferating yet more distributed single-processor scale-out servers. In other words, the most scalable system, even in Big Data territory, is one that combines strict and eventual consistency, parallelism and concurrency, distributed and scale-up single-system architectures, and NoSQL and relational technologies.

Solutions like Hadoop are effectively out there "in the cloud" and therefore outside the usual walls of enterprise data centers. Thus, there are fixed and probably permanent physical and organizational boundaries between IT's data stores and those serviced by Hadoop. Moreover, it should be apparent from the above that existing business intelligence and analytics systems will not suddenly convert to Hadoop files and access mechanisms, nor will "mini-Hadoops" suddenly spring up inside the corporate firewall and create havoc with enterprise data governance. The use cases are simply too different.

The remaining boundaries – the ones that should matter to IT buyers – are those between existing relational business intelligence and analytics databases and data stores and Hadoop's file system and files. And here is where "eventual consistency" really matters. The enterprise cannot treat this data as just another business intelligence data source. It differs fundamentally in that the enterprise can be far less sure that the data is current – or even available at all times. So scheduled reporting or business-critical computing based on this data is much more difficult to pull off.

On the other hand, this is data that would otherwise be unavailable for BI or analytics processes – and because of the low-cost approach to building the solution, should be exceptionally low-cost to access. However, pointing the raw data at existing business intelligence tools would be like pointing a fire hose at your mouth, with similarly painful results. Instead, the savvy IT organization will have plans in place to filter the data before it begins to access it.
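What would such a pre-filter look like? As a minimal sketch (the record shape and field names are hypothetical, purely for illustration), the idea is to discard unusable records and strip heavy fields before the data ever reaches a BI tool:

```python
# Hypothetical pre-filter: keep only records with usable content and
# drop heavy media payloads before loading into a BI pipeline.
raw_records = [
    {"user": "a", "text": "likes product", "media": "<video bytes>"},
    {"user": "b", "text": "", "media": "<video bytes>"},
]

def pre_filter(records, required_field="text"):
    return [
        {"user": r["user"], "text": r[required_field]}
        for r in records
        if r.get(required_field)      # drop records with no usable content
    ]

filtered = pre_filter(raw_records)
# filtered == [{"user": "a", "text": "likes product"}]
```

The specific rules will vary by use case; the point is that filtering happens upstream, so the fire hose never reaches the reporting layer.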

The Long-Run Bottom Line

The impression given by some marketers is that Hadoop and its ilk are required for Big Data, where Big Data is more broadly defined as most Web-based semi-structured and unstructured data. If that is your impression, I believe it to be untrue. Instead, handling Big Data is likely to require a careful mix of relational and non-relational, data-center and extra-enterprise business intelligence, with relational in-enterprise BI taking the lead role. And as the limits to parallel scalability of Hadoop and the like become more and more evident, the use of SQL-like interfaces and relational databases within Big Data use cases will become more frequent, not less.

Therefore, I believe that Hadoop and its brand of Big Data will always remain a useful but not business-critical adjunct to an overall business intelligence and information management strategy. Users should anticipate that it will take its place alongside relational access to other types of Big Data, and that the key to IT success in Big Data BI will be in intermixing the two in the proper proportions, and with the proper security mechanisms. Hadoop, MapReduce, NoSQL, and Big Data are all useful – but only if you pay attention to the relational technology behind the curtain.


This article was originally published on Thursday Oct 20th 2011