For several years, Hadoop-based alternatives to traditional SQL databases have been the focus of user attention. After all, they are inexpensive or open source, well suited to public cloud environments and, in many cases, scalable to larger amounts of Big Data than the old relational behemoths.
However, to achieve these feats, NoSQL data processors chose different approaches to data management than relational/SQL databases – approaches that leave them significantly more limited in key use cases. Three such limits stand out:
- Frequent lack of data consistency
- Comparative lack of "deep" analytics support
- Lack of high-availability features for private and hybrid clouds
Lack of Data Consistency
Hadoop and MapReduce, SQL-on-Hadoop layers such as Hive, and NoSQL databases such as MongoDB were designed to scale beyond SQL databases in public cloud scenarios where large amounts of data arrive rapidly. To do this, many of them adopted "eventual consistency," in which the "commit" operations that guarantee data consistency can be deferred rather than hold up transactions. During these delays, some reads will return stale or inconsistent data – but users accepted this as a small price to pay for additional insight into their data.
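The trade-off can be seen in a toy model of eventual consistency: writes are acknowledged by a primary immediately and replicated to a secondary only after a delay, so a read from the secondary in the meantime returns stale data. This is an illustrative sketch, not any particular product's replication code; all class and method names are invented for the example.

```python
import threading

class EventuallyConsistentStore:
    """Toy two-replica store: writes ack from the primary at once,
    replication to the secondary lags behind."""

    def __init__(self, replication_delay=0.2):
        self.primary = {}
        self.secondary = {}
        self.delay = replication_delay

    def _replicate(self, key, value):
        self.secondary[key] = value

    def write(self, key, value):
        self.primary[key] = value                      # acknowledged immediately
        t = threading.Timer(self.delay, self._replicate, (key, value))
        t.start()                                      # replication happens later
        return t                                       # handle so we can wait on it

    def read_secondary(self, key):
        return self.secondary.get(key)                 # may be stale

store = EventuallyConsistentStore()
t = store.write("counter", 1)
stale = store.read_secondary("counter")   # read before replication: None
t.join()                                  # wait for replication to catch up
fresh = store.read_secondary("counter")   # now consistent: 1
```

Systems that instead wait for replication before acknowledging the write avoid the stale read, at the cost of write latency – which is exactly the overhead eventual consistency was designed to shed.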
More recently, these systems have begun to handle operational, rapid-update data in clouds. In most cases they make such data readable before it has been durably written to disk – so a system crash before the write completes loses the data, and later analyses see a different data set than the original "readers" did.
Over the last few years, NoSQL databases like MongoDB have improved the fraction of data they keep consistent – but they still fall short of a relational database. Likewise, NoSQL databases are beginning to offer durable operational processing that ensures a write reaches disk before it can be read, but most deployments continue to allow read-before-write.
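Why writing to disk before acknowledging matters can be shown with a toy write-ahead log: a store that only keeps data in memory loses it on a crash, while one that appends and fsyncs a log entry first can replay (roll forward) the log on restart. This is a minimal sketch of the general technique, not any particular database's recovery code; the file format and names are invented.

```python
import os
import tempfile

class DurableStore:
    """Toy key-value store: every write is appended and fsynced to a log
    before being acknowledged, so recovery can replay the log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Roll forward: rebuild in-memory state from the on-disk log.
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    key, value = line.rstrip("\n").split("=", 1)
                    self.data[key] = value

    def write(self, key, value):
        with open(self.log_path, "a") as f:
            f.write(f"{key}={value}\n")
            f.flush()
            os.fsync(f.fileno())   # durable on disk before we acknowledge
        self.data[key] = value

log = os.path.join(tempfile.mkdtemp(), "store.log")
s1 = DurableStore(log)
s1.write("balance", "100")
del s1                      # simulate a crash: in-memory state is gone
s2 = DurableStore(log)      # restart: recovery replays the log
print(s2.data["balance"])   # → 100
```

A store that skipped the fsync-before-acknowledge step would answer reads faster, but the `del s1` "crash" would silently lose the write – the read-before-write behavior described above.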
Lack of Deep Analytics Support
In order to achieve high performance, Hadoop-based databases avoid the software overhead of complex data semantics – the semantics that let end users carry out "what-if" analytics by flexibly slicing data across multiple dimensions. Where there is a very wide range of ways to slice the data, or the data's type and meaning change fairly rapidly, such on-the-fly querying may be preferable. But where the data scientist wishes to drill down to understand more about what is happening, SQL and relational databases for the most part win hands down.
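A small example of what "drill down" means in practice: the same relational table answers a coarse question (revenue by region) and then a finer one (revenue by region and product) with two declarative queries, with no reshaping of the data. The sketch below uses Python's built-in sqlite3 module; the schema and rows are invented for illustration.

```python
import sqlite3

# In-memory relational table of sales facts (invented sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("East", "widget", 100.0), ("East", "gadget", 50.0),
    ("West", "widget", 80.0),  ("West", "gadget", 120.0),
])

# Coarse slice: total revenue per region.
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Drill down one dimension deeper: revenue per region AND product.
drill = conn.execute(
    "SELECT region, product, SUM(amount) FROM sales "
    "GROUP BY region, product ORDER BY region, product"
).fetchall()

print(by_region)  # [('East', 150.0), ('West', 200.0)]
```

Each new question is one more GROUP BY clause; in a store without relational semantics, the same drill-down typically means writing and running a new aggregation job.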
Lack of High-Availability Features
This one is more a matter of the relatively recent advent of Hadoop-based NoSQL data processing, but it is a significant limit. Where an IBM, EMC, Oracle or SAP offering may include many failover, remote-copy and roll-back/roll-forward features, these remain relatively rare in NoSQL environments. Moreover, SQL databases have had years to work out the best way to achieve high availability in scale-up and moderately scale-out environments, while NoSQL solutions – born in massively scale-out public clouds – must now adapt to private cloud, moderate scale-out high-availability needs as well.
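One piece of the failover machinery mature SQL stacks bundle is automatic promotion: when the primary fails, the replica that has applied the most of the replication log is promoted, minimizing data loss. The sketch below is a deliberately simplified, generic illustration of that election rule – all names are invented, and real failover also involves heartbeats, fencing and client redirection.

```python
class Node:
    """Toy replica: tracks how far it has applied the replication log."""
    def __init__(self, name, log_position=0):
        self.name = name
        self.log_position = log_position
        self.role = "replica"

def elect_new_primary(replicas):
    # Promote the most up-to-date replica to minimize lost transactions.
    winner = max(replicas, key=lambda n: n.log_position)
    winner.role = "primary"
    return winner

# The old primary has failed; three replicas remain at different positions.
replicas = [Node("r1", 120), Node("r2", 150), Node("r3", 90)]
new_primary = elect_new_primary(replicas)
print(new_primary.name)  # r2 — it had replicated the most data
```

Transactions beyond log position 150 are still lost in this scenario, which is why the remote-copy and roll-forward features mentioned above matter: they shrink that window.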
SQL or NoSQL or …
A speaker at the recent In-Memory Computing Summit in San Francisco, where both NoSQL and SQL were well represented, put it best: the speaker's firm had seen a movement among its clients from SQL to NoSQL, and is now seeing a movement back to SQL. This does not imply that NoSQL is going away; but it does suggest that NoSQL is focusing more and more on "operational" data processing, while SQL databases continue their strong showing in analytics – which is increasing in strategic importance to the user base.
And so, the limits on NoSQL suggest that such a division of labor makes a great deal of sense in many use cases, or even (as some speakers suggested during the summit) as an enterprise-wide or cloud-wide database architecture. NoSQL's limits don't mean it lacks usefulness in general. They do, however, suggest that savvy users will limit their enthusiasm for Hadoop über alles.
Wayne Kernochan is the president of Infostructure Associates, an affiliate of Valley View Ventures that aims to identify ways for businesses to leverage information for innovation and competitive advantage. An IT industry analyst for 22 years, he has focused on analytics, databases, development tools and middleware, and ways to measure their effectiveness, such as TCO, ROI and agility measures. He has worked for respected firms such as Yankee Group, Aberdeen Group and Illuminata, and has helped craft marketing strategies based on competitive intelligence for vendors ranging from Progress Software to IBM.