The open source Apache Hadoop project has become synonymous with the Big Data ecosystem in recent years. Today the biggest update since the project began is becoming generally available with the release of Hadoop 2.0.
Building Hadoop 2.0 required four years of effort and involved a good deal of complexity.
"A lot of complexity arises from the fact that Hadoop is at the bottom of the stack for storage and data processing, and there is a rich ecosystem of components on top," Arun C. Murthy, release manager of Apache Hadoop 2 and founder of Hortonworks, told Enterprise Apps Today. "Hadoop is already a key component of the modern data architecture on top of which several enterprises have built their applications."
Overall expectations for Hadoop are high, Murthy said, and it took time to deliver on them.
Among the key additions in Hadoop 2.0 is a technology called YARN, which Murthy described as an operating system for running applications such as MapReduce and Storm. "YARN provides resource management in a generic manner," he said.
Murthy explained that in Hadoop 1.x, MapReduce was both the system and the framework. Because the two were so deeply intertwined, Hadoop could only run MapReduce workloads. With YARN, Hadoop evolves beyond batch processing to a model in which data can be processed in real time with Storm and interactive queries can be executed via Hive.
What's Next for Hadoop?
Now that Apache Hadoop 2.0 is generally available, several areas of improvement are slated for future releases. Better scheduling features in YARN are among them, Murthy said, along with support for an in-memory cache for HDFS data.
Sean Michael Kerner is a senior editor at Enterprise Apps Today and InternetNews.com. Follow him on Twitter @TechJournalist.