The open source Hadoop project is getting a real boost today from virtualization vendor VMware. Getting Hadoop up and running on infrastructure, real or virtual, and then layering in traditional application availability and controls has been no easy task.
VMware today took the wraps off Project Serengeti, an open source effort that enables Hadoop to run on virtual infrastructure. VMware is also contributing code to the upstream Apache Hadoop project to make Hadoop more responsive and efficient when running on virtualization technologies.
"Project Serengeti is about simplifying deployment of Hadoop on VMware," Fausto Ibarra, senior director of Product Management at VMware, told InternetNews.com. "With Serengeti you can have a fully functional Hadoop deployment in as little as 10 minutes."
With Serengeti, Hadoop can run on a regular VMware vSphere deployment and then take advantage of all the usual VMware management and availability tools. Hadoop itself is packaged as a standard OVF (Open Virtualization Format) machine image file, just like any other VMware image.
When Serengeti is deployed on vSphere, the OVF creates two virtual machines. One is the Serengeti server; the other is the master virtual machine that will be cloned to create Hadoop nodes.
"The user connects into the Serengeti Server virtual machine, and that's where they run all the commands to create a cluster," Ibarra said. "The master virtual machine is then cloned and configured, and after that you have a fully functional Hadoop cluster."
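The workflow Ibarra describes can be modeled in a few lines of Python. This is only an illustrative sketch of the template-and-clone idea, not Serengeti's actual interface: the names `VirtualMachine`, `SerengetiServer`, `deploy_ovf` and `create_cluster`, as well as the node roles and naming scheme, are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VirtualMachine:
    name: str
    role: str


@dataclass
class SerengetiServer:
    """Hypothetical stand-in for the management VM; not Serengeti's real API."""

    template: VirtualMachine
    clusters: dict = field(default_factory=dict)

    def create_cluster(self, cluster_name: str, workers: int) -> list:
        # Clone the template once for a Hadoop master node and once per worker.
        roles = ["master"] + ["worker"] * workers
        nodes = [
            VirtualMachine(name=f"{cluster_name}-{role}-{i}", role=role)
            for i, role in enumerate(roles)
        ]
        self.clusters[cluster_name] = nodes
        return nodes


def deploy_ovf():
    # Importing the OVF yields two VMs: the management server and the
    # template ("master virtual machine") that is later cloned into nodes.
    template = VirtualMachine(name="hadoop-template", role="template")
    server = SerengetiServer(template=template)
    return server, template


if __name__ == "__main__":
    server, template = deploy_ovf()
    for node in server.create_cluster("demo", workers=3):
        print(node.name, node.role)
```

The point of the model is that the user only ever talks to the server object, while every Hadoop node is derived from the single template image.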
Serengeti also provides configuration and management capabilities for Hadoop. While other vendors offer Hadoop management tools, including Cloudera with its Cloudera Manager, Ibarra doesn't see Serengeti as competing with them. He stressed that Serengeti is for virtual infrastructure, while Cloudera Manager is focused on physical infrastructure.
He added that VMware is also partnering with Cloudera to help enable the CDH distribution of Hadoop to run on Serengeti. Cloudera recently updated CDH to version 4, providing new performance and scalability features.
Moving forward, Ibarra noted that the overall direction for VMware with Serengeti is to bring Hadoop closer to the cloud.