Software marketing executives spend a lot of time looking for problems their solutions can solve. With Big Data, they don't have to look very hard.
"I was at a CIO symposium, and the first question was, ‘How many people here feel as though their organization has a Big Data problem?’ and every hand in the room went up,” said Scott Schlesinger of Capgemini. “The next question was, ‘How many of you feel as though you have an understanding of what that means and how to address the Big Data problem?’ and sadly not one hand, in a group with some of the best and brightest CIOs in the country, went up.”
While some vendors present Big Data as a standalone problem with Hadoop as a standalone solution, experts say it’s a myth that Hadoop will usurp existing relational databases. Instead, most see the Hadoop stack as a tool for handling unstructured, high-volume processing. So it’s important to think about how Hadoop and other Big Data solutions will fit in with your current enterprise applications and IT systems.
With that in mind, here are seven enterprise-friendly ways you can get started with Hadoop.
Use Hadoop as a staging layer for analytics
Hadoop is one of the most talked-about Big Data solutions, and with good reason: It’s relatively cheap, it’s open source, and it includes not just the Hadoop Distributed File System (HDFS) but a stack of technology tools designed to help deal with Big Data.
Given that the Hadoop stack evolved from Google’s need for a better way to process data for searches, it’s hardly surprising that analytics remains a focal point for many Hadoop deployments. Former Forrester Research analyst and current IBM Big Data evangelist James Kobielus said Hadoop is typically deployed “as a massive data acquisition and staging layer for all this unstructured content coming from social media.”
This is where Hadoop MapReduce — which Kobielus calls the real “core” of Hadoop — comes into play. “What MapReduce represents is the industry’s first-ever vendor-agnostic framework for building a broad range of advanced analytics,” he said. “What we think of as advanced analytics is supported within this sort of abstraction framework for development called MapReduce.”
After the data is transformed, it’s fed downstream into another database, such as an Oracle Exadata data warehouse or OLAP cubes, or even to an in-memory platform such as Oracle Exalytics, he added.
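To make the staging idea concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It is an illustration of the programming model only, not Hadoop's actual API: the sample posts and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions made up for this example, and a real job would run distributed across a cluster rather than in a single process.

```python
from collections import defaultdict

# Hypothetical sample of unstructured social media posts (illustrative only).
POSTS = [
    "Loving the new phone #gadgets #mobile",
    "Battery life could be better #mobile",
    "Great service today #support",
]

def map_phase(post):
    """Emit a (hashtag, 1) pair for every hashtag found in one post."""
    for word in post.split():
        if word.startswith("#") and len(word) > 1:
            yield word.lower(), 1

def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single structured row."""
    return key, sum(values)

def run_job(posts):
    """Run map, shuffle and reduce in sequence and return sorted rows."""
    pairs = (pair for post in posts for pair in map_phase(post))
    grouped = shuffle(pairs)
    return sorted(reduce_phase(key, values) for key, values in grouped.items())

# Structured (hashtag, count) rows, ready to load into a downstream warehouse.
rows = run_job(POSTS)
for hashtag, count in rows:
    print(hashtag, count)
```

The output is a small table of hashtag counts, which is the kind of structured result that would then be handed to a data warehouse or OLAP tool.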
Movement into and out of Hadoop is not an issue, Kobielus said, since most traditional ETL tools already offer interfaces and connectors to the Hadoop Distributed File System (HDFS), which is the storage layer for unstructured content in Hadoop environments.
“In fact, the data warehousing companies all support connectors into Hadoop — mostly bilateral connectors to move data into and out of Hadoop clusters and into and out of their data warehouses,” he explained. “The data movement or function or concern is well addressed by the products that are already out there.”
Supplement enterprise data warehouse platform with Hadoop
Hadoop isn’t the only way to solve Big Data problems. In fact, the vast majority of Big Data analytics can be handled with traditional enterprise data warehouse (EDW) platforms, according to Kobielus and other industry experts. Teradata, Oracle Exadata, the IBM Smart Analytics System and the EMC Greenplum database all use traditional approaches to dealing with large amounts of data.
“A lot of people out there, some CIOs included, don’t have a firm grasp on what Hadoop is, how to leverage Hadoop and how Hadoop fits within their current environment,” Schlesinger said. “A lot of people have some of those existing technologies in their landscape now and they want to understand, how can they leverage those, maybe in conjunction with Hadoop, but what can they do with what they have now?”
Why even discuss Hadoop when these solutions exist? Because Hadoop is more scalable, Kobielus said. It also happens to be open source, potentially more affordable and fast.
If you’re dealing with large volumes of structured data, you might be better off with a traditional data warehousing tool. But even if you do decide to use Hadoop or some other NoSQL solution, you’ll need to determine how Big Data tools fit in with your existing enterprise data warehouse environment. Forrester notes that Hadoop clusters typically are used as a staging layer for transforming data before it’s loaded into a traditional EDW, or as a staging layer for storage behind an EDW or data mart.
You’ll seldom see a standalone Hadoop stack, said Pentaho’s founder and chief strategy officer, Richard Daley.
“What we see out in the field, where customers are actually putting these things into production, is a very clear hybrid environment,” Daley said. “Hadoop will coexist with NoSQLs like a Cassandra and they’re also going to coexist with existing relational data technologies and even some of the high-performance analytical engines. They’re there to augment those (relational) environments.”
Try a Hadoop appliance
Many companies lack the Hadoop experience to build their own system from scratch but still want the processing power of Hadoop. For these companies, a Hadoop appliance can be a good way to get started on processing Big Data without being encumbered by hardware.
EMC Greenplum HD and Oracle’s Big Data Appliance are appliances that run Apache Hadoop and MapReduce. This space is getting larger every day, thanks primarily to Cloudera, which offers its own distribution of Apache Hadoop along with enterprise services for deployments. Dell, HP and Oracle all offer hardware appliances that run Cloudera’s distribution of Hadoop.
To give you an idea of what that would cost, Oracle’s Big Data Appliance launched in January at a price point of $450,000. Hardware and software maintenance adds 12 percent of the base cost, which works out to $54,000.
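For budgeting purposes, the arithmetic behind those figures can be sketched in a few lines of Python. The constant and function names are hypothetical, and the article does not say whether the maintenance charge recurs, so the sketch only computes the single maintenance amount and the combined figure.

```python
BASE_PRICE = 450_000        # launch price quoted above, in US dollars
MAINTENANCE_PERCENT = 12    # hardware and software maintenance, as a percentage

def maintenance_cost(base_price, percent):
    """Return the maintenance charge using integer arithmetic to avoid rounding drift."""
    return base_price * percent // 100

def combined_cost(base_price, percent):
    """Return the base price plus one maintenance charge."""
    return base_price + maintenance_cost(base_price, percent)

maintenance = maintenance_cost(BASE_PRICE, MAINTENANCE_PERCENT)
total = combined_cost(BASE_PRICE, MAINTENANCE_PERCENT)
print(f"Maintenance: ${maintenance:,}")
print(f"Base plus maintenance: ${total:,}")
```

Swapping in a different base price or percentage gives a quick comparison point when weighing an appliance against a rented or self-built cluster.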
Experiment with Big Data in the cloud
Sometimes it’s better to rent, particularly if you’re still figuring out how Big Data fits in with your overall business strategy. Amazon Web Services offers a subscription-based Elastic MapReduce (EMR) service that’s won over both large and mid-sized enterprises. For very little up-front investment and zero hardware costs, you can be up and running with an enterprise-grade Hadoop platform. Another Big Data option from Amazon: DynamoDB, a NoSQL database.
The downside? The Forrester Wave: Enterprise Hadoop Solutions, Q1 2012 cautions that Elastic MapReduce supports only unidirectional integration with third-party enterprise data warehouses. Still, AWS has a long list of partners that offer support for Hadoop queries, modeling, integration and business applications.
Google Compute Engine recently announced a partnership with MapR that will make Hadoop available on that service as well. Google also offers the BigQuery analytical database. Its cloud application hosting service, App Engine, also offers a MapReduce tool.
Try a hybrid Hadoop solution
Hadoop isn’t an all-or-nothing proposition. Hadoop is open source, which makes it easy to offer new distributions of Hadoop as well as hybrid solutions that natively incorporate aspects of Hadoop. From massively parallel data warehouse products to data integration tools, you’ll find there are plenty of ways to connect to, deploy and use Hadoop with your enterprise apps.
MarkLogic, which is best known for providing the unstructured database server for LexisNexis, is a great example of how Hadoop can be used as a companion technology to create interesting new use cases, Kobielus said.
“Hadoop is not just running inside this thing called a Hadoop distro. The MapReduce is executing inside a traditional database or inside a data integration sort of fabric,” he said. “And, really, there’s a broad range of innovative other vendors that are sort of hybridizing Hadoop and RDBMSs and NoSQL.”
Hire Hadoop expertise
If you have a Big Data problem but don’t have the staff expertise or time to figure it all out, it might be time to bring in a Hadoop consultant. Although this is still an emerging area, professional service firms for Hadoop do exist. Hortonworks was founded last year as a joint venture between Yahoo and Benchmark Capital to provide just this type of professional service. The company helps both vendors and end-user companies.
Capgemini also recently launched a new service line for Big Data analytics projects. Of course, there’s always IBM, which offers its own Hadoop platform. Technology research firms such as Forrester and Gartner provide yet another option.
Wait for Microsoft
Let’s face it: Some IT organizations won’t make a move on anything until there’s a Windows-based solution. Microsoft recently launched a by-invitation-only developer preview for its Apache Hadoop-based services for Windows Azure, which means there will be three cloud-based options for Hadoop.
Hadoop will also be offered on Windows Server. Microsoft plans to contribute its code back to the Apache Hadoop project, which means anybody will be able to run an open source version of Hadoop on Windows.
Loraine Lawson is a freelance writer specializing in technology and business issues, including integration, healthcare IT, cloud and Big Data.