Hadoop is an open source software framework that enables organizations to process huge amounts of data, huge as in petabytes. R is an open source software programming language popular with statisticians who have long used it for data mining and creating predictive models.
Revolution Analytics, a company that provided the first commercial distribution of R in 2007 and thus is to R what Red Hat is to Linux, thinks a combination of Hadoop and R has the potential to be a match made in Big Data heaven. To help illustrate this, Revolution sponsored an “Applications of R in Business Competition.”
Moving from processing and storing huge amounts of data to analyzing it marks the next step of an evolution in how enterprises will utilize their data, said David Smith, Revolution’s vice president of marketing and community, noting that Revolution published an integration between Hadoop and R in 2011. “Enterprises have spent a lot of money storing all this data. Now they want to analyze it,” he said.
R has been especially popular among data statisticians and academicians for predictive analysis and forecasting, an area that is of increasing interest to enterprises. “It’s no longer just about extrapolating trends but actually trying to forecast them, using not just your own databases but external data sources as well,” Smith said.
Smith anticipates a growing number of “intersections between R and Hadoop” and other parallel processing environments as enterprises up their data analysis game. In an interesting real-world example of the possibilities, three researchers used a combination of R and IBM’s Netezza platform to test the effectiveness of proposed stock market controls designed to prevent huge and sudden swings in trading. Working for Barron’s, they analyzed more than 24 billion U.S. stock transactions that occurred between 2008 and 2010, a task that required 8,035 hours of computer processing across 60 processors in parallel.
The winners of Revolution’s contest, Nationwide Insurance’s Shannon Terry and Ben Ogorek, created a browser-based In-flight Forecasting System, designed to forecast the total incremental benefit of a marketing tactic when only a fraction of customer responses have been observed. The goal is to identify both successes and failures early on in campaigns to maximize the value of marketing budgets. Using the software, marketers can test alternatives, evaluate their relative impact while campaigns are “in-flight” and adjust marketing spend when needed.
A key part of their entry involved what they referred to as an “obscure technique known as isotonic regression.” Because this functionality is built into R, it required relatively little coding on their part.
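Isotonic regression fits a monotone (here, non-decreasing) curve to data, and base R ships it as `stats::isoreg()`, which uses the pool-adjacent-violators algorithm. The snippet below is an illustrative sketch only; the data is invented, and the contest entry's actual inputs are not shown in the article.

```r
# Isotonic regression via base R's stats::isoreg() -- the "obscure
# technique" mentioned above. It fits a non-decreasing step function
# to the observations.
set.seed(42)
x <- 1:20
y <- log(x) + rnorm(20, sd = 0.2)  # toy data: noisy but broadly increasing
fit <- isoreg(x, y)
fit$yf                             # fitted values, guaranteed non-decreasing
all(diff(fit$yf) >= 0)             # TRUE: monotonicity holds by construction
```

Because the fit is constrained to be monotone, it suits quantities like cumulative campaign response, which can only grow as more customer responses arrive.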
They also used R to create a browser-based version of their code, migrating the R code running on their laptops to a Linux-based server, then using CGI processing to create style sheets resembling a presentation template popular with Nationwide’s internal business users, along with supporting HTML functions.
The user-friendly interface “allows our business partners to access the power of R via their favorite web browser without having to install any software or submitting any R code directly,” they wrote in their entry. “In fact, the only way that our business partners would even know R is running is to recognize when the URL ends with ‘.R’. Since R is a business-owned and managed tool at our organization, this general approach has allowed us to rapidly deploy many analytic applets and widgets that serve as prototypes. The best of these mini applications move out of our lab and are rebuilt with the assistance of our IT partners in a more scalable R environment.”
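The CGI pattern they describe can be sketched as a minimal Rscript that emits an HTTP header followed by HTML. Everything below, from the file names to the placeholder computation, is illustrative and is not Nationwide's actual code.

```r
#!/usr/bin/env Rscript
# Minimal CGI-style sketch: a web server configured to execute files
# ending in ".R" runs this script and returns whatever it prints.
cat("Content-type: text/html\n\n")       # CGI header, then required blank line
lift <- round(runif(1, 0.8, 1.2), 2)     # placeholder for a real forecast
cat("<html><body>\n",
    "<h1>Campaign forecast</h1>\n",
    sprintf("<p>Projected lift: %.2f</p>\n", lift),
    "</body></html>\n", sep = "")
```

The appeal of the approach is exactly what the authors describe: business users see only a URL and a styled page, while all the statistical work happens in R on the server.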
“An opportunity to demonstrate the business relevance of R was too important to pass up,” said Terry, Nationwide’s director of Information Science. “With these contest submissions, we are excited to share our favorite applications of R from 2011. The open-source R community has genuinely inspired Nationwide, and thanks to this incredible community we are generating new ideas and enhancing our analytic capabilities."
Like some other open source projects, R has an active support community. Many of the Revolution contest entries were based on “packages” created by members of the community, which Smith likened to “building blocks for applications.” There are more than 4,000 packages today and they are “increasing exponentially,” Smith said.
Contest runner-up Jeffrey Breen from Atmosphere Research Group used several packages to create his application for Mining Twitter for Airline Consumer Sentiment, in which he downloaded a tweet stream, searched for relevant hashtags such as “Delta” and “Southwest” and analyzed whether consumers had positive or negative reactions to the airlines over time. Breen wrote just 40 or so lines of code, as he used an R package that allows developers to bring tweet streams into R environments and another for performing sentiment analysis, along with some of R’s graphical capabilities to visualize the results. Breen received a $5,000 cash prize.
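One common way such an analysis fits in a few dozen lines is a lexicon-based scorer: count positive words, subtract negative words. The sketch below assumes that approach; the word lists and tweets are tiny invented stand-ins for a real opinion lexicon and a real tweet stream.

```r
# Simple lexicon-based sentiment scoring:
# score = (# positive words matched) - (# negative words matched).
pos.words <- c("great", "good", "love", "thanks", "awesome")
neg.words <- c("delayed", "lost", "worst", "cancelled", "rude")

score.sentiment <- function(tweets, pos, neg) {
  sapply(tweets, function(t) {
    # strip punctuation, lowercase, split into words
    words <- tolower(unlist(strsplit(gsub("[[:punct:]]", " ", t), "\\s+")))
    sum(words %in% pos) - sum(words %in% neg)
  }, USE.NAMES = FALSE)
}

tweets <- c("Love the new app, great flight today",
            "Worst airline ever, bag lost and flight delayed")
score.sentiment(tweets, pos.words, neg.words)   # c(2, -3)
```

Plotting these scores by airline and by day gives the over-time view of consumer sentiment that Breen's entry visualized.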
Terry and Ogorek received a $10,000 cash prize for their winning efforts, plus Revolution R Enterprise 5.0, Revolution Analytics’ production-grade R software, which extends open source R’s capabilities with commercial enhancements for higher performance, greater scalability and stronger reliability. Enterprises typically work with data sets far larger than those used in the contest, Smith said. While R works in-memory and thus is limited by the amount of available RAM, Revolution addresses these capacity and scale issues by providing a parallel data processing framework for R that “lets you run lots of data in parallel at once, on one machine or on a cluster of machines as in Hadoop,” Smith explained.
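Revolution's framework itself is commercial, but the general pattern Smith describes, splitting a computation across workers and combining partial results, can be sketched with base R's bundled `parallel` package. The workload below is illustrative.

```r
# Split/compute/combine across worker processes, sketched with base
# R's "parallel" package (Revolution's own framework is commercial
# and not shown here).
library(parallel)
cl <- makeCluster(2)                            # two worker processes
chunks <- split(1:1e6, rep(1:2, each = 5e5))    # partition the data
partial <- parLapply(cl, chunks,
                     function(v) sum(as.numeric(v)))  # per-worker sums
stopCluster(cl)
Reduce(`+`, partial)                            # 500000500000, same as the
                                                # single-machine sum
```

The same split/compute/combine shape is what lets R-based analyses scale past a single machine's RAM, whether the workers are local cores or nodes in a Hadoop cluster.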
The business people who receive these kinds of analyses likely don’t realize how much work goes into the final product, and “the competition was a way for us to highlight that,” Smith said.
The contest also demonstrated the potential usefulness of predictive analytics throughout an enterprise, Smith said. “Each of the applications selected demonstrates how predictive analytics with R goes well beyond traditional business intelligence to empower business decision-makers to evaluate critical success factors within important business processes early and often. We see from the winning entries that a wide variety of industries and processes – from marketing, heavy manufacturing, clinical trial design, and IT project management all share a need for ongoing predictive analysis.”
One of the judges, Edd Dumbill, chairman of O’Reilly Media’s Strata Conference, said, “R attracts some incredibly innovative minds, and it has been exciting to see a wide set of examples of R being put to use in business. It is imperative that businesses get agile and powerful tools to work with to analyze data.”
Honorable mentions, each receiving a $1,000 cash prize, went to:
- Steve Simon for “Predicting the Duration of Your Clinical Trial,” a prediction model for patient accrual in clinical trials, which is a critical success factor in completing trials on time and on budget.
- Bengt Maas and Hakan Koç for “Towards an Ideal Steel Plant,” an online temperature prediction model for steel that calculates optimal default melting, processing, and casting temperatures so that processing conditions remain stable and higher material quality is ensured.
- Shannon Terry and Ben Ogorek for “Quantifying Uncertainty in IT Estimates,” in which they rendered accurate estimates of IT work effort, critical for deciding where in technology a business should invest.
- Yihui Xie for “knitr: The Elegant, Flexible, and Fast Dynamic Report Generation with R,” a package that saves time and effort by regenerating a report’s figures and results automatically, eliminating the need to re-enter them by hand.
- Imre Kocsis for “Time Series Analysis and Order Prediction with R,” a predictive model that supports visual data exploration and can be used to predict to arbitrarily long horizons in the future.