A predictive analytics initiative can be an expensive undertaking. Fortunately, there are some great open source predictive analytics tools.
Sometimes data scientists want to model Big Data from within Microsoft Excel and RStudio. One favorite open source analytics tool for this is H2O. With it, they can connect to Big Data in HDFS, S3, SQL and NoSQL data sources, and then compare the resulting predictions. H2O requires Java and runs on Windows 7, OS X 10.9, Ubuntu 12.04 or RHEL/CentOS 6.
One tutorial video demonstrates installation from the command line in Ubuntu. In Windows, the downloaded H2O file can be unzipped by right-clicking it. Running the command line or opening the executable file in Ubuntu or Windows starts the H2O Flow Web UI on the local host.
H2O Flow provides interactive help on importing files, setting up parsing options, building models and improving predictions, using either the data scientists' own data or demo data (uncompressed). Programming-savvy data scientists can choose to install the open source predictive analytics tool for Python, for R or on Hadoop.
H2O's NanoFast™ Scoring Engine is used to power business applications. Its algorithms include distributed trees and regression, such as Gradient Boosting Machine, Random Forest, Generalized Linear Modeling and Component Analysis.
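To make the gradient boosting idea concrete, here is a minimal pure-Python sketch. It is not H2O's implementation or API. It assumes a single numeric feature, squared-error loss and one-split trees ("stumps"), and the names fit_stump, fit_gbm and predict, the toy data and the parameter values are all illustrative assumptions.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the single threshold on xs that best fits the residuals (squared error)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def fit_gbm(xs, ys, rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = mean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


# Toy data (assumed): the target jumps from 1 to 5 when x exceeds 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]
model = fit_gbm(xs, ys)
print(round(predict(model, 2), 2), round(predict(model, 5), 2))
```

Each round corrects a fraction of the remaining error, which is why the predictions approach the targets gradually rather than in one step. A distributed implementation would also spread the split search across machines, which this sketch does not attempt.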
If data scientists are modular-minded when making predictions, they should consider KNIME, the Konstanz Information Miner. This open source analytics tool integrates components for machine learning and data mining through its modular data pipelining concept. Its graphical user interface makes it easy for data scientists to assemble nodes to build predictive analytics models, analyze data and visualize the results.
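The modular pipelining idea can be sketched in a few lines of Python. This is not KNIME code and does not use any KNIME API. The node functions, column names and sample rows are hypothetical. Each "node" takes a table (a list of dictionaries) and returns a new one, so nodes can be rearranged or swapped freely.

```python
from functools import reduce
from statistics import mean


def drop_missing(rows):
    """Node 1: discard rows that contain a missing value."""
    return [row for row in rows if None not in row.values()]


def add_ratio(rows):
    """Node 2: derive a new column from two existing ones."""
    return [{**row, "spend_per_visit": row["spend"] / row["visits"]} for row in rows]


def summarize(rows):
    """Node 3: reduce the table to a single summary value."""
    return {"mean_spend_per_visit": mean(row["spend_per_visit"] for row in rows)}


def run_pipeline(data, nodes):
    """Pass the data through each node in order, like following a workflow's edges."""
    return reduce(lambda table, node: node(table), nodes, data)


# Hypothetical input table; the second row has a missing value.
rows = [
    {"visits": 4, "spend": 100.0},
    {"visits": 2, "spend": None},
    {"visits": 5, "spend": 50.0},
]
result = run_pipeline(rows, [drop_missing, add_ratio, summarize])
print(result)
```

Because every node shares the same input and output shape, replacing summarize with a different analysis step requires no change to the other nodes. A graphical workflow editor offers the same property through drag and drop.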
KNIME Analytics Platform runs on Windows, Linux and Mac OS. It provides over 1,000 data analytic routines, either natively or through R and Weka, for areas such as univariate and multivariate statistics, data mining, Web analytics, network analysis and social media analysis.
KNIME Big Data Extension is part of the commercial KNIME.com AG offerings. This extension offers a set of nodes for accessing Hadoop/HDFS via Hive from inside KNIME.
R is a popular, flexible open source tool, but some data scientists find that it is slow, does not scale well and limits data set size. Analyzing much larger data sets is possible with HP Haven Predictive Analytics. Powered by HP Vertica and Distributed R, the open source predictive analytics tool integrates with a massively parallel processing (MPP) platform for much faster analyses in R.
To reduce execution time, Distributed R splits tasks among multiple processing nodes. Data scientists can use the R console and RStudio to analyze data and build models on Red Hat/CentOS and Ubuntu platforms. They can either work with developers to build custom parallel algorithms or use pre-packaged parallel algorithms.
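The general split-and-combine pattern behind this kind of speedup can be illustrated in Python. This is a sketch of the idea only, not Distributed R or its API. Worker threads stand in for processing nodes, and the function names and data are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, parts):
    """Cut the data into roughly equal contiguous chunks."""
    size = -(-len(values) // parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum(chunk):
    """Work done independently on one chunk: return (sum, count)."""
    return sum(chunk), len(chunk)


def parallel_mean(values, workers=4):
    """Compute a mean by combining per-chunk partial results."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = split(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 1001))
print(parallel_mean(data))
```

The key point is that each worker returns a small summary (a sum and a count) rather than its raw data, so the final combine step stays cheap no matter how large the chunks are. In a real cluster the chunks would already live on separate machines.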
Some data scientists want to do massive predictive analytics inside Hadoop but without coding. One possibility is Actian Vortex Express (Hadoop SQL Edition), a free graphical community version of Actian's predictive analytics platform. Powered by Vortex, it runs on Linux and handles up to 500 GB of data.
This open source predictive analytics tool runs analytic workflows at least 10 times faster than MapReduce. It uses a columnar analytics database that runs on an unlimited number of Hadoop HDFS nodes, with YARN handling resource management. For analytic queries that read only a few columns across many rows, columnar databases generally run much faster than their row-based counterparts.
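The difference between row and column layouts can be shown with plain Python data structures. This is only an illustration of the storage idea, with made-up data and function names. It does not represent Actian's engine and makes no performance measurement.

```python
# Row layout: each record is stored together, as a row-based database would keep it.
rows = [
    {"customer": "a", "region": "north", "amount": 120.0},
    {"customer": "b", "region": "south", "amount": 80.0},
    {"customer": "c", "region": "north", "amount": 50.5},
]


def to_columns(rows):
    """Column layout: all values of one field are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(rows, name):
    """Every record must be visited, even though only one field is needed."""
    return sum(row[name] for row in rows)


def total_from_columns(columns, name):
    """Only the one requested column is touched."""
    return sum(columns[name])


columns = to_columns(rows)
print(total_from_rows(rows, "amount"), total_from_columns(columns, "amount"))
```

Both functions return the same total, but the columnar version reads a single contiguous list and ignores the other fields entirely. On disk, that means less data read per analytic query, and values of the same type stored side by side also tend to compress well.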
Using the graphical interface of Actian Vortex Express, a data scientist can choose from hundreds of analytics functions and build workflows by dragging and dropping them into place.
Data scientists sometimes work with software developers to create predictive analytics applications based on customers' previous behaviors. One favorite open source analytics tool for this is PredictionIO, a machine learning server that lets data scientists reuse components and build and deploy predictive analytics applications. Developers can choose to download predictive engine templates from a gallery and customize them.
The core part of PredictionIO is an engine deployment platform built on top of Apache Spark. The Event Server serves as a data collection and analytics layer on top of Apache HBase. PredictionIO runs on Linux and Mac OS X. It can be installed from source code (the master branch), with Docker (community contributed) or with Vagrant. It can also be launched on AWS.
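As a rough idea of what a data collection layer receives, the sketch below builds one behavior event as JSON. The helper build_event is hypothetical, and the field names (event, entityType, entityId, targetEntityType, targetEntityId, properties, eventTime) are an assumption about the general shape of such a record. They should be checked against the PredictionIO documentation before use. Nothing is sent over the network.

```python
import json
from datetime import datetime, timezone


def build_event(event, entity_type, entity_id,
                target_type=None, target_id=None,
                properties=None, when=None):
    """Assemble one customer-behavior event as a plain dictionary (assumed schema)."""
    record = {
        "event": event,
        "entityType": entity_type,
        "entityId": entity_id,
        "eventTime": (when or datetime.now(timezone.utc)).isoformat(),
    }
    # Optional parts are included only when they are supplied.
    if target_type is not None and target_id is not None:
        record["targetEntityType"] = target_type
        record["targetEntityId"] = target_id
    if properties:
        record["properties"] = properties
    return record


# Hypothetical example: user u1 rates item i42 with five stars.
event = build_event(
    "rate", "user", "u1",
    target_type="item", target_id="i42",
    properties={"rating": 5},
    when=datetime(2015, 1, 1, tzinfo=timezone.utc),
)
payload = json.dumps(event, sort_keys=True)
print(payload)
```

A stream of records like this ("who did what to which item, and when") is the raw material from which an engine can learn customers' previous behaviors.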