Want real-time analytics? You need stream processing. These 10 open source stream processing projects can help get you started.
The Spark analytics engine is the most popular project managed by the Apache Software Foundation, and it received a $300 million boost from IBM. Wikibon estimates it will account for 6 percent of total Big Data spending in 2016, growing to 37 percent by 2022.
Spark originated at the University of California, Berkeley's AMPLab. Its Spark Streaming stream processing module was updated earlier this year to facilitate implementation of the lambda architecture, which takes advantage of both batch and stream processing. It's still essentially a micro-batching model, however. Written in Scala, Spark also offers APIs in Java, Python and R. Spark Streaming lets developers reuse code for batch processing, join streams against historical data or run ad-hoc queries on stream state, and it can be used to build powerful interactive applications beyond traditional analytics.
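The micro-batching model is easy to picture: incoming events are sliced into small, fixed-interval batches, and a fast batch job runs on each slice. Here's a minimal pure-Python sketch of that idea (a conceptual illustration, not the Spark Streaming API):

```python
def micro_batches(events, batch_interval=3):
    """Slice a (timestamp, value) event stream into fixed-interval batches,
    the way a micro-batch engine groups a stream before processing it."""
    batch, window_end = [], None
    for ts, value in events:
        if window_end is None:
            window_end = ts + batch_interval
        while ts >= window_end:          # current interval is over: emit it
            yield batch
            batch, window_end = [], window_end + batch_interval
        batch.append(value)
    if batch:
        yield batch

# Each batch is then processed as a unit -- here, a simple sum.
stream = [(0, 1), (1, 2), (4, 3), (5, 4), (9, 5)]
sums = [sum(b) for b in micro_batches(stream, batch_interval=3)]
# sums == [3, 7, 0, 5] -- note the empty third interval still produces a batch
```

The latency cost the next slide mentions is visible here: an event arriving at the start of an interval waits out the whole interval before anything processes it.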
The nascent Apache project Flink is creating a buzz for its ability to handle high volumes of data and produce real-time results.
While other popular stream processing technologies rely on "micro-batching" of incoming data -- fast batch operations on a small set of incoming data during a set period, which can add latency -- Flink treats data as a true stream with no beginning or end. It can perform batch processing as well.
This German technology grew out of research at Berlin's Technical University in 2009. Its DataStream API supports streams in Java and Scala, and its DataSet API for static data supports Java, Scala, and Python. It's building out the capability to handle queries before data even ends up in a database, making a database unnecessary in some cases.
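The contrast with micro-batching is that a true streaming engine handles each event the moment it arrives, updating results incrementally. A toy sketch of event-at-a-time processing with tumbling windows (purely illustrative, not Flink's DataStream API):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size=10):
    """Process one (timestamp, key) event at a time, incrementally
    updating a per-window count -- no waiting for a batch to fill."""
    counts = defaultdict(int)
    for ts, key in events:
        window = ts // window_size        # assign the event to its window
        counts[(window, key)] += 1        # update state as the event arrives
    return dict(counts)

clicks = [(1, "a"), (3, "b"), (12, "a"), (14, "a"), (21, "b")]
result = tumbling_window_counts(clicks, window_size=10)
# result[(1, "a")] == 2: two "a" events fell in the window [10, 20)
```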
Samza is built on top of Apache Kafka, a low-latency distributed messaging system. It originated as a project at LinkedIn as a lightweight framework for continuous data processing to reduce turnaround times involved in Hadoop's batch processing.
Kafka and Samza together perform similarly to HDFS and MapReduce. Just as HDFS ingests data for MapReduce jobs, Kafka supplies data for Samza, which can deliver sub-second results. It supports jobs written in Java, Scala or other languages that support JVM (Java virtual machine). Its key differentiator is its stateful stream processing capability.
It's known for simplicity, durability (it guarantees no messages will be lost), fault tolerance and scalability. LinkedIn is still working on some issues internally that also apply to other stream processing systems, such as late-arriving data.
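Stateful stream processing, Samza's key differentiator, means the processor keeps durable per-key state that survives across messages rather than treating each message in isolation. A toy sketch of the pattern (not Samza's actual task API):

```python
class StatefulCounter:
    """Keep per-key state across messages -- the core idea behind
    stateful stream processing, reduced to a running count per user."""
    def __init__(self):
        self.state = {}                   # key -> running count

    def process(self, message):
        key = message["user"]
        self.state[key] = self.state.get(key, 0) + 1
        return key, self.state[key]       # emit the updated count downstream

counter = StatefulCounter()
out = [counter.process(m) for m in
       [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]]
# out == [("alice", 1), ("bob", 1), ("alice", 2)]
```

In a real system the state lives in a local, changelog-backed store so it can be rebuilt after a failure; the dictionary here stands in for that.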
Often referred to as the Hadoop for real-time processing, Apache Storm is a generic event-based stream processing framework.
It's designed for scalability, fault tolerance and speed, but it is considered difficult to use. Its most recent 1.0 release, however, promises a big performance boost and improved usability.
It's written predominantly in Clojure, but integrates with applications written in any programming language that can read and write to standard input and output streams. It guarantees at-least-once processing (no data loss), but some observers criticize it for producing results that must be corrected down the line with a batch processor.
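At-least-once processing means a message is redelivered until the consumer acknowledges it: nothing is lost, but a failure mid-processing can produce duplicates -- which is exactly why results sometimes need downstream correction. A minimal sketch of the guarantee (illustrative only, not Storm's spout/bolt API):

```python
def deliver_at_least_once(messages, process, max_retries=3):
    """Redeliver each message until the consumer acknowledges it.
    No message is dropped, but a crash before the ack means a replay."""
    for msg in messages:
        for _ in range(max_retries):
            try:
                process(msg)   # consumer acks by returning normally
                break
            except RuntimeError:
                continue       # no ack received: replay the same message

attempts = []
def flaky_consumer(msg):
    attempts.append(msg)       # side effect happens before the ack
    if msg == "b" and attempts.count("b") == 1:
        raise RuntimeError("crashed before acking")

deliver_at_least_once(["a", "b", "c"], flaky_consumer)
# attempts == ["a", "b", "b", "c"]: "b" was processed twice, not lost
```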
Kafka Streams is a Java library for building distributed stream processing apps using Apache Kafka. Confluent, a startup founded by members of the original Kafka team at LinkedIn, contributed the project to the Apache Software Foundation.
Samza is built directly on Kafka, and frameworks such as Spark and Flink commonly use it as the low-latency distributed messaging system that ingests their data. Kafka Streams, by contrast, is focused on building stand-alone applications and microservices that need embedded stream processing capabilities without a dependency on complex clusters.
Meanwhile, the related Kafka Connect framework can be used to reliably connect Kafka with external systems such as databases, key-value stores, search indexes and file systems.
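The embedded-library model means the stream transformation runs inside the application's own process -- there is no cluster to submit a job to. Here's what that looks like reduced to a streaming word count in plain Python (a conceptual sketch, not the Kafka Streams API):

```python
from collections import Counter

def word_count_stream(lines):
    """A stream transformation that runs in-process: each record is
    handled as it arrives and an updated count is emitted downstream."""
    counts = Counter()
    for line in lines:                        # one record at a time
        for word in line.lower().split():
            counts[word] += 1
            yield word, counts[word]          # continuous stream of updates

updates = list(word_count_stream(["hello streams", "hello again"]))
# updates == [("hello", 1), ("streams", 1), ("hello", 2), ("again", 1)]
```

Because the processing is just library code, scaling out means running more instances of the application itself -- the model the article contrasts with cluster-based engines.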
One of the older Apache Software Foundation streaming projects, Flume is used to efficiently collect, aggregate and move large amounts of streaming data into a centralized data store such as Hadoop Distributed File System (HDFS), Apache Hive or HBase for single-event processing.
Flume deploys agents contained within each instance of the Java Virtual Machine (JVM) to "listen" to the types of incoming data, such as web server logs, and channel them appropriately. It's most often used for data that is stored for processing later.
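Flume's agent model boils down to three parts: a source that accepts events, a channel that buffers them, and a sink that drains the channel into a destination store. A miniature version of that pipeline (illustrative; Flume agents are actually configured declaratively, not coded this way):

```python
from queue import Queue

def run_agent(source_events, channel, sink):
    """A Flume-style agent in miniature: source -> channel -> sink."""
    for event in source_events:       # source: e.g. lines from a web server log
        channel.put(event)            # channel: buffers between source and sink
    while not channel.empty():
        sink.append(channel.get())    # sink: e.g. write into HDFS or HBase

store = []                            # stand-in for a store like HDFS
run_agent(["GET /index", "GET /about"], Queue(), store)
# store == ["GET /index", "GET /about"]
```

The channel is the important piece: it decouples how fast data arrives from how fast the destination can absorb it.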
Flume was the second-most popular message broker in a survey by OpsClarity, but far behind Kafka.
With roots in intelligence gathering, the U.S. National Security Agency's Niagarafiles (NiFi) software became an Apache top-level project in July 2015. It's now backed by Hortonworks, which bills it as being on the front line of Internet of Things technology. Because IoT operators might want to push data back out to sensors or have sensors communicate with each other, NiFi is bi-directional and point-to-point.
Users can use its intuitive graphical interface to design data flow, to identify their most and least useful data streams and more. Data from sources such as file systems, social media streams, Kafka, FTP and HTTP can flow to a variety of destinations for myriad purposes, and transformations can be introduced into that data flow.
NiFi dynamically generates a visual chain of custody in real time for every piece of data flowing through, and can define two separate paths for the same data sets, such as near real-time processing (hot path) and batch processing (cold path).
It's also billed as effective in securing the "jagged edge" of millions of different sensors with different protocols in different places.
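The hot-path/cold-path split and the chain of custody can be pictured together: each record fans out to both paths, and every hop is stamped into a provenance log. A toy model of the idea (not NiFi's actual API or provenance format):

```python
import time

def route(records):
    """Fan each record out to a hot (real-time) and a cold (batch) path,
    recording a provenance entry for every hop the record takes."""
    hot, cold, provenance = [], [], []
    for rec in records:
        provenance.append((rec["id"], "received", time.time()))
        hot.append(rec)                        # hot path: process right away
        provenance.append((rec["id"], "routed:hot", time.time()))
        cold.append(rec)                       # cold path: queue for batch work
        provenance.append((rec["id"], "routed:cold", time.time()))
    return hot, cold, provenance

hot, cold, prov = route([{"id": 1, "temp": 21.5}])
# prov holds one entry per hop, so record 1's full path is reconstructable
```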
Ignite is a distributed in-memory computing platform designed to deliver real-time analytics, machine-to-machine communication and high-performance transactional processing. Ignite loads data into memory from the existing disk-based storage layer -- it works with relational databases, NoSQL and Hadoop data stores -- to deliver huge performance gains. It can be scaled to handle petabytes of data simply by adding more nodes to the cluster.
It uses a distributed, massively parallel architecture on commodity hardware to power existing or new applications. Ignite can run on premises, on cloud platforms such as AWS and Microsoft Azure or in a hybrid environment.
The Apache Ignite unified API supports SQL, C++, .Net, Java, Scala, Groovy, PHP and Node.js.
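The performance gain comes from a read-through pattern: data is pulled from the slow disk-based store once, then served from memory on every subsequent access. A stripped-down sketch of that layer (a conceptual illustration, not the Ignite API):

```python
class ReadThroughCache:
    """An in-memory layer over a slower disk-based store: the first read
    of a key loads it from disk, and every later read is served from RAM."""
    def __init__(self, backing_store):
        self.backing = backing_store      # e.g. an RDBMS or NoSQL store
        self.memory = {}
        self.disk_reads = 0

    def get(self, key):
        if key not in self.memory:        # cache miss: go to the slow store
            self.disk_reads += 1
            self.memory[key] = self.backing[key]
        return self.memory[key]           # cache hit: memory-speed access

db = {"user:1": "alice"}                  # stand-in for the disk-based store
cache = ReadThroughCache(db)
cache.get("user:1"); cache.get("user:1")
# cache.disk_reads == 1: the second read never touched the backing store
```

Scaling to petabytes then becomes a matter of partitioning the in-memory dictionary across cluster nodes, as the article describes.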
Also a young Apache Software Foundation project, Apex grew out of DataTorrent. Conceptually similar to Flink, it promises speed, scale, high throughput, low latency and fault tolerance for the Hadoop ecosystem. Its creators, though, say it's more enterprise-tested than Flink. Positioned as an alternative to Storm and Spark for real-time stream processing, it's reported to be at least 10 times faster than Spark. And unlike Spark, which calls for strong Scala skills, Apex can be used by existing Java developers.
This Google-contributed project is still in the Apache Software Foundation incubator. Beam attempts to unify data processing frameworks with a core API, allowing easy portability between execution engines. It defines a directed acyclic graph data processing engine that can handle unbounded streams of data where out-of-order delivery is the norm rather than the exception. Similar to Google's Dataflow managed service, it uses the Dataflow SDK (software development kit) and a set of "runners" that map the SDK primitives to a particular execution engine.
It allows programmers to build stream processing pipelines that can then run in supported back ends including Google Cloud, Flink and Spark. Other runners such as Storm and MapReduce are in the works.
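Beam's portability idea is that the pipeline is defined once, as data, and a "runner" decides how to execute it. Reduced to plain Python, the separation looks like this (an analogy for the concept, not the Dataflow SDK):

```python
def build_pipeline():
    """Define the pipeline once, as an ordered list of transforms --
    the part that stays the same no matter which engine executes it."""
    return [
        lambda xs: (x * 2 for x in xs),        # a Map-style transform
        lambda xs: (x for x in xs if x > 2),   # a Filter-style transform
    ]

def local_runner(pipeline, source):
    """One possible runner: executes the transforms in-process. A
    distributed runner would map the same list onto Flink, Spark, etc."""
    data = iter(source)
    for transform in pipeline:
        data = transform(data)
    return list(data)

result = local_runner(build_pipeline(), [1, 2, 3])
# result == [4, 6] -- and the pipeline definition never named a runner
```

Swapping execution engines means writing a new runner, not rewriting the pipeline -- which is exactly the portability Beam is after.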