Apache Spark is making headlines as potentially the next big thing in Big Data. Coverage has focused on Spark’s speed and its potential as a replacement for Hadoop’s famously difficult MapReduce engine.
Despite the headlines and hype, Spark is far from enterprise ready, Gartner Research Director Nick Heudecker said. Heudecker is among those who suspect everyone’s watching Spark because no one wants to miss the next Hadoop.
"I think Spark is tremendously interesting. It is also tremendously early," said Heudecker, who specializes in Big Data and NoSQL research. "Everybody is seemingly taking a look at Spark, but the interest is not necessarily driving enterprise adoption yet."
Where Spark Shines
When Heudecker says early, he means it. Spark released version 1.3 in March — just a few weeks before its fifth birthday. Big Data wunderkind Matei Zaharia created Spark in 2009 while at UC Berkeley’s AMPLab — four years after Doug Cutting and Mike Cafarella created Hadoop. In 2013, Spark was donated to the Apache Software Foundation, where it became a top-level project early last year. More than 400 developers have contributed to the project in that time.
Spark has made significant progress in just three years, according to Vaibhav Nivargi, chief architect and co-founder of startup ClearStory Data. Nivargi was among the first engineers at Aster Data, where he helped develop key areas of the Aster MapReduce Platform. Three years ago, he began researching possible technologies for building what he calls a data harmonization engine, a data processing tool that can perform data profiling, map relationships and cull metadata from Big Data sets. He did not want a proprietary solution, yet he wasn’t satisfied with Hadoop’s MapReduce.
"Hadoop essentially opened up the doors for storing really, really large amounts of data on commodity machines so you don't have to think about which data is more interesting or less interesting. You can capture it all," Nivargi said. "The MapReduce processing paradigm, though, was very restricted in terms of what you can do (simply map and reduce), as well as the added complexities of planning these operations and the I/O overhead of staging the intermediate results."
Then he discovered Spark at Berkeley’s AMPLab. He liked what he calls the "true eloquence of Spark": the Resilient Distributed Dataset (RDD), which keeps data in memory for processing.
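A toy model can make the RDD idea more concrete. The Python sketch below is not Spark's API; `ToyDataset` and `load_numbers` are hypothetical names invented for illustration. It assumes only two behaviors generally associated with RDDs: transformations are recorded rather than run right away, and a cached result stays in memory so that later requests do not go back to the source.

```python
class ToyDataset:
    """A toy, single-machine stand-in for the RDD idea (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source            # callable that produces the raw records
        self._steps = tuple(steps)       # recorded (kind, function) transformations
        self._keep_in_memory = False     # set by cache()
        self._cached = None              # materialized records, once computed

    def map(self, fn):
        # Record the step; nothing is computed yet.
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        # Record the step; nothing is computed yet.
        return ToyDataset(self._source, self._steps + (("filter", fn),))

    def cache(self):
        # Ask for the result to be kept in memory after the first computation.
        self._keep_in_memory = True
        return self

    def collect(self):
        # Serve from memory when a cached result exists.
        if self._cached is not None:
            return list(self._cached)
        # Otherwise replay the recorded steps against the source.
        records = list(self._source())
        for kind, fn in self._steps:
            if kind == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        if self._keep_in_memory:
            self._cached = records
        return list(records)


source_reads = {"count": 0}


def load_numbers():
    """Hypothetical data source; counts how often it is actually read."""
    source_reads["count"] += 1
    return range(1, 11)


even_squares = (
    ToyDataset(load_numbers)
    .map(lambda x: x * x)
    .filter(lambda x: x % 2 == 0)
    .cache()
)

print(even_squares.collect())   # first call reads the source and computes
print(even_squares.collect())   # second call is served from memory
print("source reads:", source_reads["count"])
```

In this sketch the source is read once no matter how many times `collect()` is called on the cached dataset. Without the `cache()` call, every `collect()` would replay the recorded steps from the source.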
It’s this in-memory processing that makes Spark shine, experts say. To understand why, remember that Hadoop is built around a distributed file system. Using its native MapReduce for data processing means reading data from, and writing it back to, the disks on the cluster’s nodes. By contrast, Spark is a data processing engine that operates in memory. That makes it fast, really fast, especially when compared to MapReduce.
Apache Spark vs. MapReduce
"Right now the killer use case for Spark seems to be highly iterative processing, like machine learning types of use cases," Heudecker explained. "If you tried to do that in MapReduce, between every map and reduce task data has to be read to disc and written to disc. And that slows things down. Whereas with Spark, that information largely stays in-memory and the processing largely stays in-memory, and so it can iterate much faster."
Apache notes that Spark runs programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk. Last year, Spark pure-play Databricks used Spark to sort 100 terabytes of records within 23 minutes, besting Hadoop MapReduce’s 72 minutes.
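The iterative pattern Heudecker describes can be sketched in plain Python. This is a single-machine toy, not Spark or Hadoop code, and every name in it (`step`, `iterate_with_disk_staging`, `iterate_in_memory`) is hypothetical. It assumes a simple numeric refinement step and counts file operations rather than measuring time, so the numbers say nothing about real cluster performance.

```python
import json
import os
import tempfile


def step(values):
    """One pass of a toy refinement: move every value halfway toward the mean."""
    mean = sum(values) / len(values)
    return [(v + mean) / 2 for v in values]


def iterate_with_disk_staging(values, iterations, workdir):
    """Disk-staged loop: each pass reads its input from a file and writes its output back."""
    path = os.path.join(workdir, "stage.json")
    disk_ops = 0
    with open(path, "w") as f:          # stage the initial input
        json.dump(values, f)
    disk_ops += 1
    for _ in range(iterations):
        with open(path) as f:           # read the previous pass's output
            current = json.load(f)
        disk_ops += 1
        with open(path, "w") as f:      # write this pass's output
            json.dump(step(current), f)
        disk_ops += 1
    with open(path) as f:               # read the final result
        result = json.load(f)
    disk_ops += 1
    return result, disk_ops


def iterate_in_memory(values, iterations):
    """In-memory loop: intermediate results never leave RAM."""
    current = list(values)
    for _ in range(iterations):
        current = step(current)
    return current, 0


if __name__ == "__main__":
    data = [1, 2, 3, 10]
    with tempfile.TemporaryDirectory() as workdir:
        staged, staged_ops = iterate_with_disk_staging(data, 10, workdir)
    in_memory, memory_ops = iterate_in_memory(data, 10)
    print("same result:", staged == in_memory)
    print("file operations:", staged_ops, "vs", memory_ops)
```

Both loops produce the same values, but the disk-staged version performs two file operations per pass plus one to stage the input and one to read the result. On a real cluster each of those operations would also involve serialization and possibly network transfer, which is why the gap tends to widen as the number of iterations grows.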
For ClearStory Data’s customers, that speed has meant massive time savings, Nivargi said, particularly because ClearStory Data adds a front-end interface that spares customers from having to learn Spark’s supported languages: Scala, Java or Python.
"When it comes to the diversity of data they have to deal with, whether it’s databases or files or external sources, they will typically have spent a lot of human resources, as well as time, to try to blend this data using some current local technology. That might be a database or data mart or, in some cases, simply Excel, which stops scaling beyond a particular point in time," Nivargi said. "So our customers see a tremendous value when it comes to bringing 10, 15, 20 sources together and distributing the insights to a large group of people within the company."
In short, speed makes Spark "the de facto standard for data processing," according to Nivargi.
Beyond its famed speed, Spark incorporates a few other enticing Big Data bonuses, including:
- Spark SQL for SQL and structured data processing
- MLlib for machine learning
- GraphX for graph processing
- Spark Streaming for streaming data
The streaming data, machine learning and graph processing capabilities may make Spark a Big Data heavyweight as graph databases and the Internet of Things become more mainstream. But as Heudecker repeatedly points out, it’s still very early.
"It’s important to note that there is really no 80 percent use case," he said. "There's a whole lot of three percent use cases."
Hadoop Competitor – or Complement?
If it sounds like Spark is in competition with Hadoop, that’s partially true, but it is equally true to say that Spark complements Hadoop. That seems to be a pattern with Spark: it can compete with new technologies, yet it is not quite the same animal as any of them.
"You're not locked into either ecosystem, and because of that Spark can be both complementary to and competitive to Hadoop," Heudecker said.
Apache Spark Alternatives
Apache Flink is similar to Spark, albeit with some key differences, according to Heudecker. Flink is less mature, but Heudecker predicts the two projects will need to reconcile their similarities and differences at some point.
Apache Tez is sometimes mentioned as a competing solution, and there are certainly use cases where you would want to weigh Spark against Tez if you’re committed to the Hadoop platform, Heudecker added. Both Spark and Tez process data in memory, both are data processing frameworks and both are open source Apache projects.
Still, comparing the two is comparing apples to oranges, wrote Saggi Neumann, co-founder and CTO of Israel-based Big Data integration company Xplenty. Spark is a general engine for large-scale data processing, but Tez is built atop YARN and is "aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data," Neumann explained. And yet, they are similar enough that Neumann suggests vendor politics also plays a role in which you might choose.
Xplenty is still reviewing Spark as a potential new processing engine, but co-founder and CEO Yaniv Mor told Enterprise Apps Today the company is also looking at potential alternatives like Intel or Cloudera’s Impala. What the company has found thus far is that Spark and other options are "not that mature and production-ready yet." The company is still running benchmarks, he added.
Spark Not Yet Mature
Xplenty's experience seems to be common when it comes to Spark. On paper, it sounds amazing, but like many Big Data options, the reality is more complicated.
The Hadoop distributors are talking about Spark, and there are certainly signs that large technology companies, including IBM, Amazon Web Services and SAP (all sponsors at last year’s Spark Summit), will embrace Spark to some degree. At that same event, Cloudera, Intel, IBM, MapR and Databricks came out in favor of porting Hive to Spark, effectively making Spark the heir apparent to MapReduce for those distributions.
But so far, Heudecker said, the actual adopters are mid- and late-stage startups such as Spark pure-play Databricks, ClearStory Data and Paxata, which uses Spark for data preparation. Other companies primarily use Spark to power dashboards, he added.
It’s still unclear how relevant Spark will be, particularly for enterprise applications, according to Heudecker. Spark needs more maturity, a set of best practices and better tool support, he said.
"With Spark, you get a totally new set of knobs to turn versus something like MapReduce, and you need to figure out those knobs and what the settings need to be," Heudecker said. "So if you're looking for any level of maturity for Spark in the enterprise, it's not there yet. It's simply too early."
Loraine Lawson is a freelance writer specializing in technology and business issues, including integration, healthcare IT, cloud and Big Data.