Hadoop distributor Cloudera has released a commercial edition of the Apache Spark program, which analyzes data in real time from within Cloudera’s Hadoop environments.
The release has the potential to expand Hadoop’s use for stream processing and faster machine learning.
“Data scientists love Spark,” said Matt Brandwein, Cloudera director of product marketing.
Spark does a good job at machine learning, which requires multiple iterations over the same data set, Brandwein said.
“Historically, you’d do that stuff with MapReduce, if you’re using Hadoop. But MapReduce is really slow,” Brandwein said, referring to how the MapReduce framework requires repeated reads and writes to disk to carry out machine learning tasks. Spark can do this work while the data is still in working memory. Maintainers of the software claim that Spark can run programs up to 100 times faster than Hadoop MapReduce, thanks to its in-memory design.
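To make the disk-versus-memory point concrete, here is a minimal sketch in plain Python. It uses neither the Spark nor the MapReduce API. The `DiskDataset` class, the `fit_mean` routine, and the four-number data set are all hypothetical, chosen only to illustrate the access pattern. The same iterative fit runs twice: once reloading the data from disk on every pass, and once against a copy loaded a single time and kept in memory.

```python
import os
import tempfile


class DiskDataset:
    """Hypothetical data set stored as one number per line in a text file."""

    def __init__(self, path):
        self.path = path
        self.reads = 0  # number of times the file has been read from disk

    def load(self):
        self.reads += 1
        with open(self.path) as f:
            return [float(line) for line in f]


def fit_mean(get_data, iterations=10, lr=0.5):
    """Estimate the mean of the data by gradient descent on squared error.

    get_data is called once per iteration, so its cost is paid every pass.
    """
    estimate = 0.0
    for _ in range(iterations):
        data = get_data()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= lr * gradient
    return estimate


def run_demo(iterations=10):
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(str(x) for x in [1.0, 2.0, 3.0, 4.0]))

        # Disk-bound style: every iteration goes back to the file.
        disk = DiskDataset(path)
        est_disk = fit_mean(disk.load, iterations)

        # In-memory style: read the file once, then iterate over the cached list.
        source = DiskDataset(path)
        cached = source.load()
        est_mem = fit_mean(lambda: cached, iterations)

        return disk.reads, source.reads, est_disk, est_mem
    finally:
        os.remove(path)


if __name__ == "__main__":
    disk_reads, mem_reads, est_disk, est_mem = run_demo()
    print("disk reads:", disk_reads, "in-memory reads:", mem_reads)
    print("estimates:", est_disk, est_mem)
```

Both runs arrive at the same estimate. The only difference is how often storage is touched, and that cost grows with each additional iteration in the disk-bound version.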