Apache Mahout, a machine learning library for Hadoop since 2009, is joining the exodus away from MapReduce. The project’s community has decided to rework Mahout to support the increasingly popular Apache Spark in-memory data-processing framework, as well as the H2O engine for running machine learning and mathematical workloads at scale.
While data processing in Hadoop has traditionally been done using MapReduce, the batch-oriented framework has fallen out of vogue as users began demanding lower-latency processing for certain types of workloads — such as machine learning. However, nobody really wants to abandon Hadoop entirely because it’s still great for storing lots of data and many still use MapReduce for most of their workloads. Spark, which was developed at the University of California, Berkeley, has stepped in to fill that void in a growing number of cases where speed and ease of programming really matter.
H2O was developed separately by a startup called 0xdata (pronounced hexadata), although it’s also available as open source software. It’s an in-memory data engine specifically designed for running various types of types of statisical computations — including deep learning models — on data stored in the Hadoop Distributed File System.