Spark Machine Learning Library Tutorial

Spark Overview

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. An execution graph describes the possible states of execution and the states between them. Spark also supports a set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Review of Spark Machine Language Library (MLlib)

MLlib is Spark's machine learning library, focusing on learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

Why MLlib? It is built on Apache Spark, which is a fast and general engine for large scale processing. Supposedly, running times or up to 100x faster than Hadoop MapReduce, or 10x faster on disk. Supports writing applications in Java, Scala, or Python.


How to get Spark MLlib? Just Install Spark. There is nothing special about MLlib installation, it is already included in Spark.

Download Spark from the downloads page of the project website. This documentation is for Spark version 1.1.0. The downloads page contains Spark packages for many popular HDFS versions. If you'd like to build Spark from scratch, visit building Spark with Maven. The screenshot below shows the options in the download page. The version we used for this tutorial can be seen from this picture.

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation. Spark requires Java 6+ and Python 2.6+. For the Scala API, Spark 1.1.0 uses Scala 2.10. We use Spark's Python API for MLlib.

Running Spark Python Shell

--- Downloads » cd spark-1.2.1-bin-hadoop2.4
--- Downloads/spark-1.2.1-bin-hadoop2.4 » ls
LICENSE NOTICE RELEASE bin conf data ec2 examples lib python sbin
--- Downloads/spark-1.2.1-bin-hadoop2.4 » cd bin
--- Downloads/spark-1.2.1-bin-hadoop2.4/bin » ls
beeline pyspark.cmd run-example.cmd spark-class.cmd spark-shell.cmd spark-submit beeline.cmd pyspark2.cmd run-example2.cmd spark-class2.cmd spark-shell2.cmd spark-submit.cmd compute-classpath.cmd pyspark run-example spark-class spark-shell spark-sql spark-submit2.cmd
--- Downloads/spark-1.2.1-bin-hadoop2.4/bin » ./pyspark
Python 2.7.5 (default, Mar 9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.2.1

Using Python version 2.7.5 (default, Mar 9 2014 22:15:05)
SparkContext available as sc.
>>> start writing python code here :-)

Spark MapReduce Example using Python

We calculate value of pi, similar to what the previous team did in Scala here.

--- Downloads/spark-1.2.1-bin-hadoop2.4 » ./bin/pyspark examples/src/main/python/ > pi.txt