Spark Machine Learning Library Tutorial

Spark MLlib for Basic Statistics

MLlib statistics tutorial and all of the examples can be found here. We used Spark Python API for our tutorial. Modular hierarchy and individual examples for Spark Python API MLlib can be found here.


corr(x, y = None, method = "pearson" | "spearman")
Correlations provide quantitative measurements of the statistical dependence between two random variables. Implementations for correlations are provided under mllib.stat.Statistics.
Pearson correlation, Reminder:
The population correlation coefficient ρX,Y between two random variables X and Y with expected values μX and μY and standard deviations σX and σY is defined as:

where E is the expected value operator, cov means covariance, and corr is a widely used alternative notation for the correlation coefficient.

Spearman Correlations

The Spearman's Rank Correlation Coefficient is used to discover the strength of a link between two sets of data. Pearson benchmarks linear relationship, Spearman benchmarks monotonic relationship.

Worked Example

Nine students held their breath, once after breathing normally and relaxing for one minute, and once after hyperventilating for one minute. The table indicates how long (in sec) they were able to hold their breath. Is there an association between the two variables?

--- Downloads/spark-1.2.1-bin-hadoop2.4 » ./bin/pyspark examples/src/main/python/mllib/ > pearson.txt
--- Downloads/spark-1.2.1-bin-hadoop2.4 » cat pearson.txt
Correlation: 0.966194

--- Downloads/spark-1.2.1-bin-hadoop2.4 » ./bin/pyspark examples/src/main/python/mllib/ > spearman.txt
--- Downloads/spark-1.2.1-bin-hadoop2.4 » cat spearman.txt
Correlation: 0.805881
Correlation: 0.672727


sampleByKey(withReplacement, fractions, seed)
sampleByKeyExact(withReplacement, fractions, seed)

It is common for a large population to consist of various-sized subpopulations (strata), for example, a training set with many more positive instances than negatives. To sample such populations, it is advantageous to sample each stratum independently to reduce the total variance or to represent small but important strata. This sampling design is called stratified sampling. We provide two versions of stratified sampling, sampleByKey and sampleByKeyExact. Both apply to an RDD of key-value pairs with key indicating the stratum, and both take a map from users that specifies the sampling probability for each stratum.

--- Downloads/spark-1.2.1-bin-hadoop2.4 » ./bin/pyspark examples/src/main/python/mllib/ > spl.txt
--- Downloads/spark-1.2.1-bin-hadoop2.4 » cat spl.txt
Loaded data with 100 examples from file:
Sampling RDD using fraction 0.1. Expected sample size = 10.
RDD.sample(): sample has 12 examples
RDD.takeSample(): sample has 10 examples

Keyed data using label (Int) as key ==> Orig
Sampled 15 examples using approximate stratified sampling (by label). ==> Sample
Fractions of examples with key
Key   Orig   Sample
0        0.43    0.266667
1        0.57    0.733333

Hypothesis Testing: goodness of fit; independence test

chiSqTest(observed: Vector, expected: Vector)
chiSqTest(observed: Matrix)
chiSqTest(data: RDD[LabeledPoint])

What is hypothesis testing?
Example: Your car will not start. You put forward the hypothesis that "the car that does not start because there is no petrol". You check the fuel gauge to either reject or accept the hypothesis. If you find there is petrol, you reject the hypothesis. Next, you hypothesise that "the car did not start because the spark plugs are dirty". You check the spark plugs to determine if they are dirty and accept or reject the hypothesis accordingly.

Hypothesis testing is a tool in statistics to determine whether a result is statistically significant, whether this result occurred by chance or not. Mllib currently supports Pearson's chi-squared ( χ2) tests for goodness of fit and independence, i.e. We Use the chi-square test for independence to determine whether there is a significant relationship between two variables (click here for more details). The input data types determine whether the goodness of fit or the independence test is conducted. The goodness of fit test requires an input type of Vector, whereas the independence test requires a Matrix as input.

Hypothesis testing is essential for data-driven applications. A test result shows the statistical significance of an event unlikely to have occurred by chance. For example, we can test whether there is a significant association between two samples via independence tests. Spark implements chi-squared tests for goodness-of-fit and independence.

Random Data generation

normalRDD(sc, size, [numPartitions, seed])
normalVectorRDD(sc, numRows, numCols, [numPartitions, seed])

Random data generation is useful for testing of existing algorithms and implementing randomized algorithms, prototyping and so on. MLlib provides methods under mllib.random.RandomRDDs for generating RDDs that contains i.i.d. values drawn from a distribution, e.g., uniform, standard normal, or Poisson.

--- Downloads/spark-1.2.1-bin-hadoop2.4 » ./bin/pyspark examples/src/main/python/mllib/ > rgg.txt
--- Downloads/spark-1.2.1-bin-hadoop2.4 » cat rgg.txt
Generated RDD of 10000 examples sampled from the standard normal distribution
First 5 samples:

Generated RDD of 10000 examples of length-2 vectors.
First 5 samples:
[ 0.91889733 -1.50144077]
[ 0.18241804 -0.06201148]
[-1.3610074 -2.42020844]
[ 1.15290288 -0.71680542]
[ 0.21139855 -0.30846834]