Spark MLlib for Basic Statistics
MLlib statistics tutorial and all of the examples can be found here. We used Spark Python API for our tutorial. Modular hierarchy and individual examples for Spark Python API MLlib can be found here.
Correlations
corr(x, y = None, method = "pearson"  "spearman")
Correlations provide quantitative measurements of the statistical dependence between two random variables. Implementations for correlations are provided under mllib.stat.Statistics.
Pearson correlation, Reminder:
The population correlation coefficient ρX,Y between two random variables X and Y with expected values μX and μY and standard deviations σX and σY is defined as:
where E is the expected value operator, cov means covariance, and corr is a widely used alternative notation for the correlation coefficient.
Spearman Correlations
The Spearman's Rank Correlation Coefficient is used to discover the strength of a link between two sets of data. Pearson benchmarks linear relationship, Spearman benchmarks monotonic relationship.
Worked Example
Nine students held their breath, once after breathing normally and relaxing for one minute, and once after hyperventilating for one minute. The table indicates how long (in sec) they were able to hold their breath. Is there an association between the two variables?
 Downloads/spark1.2.1binhadoop2.4 » ./bin/pyspark examples/src/main/python/mllib/pearson.py > pearson.txt
 Downloads/spark1.2.1binhadoop2.4 » cat pearson.txt
Correlation: 0.966194
 Downloads/spark1.2.1binhadoop2.4 » ./bin/pyspark examples/src/main/python/mllib/spearman.py > spearman.txt
 Downloads/spark1.2.1binhadoop2.4 » cat spearman.txt
Correlation: 0.805881
Correlation: 0.672727
Sampling
sampleByKey(withReplacement, fractions, seed)
sampleByKeyExact(withReplacement, fractions, seed)
It is common for a large population to consist of varioussized subpopulations (strata), for example, a training set with many more positive instances than negatives. To sample such populations, it is advantageous to sample each stratum independently to reduce the total variance or to represent small but important strata. This sampling design is called stratified sampling. We provide two versions of stratified sampling, sampleByKey and sampleByKeyExact. Both apply to an RDD of keyvalue pairs with key indicating the stratum, and both take a map from users that specifies the sampling probability for each stratum.
 Downloads/spark1.2.1binhadoop2.4 » ./bin/pyspark examples/src/main/python/mllib/sampled_rdds.py > spl.txt
 Downloads/spark1.2.1binhadoop2.4 » cat spl.txt
Loaded data with 100 examples from file: data/mllib/sample_binary_classification_data.txt
Sampling RDD using fraction 0.1. Expected sample size = 10.
RDD.sample(): sample has 12 examples
RDD.takeSample(): sample has 10 examples
Keyed data using label (Int) as key ==> Orig
Sampled 15 examples using approximate stratified sampling (by label). ==> Sample
Fractions of examples with key
Key Orig Sample
0 0.43 0.266667
1 0.57 0.733333
Hypothesis Testing: goodness of fit; independence test
chiSqTest(observed: Vector, expected: Vector)
chiSqTest(observed: Matrix)
chiSqTest(data: RDD[LabeledPoint])
What is hypothesis testing?
Example: Your car will not start. You put forward the hypothesis that "the car that does not start because there is no petrol". You check the fuel gauge to either reject or accept the hypothesis. If you find there is petrol, you reject the hypothesis.
Next, you hypothesise that "the car did not start because the spark plugs are dirty". You check the spark plugs to determine if they are dirty and accept or reject the hypothesis accordingly.
Hypothesis testing is a tool in statistics to determine whether a result is statistically significant, whether this result occurred by chance or not. Mllib currently supports Pearson's chisquared ( χ2) tests for goodness of fit and independence, i.e. We Use the chisquare test for independence to determine whether there is a significant relationship between two variables (click here for more details). The input data types determine whether the goodness of fit or the independence test is conducted. The goodness of fit test requires an input type of Vector, whereas the independence test requires a Matrix as input.
Hypothesis testing is essential for datadriven applications. A test result shows the statistical significance of an event unlikely to have occurred by chance. For example, we can test whether there is a significant association between two samples via independence tests. Spark implements chisquared tests for goodnessoffit and independence.
Random Data generation
normalRDD(sc, size, [numPartitions, seed])
normalVectorRDD(sc, numRows, numCols, [numPartitions, seed])
Random data generation is useful for testing of existing algorithms and implementing randomized algorithms, prototyping and so on. MLlib provides methods under mllib.random.RandomRDDs for generating RDDs that contains i.i.d. values drawn from a distribution, e.g., uniform, standard normal, or Poisson.
 Downloads/spark1.2.1binhadoop2.4 » ./bin/pyspark examples/src/main/python/mllib/random_rdd_generation.py > rgg.txt
 Downloads/spark1.2.1binhadoop2.4 » cat rgg.txt
Generated RDD of 10000 examples sampled from the standard normal distribution
First 5 samples:
0.717664108581
0.893664635151
1.0540084425
0.732419729409
1.57860965545
Generated RDD of 10000 examples of length2 vectors.
First 5 samples:
[ 0.91889733 1.50144077]
[ 0.18241804 0.06201148]
[1.3610074 2.42020844]
[ 1.15290288 0.71680542]
[ 0.21139855 0.30846834]
