[SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test

added doc examples for python.

Author: jose.cambronero <[email protected]>

Closes apache#8154 from josepablocam/spark_9902.
jose.cambronero authored and mengxr committed Aug 18, 2015
1 parent f9d1a92 commit c90c605
Showing 1 changed file with 47 additions and 4 deletions.
51 changes: 47 additions & 4 deletions docs/mllib-statistics.md
@@ -438,22 +438,65 @@ run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstra
and interpret the hypothesis tests.

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// run a KS test for the sample versus a standard normal distribution
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
println(testResult) // summary of the test including the p-value, test statistic,
                    // and null hypothesis
                    // if our p-value indicates significance, we can reject the null hypothesis

// perform a KS test using a cumulative distribution function of our making
val myCDF: Double => Double = ...
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.

{% highlight java %}
import java.util.Arrays;

import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.mllib.stat.Statistics;
import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult;

JavaSparkContext jsc = ...
JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...));
KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0);
// summary of the test including the p-value, test statistic,
// and null hypothesis
// if our p-value indicates significance, we can reject the null hypothesis
System.out.println(testResult);
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.

{% highlight python %}
from pyspark.mllib.stat import Statistics

parallelData = sc.parallelize([1.0, 2.0, ... ])

# run a KS test for the sample versus a standard normal distribution
testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)
print(testResult) # summary of the test including the p-value, test statistic,
# and null hypothesis
# if our p-value indicates significance, we can reject the null hypothesis
# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with
# a lambda to calculate the CDF is not made available in the Python API
{% endhighlight %}
</div>
</div>
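
The examples in this diff all call `Statistics.kolmogorovSmirnovTest` and print the result, and the Python comment notes that a user-supplied CDF is not available in the Python API. As a rough illustration of what the one-sample, two-sided test statistic measures, the sketch below computes it in plain Python on a local list rather than an RDD. It is not Spark's implementation: the names `ks_statistic`, `standard_normal_cdf`, and `uniform_cdf` are hypothetical helpers introduced here, and it computes only the statistic, not a p-value.

```python
import math


def standard_normal_cdf(x):
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1] (a stand-in for a custom CDF)."""
    return min(1.0, max(0.0, x))


def ks_statistic(sample, cdf):
    """Largest absolute gap between the empirical CDF of `sample`
    and the theoretical `cdf` (two-sided, one-sample KS statistic)."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x,
        # so compare the theoretical value against both sides of the step.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical local sample; in Spark this data would live in an RDD.
sample = [0.1, 0.4, 0.7]
print(ks_statistic(sample, standard_normal_cdf))  # versus N(0, 1)
print(ks_statistic(sample, uniform_cdf))          # versus a custom CDF
```

A larger statistic means the sample's empirical distribution sits further from the hypothesized one. Turning the statistic into a p-value requires the Kolmogorov distribution, which this sketch leaves to the library.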

