[SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5
This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R API docs.

mengxr

Author: Eric Liang <[email protected]>

Closes apache#8085 from ericl/docs.
ericl authored and mengxr committed Aug 12, 2015
1 parent 3ef0f32 commit 74a293f
Showing 3 changed files with 42 additions and 7 deletions.
4 changes: 2 additions & 2 deletions R/pkg/R/generics.R
@@ -535,8 +535,8 @@ setGeneric("showDF", function(x,...) { standardGeneric("showDF") })
#' @export
setGeneric("summarize", function(x,...) { standardGeneric("summarize") })

-##' rdname summary
-##' @export
+#' @rdname summary
+#' @export
setGeneric("summary", function(x, ...) { standardGeneric("summary") })

# @rdname tojson
8 changes: 4 additions & 4 deletions R/pkg/R/mllib.R
@@ -56,10 +56,10 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "DataFram
#'
#' Makes predictions from a model produced by glm(), similarly to R's predict().
#'
-#' @param model A fitted MLlib model
+#' @param object A fitted MLlib model
#' @param newData DataFrame for testing
#' @return DataFrame containing predicted values
-#' @rdname glm
+#' @rdname predict
#' @export
#' @examples
#'\dontrun{
@@ -76,10 +76,10 @@ setMethod("predict", signature(object = "PipelineModel"),
#'
#' Returns the summary of a model produced by glm(), similarly to R's summary().
#'
-#' @param model A fitted MLlib model
+#' @param x A fitted MLlib model
#' @return a list with a 'coefficient' component, which is the matrix of coefficients. See
#' summary.glm for more information.
-#' @rdname glm
+#' @rdname summary
#' @export
#' @examples
#'\dontrun{
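
The parameter renames above match how these S4 methods are invoked from SparkR. Below is a minimal sketch of both calls, assuming a Spark 1.5 SparkR shell where sqlContext is already defined; the variable names are illustrative and not part of this commit.

{% highlight r %}
# Fit a gaussian GLM; dots in R's iris column names become underscores in the DataFrame
irisDF <- createDataFrame(sqlContext, iris)
fittedModel <- glm(Sepal_Length ~ Sepal_Width, data = irisDF, family = "gaussian")

# predict(object, newData): 'object' is the fitted model, 'newData' is a DataFrame to score
fittedPredictions <- predict(fittedModel, newData = irisDF)
head(fittedPredictions)

# summary(x): 'x' is the fitted model; returns a list with a matrix of coefficients
summary(fittedModel)
{% endhighlight %}
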
37 changes: 36 additions & 1 deletion docs/sparkr.md
@@ -11,7 +11,8 @@ title: SparkR (R on Spark)
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
In Spark {{site.SPARK_VERSION}}, SparkR provides a distributed data frame implementation that
supports operations like selection, filtering, aggregation etc. (similar to R data frames,
-[dplyr](https://github.com/hadley/dplyr)) but on large datasets.
+[dplyr](https://github.com/hadley/dplyr)) but on large datasets. SparkR also supports distributed
+machine learning using MLlib.

# SparkR DataFrames

@@ -230,3 +231,37 @@ head(teenagers)

{% endhighlight %}
</div>

# Machine Learning

SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'. The example below shows how to build a gaussian GLM model using SparkR.

<div data-lang="r" markdown="1">
{% highlight r %}
# Create the DataFrame
df <- createDataFrame(sqlContext, iris)

# Fit a linear model over the dataset.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

# Model coefficients are returned in a similar format to R's native glm().
summary(model)
##$coefficients
## Estimate
##(Intercept) 2.2513930
##Sepal_Width 0.8035609
##Species_versicolor 1.4587432
##Species_virginica 1.9468169

# Make predictions based on the model.
predictions <- predict(model, newData = df)
head(select(predictions, "Sepal_Length", "prediction"))
## Sepal_Length prediction
##1 5.1 5.063856
##2 4.9 4.662076
##3 4.7 4.822788
##4 4.6 4.742432
##5 5.0 5.144212
##6 5.4 5.385281
{% endhighlight %}
</div>
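
Since the example above covers only the gaussian family, here is a minimal sketch of a binomial (logistic regression) fit following the same pattern. The subsetting of iris to two classes, the derived is_virginica label column, and all variable names are illustrative assumptions, not part of the committed docs.

{% highlight r %}
# Keep two species and derive a binary 0/1 label (illustrative preprocessing in local R)
irisSubset <- iris[iris$Species %in% c("versicolor", "virginica"), ]
irisSubset$is_virginica <- as.numeric(irisSubset$Species == "virginica")
binomialDF <- createDataFrame(sqlContext, irisSubset)

# Fit a logistic regression model using the binomial family
binomialModel <- glm(is_virginica ~ Sepal_Length + Sepal_Width, data = binomialDF, family = "binomial")

# Coefficients are reported as for the gaussian family
summary(binomialModel)

# Predicted labels appear in the 'prediction' column
binomialPredictions <- predict(binomialModel, newData = binomialDF)
head(select(binomialPredictions, "prediction"))
{% endhighlight %}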
