spark.glm {SparkR} | R Documentation |
Fits a generalized linear model against a SparkDataFrame. Users can call summary to print a summary of the fitted model, predict to make predictions on new data, and write.ml/read.ml to save/load fitted models.
spark.glm(data, formula, ...)

## S4 method for signature 'SparkDataFrame,formula'
spark.glm(data, formula, family = gaussian, tol = 1e-06,
  maxIter = 25, weightCol = NULL, regParam = 0, var.power = 0,
  link.power = 1 - var.power)

## S4 method for signature 'GeneralizedLinearRegressionModel'
summary(object)

## S3 method for class 'summary.GeneralizedLinearRegressionModel'
print(x, ...)

## S4 method for signature 'GeneralizedLinearRegressionModel'
predict(object, newData)

## S4 method for signature 'GeneralizedLinearRegressionModel,character'
write.ml(object, path, overwrite = FALSE)
data |
a SparkDataFrame for training. |
formula |
a symbolic description of the model to be fitted. Currently only a few formula operators are supported, including '~', '.', ':', '+', and '-'. |
... |
additional arguments passed to the method. |
family |
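a description of the error distribution and link function to be used in the model.
This can be a character string naming a family function, a family function, or
the result of a call to a family function. Refer to the R family documentation at
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/family.html.
Currently these families are supported: binomial, gaussian, Gamma, poisson and tweedie.
Note that there are two ways to specify the tweedie family: set family = "tweedie" and
specify var.power and link.power, or, when the statmod package is loaded, use the family
definition it provides, i.e. tweedie(var.power, link.power).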
|
tol |
positive convergence tolerance of iterations. |
maxIter |
integer giving the maximal number of IRLS iterations. |
weightCol |
the weight column name. If this is not set or NULL, all instance weights are treated as 1.0. |
regParam |
regularization parameter for L2 regularization. |
var.power |
the power in the variance function of the Tweedie distribution which provides the relationship between the variance and mean of the distribution. Only applicable to the Tweedie family. |
link.power |
the index in the power link function. Only applicable to the Tweedie family. |
object |
a fitted generalized linear model. |
x |
summary object of the fitted generalized linear model, as returned by summary. |
newData |
a SparkDataFrame for testing. |
path |
the directory where the model is saved. |
overwrite |
whether to overwrite if the output path already exists. The default is FALSE, which means an exception is thrown if the output path exists. |
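The formula operators listed for the formula argument can be illustrated locally. The following is a minimal sketch in base R using stats::model.matrix, not SparkR itself; the data frame d and its columns y, x1, x2 and x3 are hypothetical, and SparkR may encode or name the resulting columns differently.

```r
# Base R sketch (no Spark session needed): how the supported formula
# operators expand into model columns. The data frame is hypothetical.
d <- data.frame(
  y  = c(1.0, 2.0, 3.0, 4.0),
  x1 = c(0.5, 1.5, 2.5, 3.5),
  x2 = c(2, 1, 4, 3),
  x3 = c(1, 0, 1, 0)
)

# '+' adds terms to the model.
cols_plus <- colnames(model.matrix(y ~ x1 + x2, d))

# '.' stands for every column other than the response.
cols_dot <- colnames(model.matrix(y ~ ., d))

# '-' removes a term from the model.
cols_minus <- colnames(model.matrix(y ~ . - x3, d))

# ':' builds an interaction term, here the elementwise product of x1 and x2.
mm_inter <- model.matrix(y ~ x1:x2, d)

print(cols_plus)
print(cols_dot)
print(cols_minus)
print(colnames(mm_inter))
```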
spark.glm returns a fitted generalized linear model.
summary returns summary information of the fitted model as a list. The list includes at least coefficients (the coefficients matrix, which includes coefficients, standard errors of coefficients, t values and p values), null.deviance (null deviance), df.null and df.residual (null and residual degrees of freedom), aic (AIC) and iter (the number of iterations IRLS takes). If there are collinear columns in the data, the coefficients matrix only provides coefficients.
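As a rough illustration of how such a summary list is used, the sketch below fits the same kind of model locally with base R's stats::glm and reads the components off its summary. This is an analogue only, not SparkR: that the SparkR list has the same layout and column names is an assumption, and the variable names (titanic, fit, s, coef_matrix, p_values) are hypothetical.

```r
# Base R analogue (stats::glm), run locally; this is not SparkR.
titanic <- as.data.frame(Titanic)
fit <- glm(Freq ~ Sex + Age, family = gaussian, data = titanic)
s <- summary(fit)

# Coefficient matrix: estimate, standard error, t value and p value per term.
coef_matrix <- s$coefficients
print(coef_matrix)

# Pull out a single column, e.g. the p values.
p_values <- coef_matrix[, "Pr(>|t|)"]

# Scalar components: null deviance, AIC and number of IRLS iterations.
print(s$null.deviance)
print(s$aic)
print(s$iter)

# Base R stores the degrees of freedom as df.null and df.residual.
print(c(null = s$df.null, residual = s$df.residual))
```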
predict returns a SparkDataFrame containing predicted labels in a column named "prediction".
spark.glm since 2.0.0
summary(GeneralizedLinearRegressionModel) since 2.0.0
print.summary.GeneralizedLinearRegressionModel since 2.0.0
predict(GeneralizedLinearRegressionModel) since 1.5.0
write.ml(GeneralizedLinearRegressionModel, character) since 2.0.0
## Not run: 
sparkR.session()
t <- as.data.frame(Titanic)
df <- createDataFrame(t)
model <- spark.glm(df, Freq ~ Sex + Age, family = "gaussian")
summary(model)

# fitted values on training data
fitted <- predict(model, df)
head(select(fitted, "Freq", "prediction"))

# save fitted model to input path
path <- "path/to/model"
write.ml(model, path)

# can also read back the saved model and print
savedModel <- read.ml(path)
summary(savedModel)

# fit tweedie model
model <- spark.glm(df, Freq ~ Sex + Age, family = "tweedie",
                   var.power = 1.2, link.power = 0)
summary(model)

# use the tweedie family from statmod
library(statmod)
model <- spark.glm(df, Freq ~ Sex + Age, family = tweedie(1.2, 0))
summary(model)

## End(Not run)
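The var.power and link.power arguments can be made concrete with a small base R sketch. It assumes the standard Tweedie variance function, in which the variance is proportional to the mean raised to var.power, and a power link in which link.power = 0 denotes the log link. The helpers tweedie_variance and power_link are hypothetical names written for this illustration; they are not part of SparkR or statmod.

```r
# Hypothetical helpers illustrating var.power and link.power; not SparkR API.

# Tweedie variance function: the variance is proportional to mu^var.power.
tweedie_variance <- function(mu, var.power) mu^var.power

# Power link: g(mu) = mu^link.power, with link.power = 0 taken as log(mu).
power_link <- function(mu, link.power) {
  if (link.power == 0) log(mu) else mu^link.power
}

mu <- c(0.5, 1, 2, 4)

# var.power = 0: constant variance, as in the gaussian family.
v0 <- tweedie_variance(mu, var.power = 0)
# var.power = 1: variance proportional to the mean, as in the poisson family.
v1 <- tweedie_variance(mu, var.power = 1)
# var.power = 2: variance proportional to the squared mean, as in the Gamma family.
v2 <- tweedie_variance(mu, var.power = 2)

# The default link.power is 1 - var.power, as in the usage signature.
var.power <- 1.2
default_link_power <- 1 - var.power

# link.power = 0 gives the log link, link.power = 1 the identity link.
eta_log <- power_link(mu, link.power = 0)
eta_identity <- power_link(mu, link.power = 1)

print(rbind(v0, v1, v2))
print(default_link_power)
print(rbind(eta_log, eta_identity))
```

In the tweedie examples above, var.power = 1.2 with link.power = 0 therefore corresponds to a variance proportional to mu^1.2 combined with a log link.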