| vst {sctransform} | R Documentation |
Apply variance stabilizing transformation to UMI count data using a regularized Negative Binomial regression model. This will remove unwanted effects from UMI data and return Pearson residuals. Uses future_lapply; you can set the number of cores it will use to n with plan(strategy = "multicore", workers = n). If n_genes is set, only a (somewhat-random) subset of genes is used for estimating the initial model parameters.
vst(
umi,
cell_attr = NULL,
latent_var = c("log_umi"),
batch_var = NULL,
latent_var_nonreg = NULL,
n_genes = 2000,
n_cells = NULL,
method = "poisson",
do_regularize = TRUE,
theta_regularization = "od_factor",
res_clip_range = c(-sqrt(ncol(umi)), sqrt(ncol(umi))),
bin_size = 500,
min_cells = 5,
residual_type = "pearson",
return_cell_attr = FALSE,
return_gene_attr = TRUE,
return_corrected_umi = FALSE,
min_variance = -Inf,
bw_adjust = 3,
gmean_eps = 1,
theta_estimation_fun = "theta.ml",
theta_given = NULL,
verbosity = 2,
verbose = NULL,
show_progress = NULL
)
umi |
A matrix of UMI counts with genes as rows and cells as columns |
cell_attr |
A data frame containing the dependent variables; if omitted a data frame with umi and gene will be generated |
latent_var |
The independent variables to regress out as a character vector; must match column names in cell_attr; default is c("log_umi") |
batch_var |
The dependent variables indicating which batch a cell belongs to; no batch interaction terms used if omiited |
latent_var_nonreg |
The non-regularized dependent variables to regress out as a character vector; must match column names in cell_attr; default is NULL |
n_genes |
Number of genes to use when estimating parameters (default uses 2000 genes, set to NULL to use all genes) |
n_cells |
Number of cells to use when estimating parameters (default uses all cells) |
method |
Method to use for initial parameter estimation; one of 'poisson', 'qpoisson', 'nb_fast', 'nb', 'nb_theta_given', 'glmGamPoi' |
do_regularize |
Boolean that, if set to FALSE, will bypass parameter regularization and use all genes in first step (ignoring n_genes); default is FALSE |
theta_regularization |
Method to use to regularize theta; use 'log_theta' for the behavior prior to version 0.3; default is 'od_factor' |
res_clip_range |
Numeric of length two specifying the min and max values the results will be clipped to; default is c(-sqrt(ncol(umi)), sqrt(ncol(umi))) |
bin_size |
Number of genes to process simultaneously; this will determine how often the progress bars are updated and how much memory is being used; default is 500 |
min_cells |
Only use genes that have been detected in at least this many cells; default is 5 |
residual_type |
What type of residuals to return; can be 'pearson', 'deviance', or 'none'; default is 'pearson' |
return_cell_attr |
Make cell attributes part of the output; default is FALSE |
return_gene_attr |
Calculate gene attributes and make part of output; default is TRUE |
return_corrected_umi |
If set to TRUE output will contain corrected UMI matrix; see |
min_variance |
Lower bound for the estimated variance for any gene in any cell when calculating pearson residual; default is -Inf |
bw_adjust |
Kernel bandwidth adjustment factor used during regurlarization; factor will be applied to output of bw.SJ; default is 3 |
gmean_eps |
Small value added when calculating geometric mean of a gene to avoid log(0); default is 1 |
theta_estimation_fun |
Character string indicating which method to use to estimate theta (when method = poisson); default is 'theta.ml', but 'theta.mm' seems to be a good and fast alternative |
theta_given |
Named numeric vector of fixed theta values for the genes; will only be used if method is set to nb_theta_given; default is NULL |
verbosity |
An integer specifying whether to show only messages (1), messages and progress bars (2) or nothing (0) while the function is running; default is 2 |
verbose |
Deprecated; use verbosity instead |
show_progress |
Deprecated; use verbosity instead |
A list with components
y |
Matrix of transformed data, i.e. Pearson residuals, or deviance residuals; empty if |
umi_corrected |
Matrix of corrected UMI counts (optional) |
model_str |
Character representation of the model formula |
model_pars |
Matrix of estimated model parameters per gene (theta and regression coefficients) |
model_pars_outliers |
Vector indicating whether a gene was considered to be an outlier |
model_pars_fit |
Matrix of fitted / regularized model parameters |
model_str_nonreg |
Character representation of model for non-regularized variables |
model_pars_nonreg |
Model parameters for non-regularized variables |
genes_log_gmean_step1 |
log-geometric mean of genes used in initial step of parameter estimation |
cells_step1 |
Cells used in initial step of parameter estimation |
arguments |
List of function call arguments |
cell_attr |
Data frame of cell meta data (optional) |
gene_attr |
Data frame with gene attributes such as mean, detection rate, etc. (optional) |
times |
Time stamps at various points in the function |
In the first step of the algorithm, per-gene glm model parameters are learned. This step can be done
on a subset of genes and/or cells to speed things up.
If method is set to 'poisson', a poisson regression is done and
the negative binomial theta parameter is estimated using the response residuals in
theta_estimation_fun.
If method is set to 'qpoisson', coefficients and overdispersion (phi) are estimated by quasi
poisson regression and theta is estimated based on phi and the mean fitted value - this is currently
the fastest method with results very similar to 'glmGamPoi'
If method is set to 'nb_fast', coefficients and theta are estimated as in the
'poisson' method, but coefficients are then re-estimated using a proper negative binomial
model in a second call to glm with family = MASS::negative.binomial(theta = theta).
If method is set to 'nb', coefficients and theta are estimated by a single call to
MASS::glm.nb.
If method is set to 'glmGamPoi', coefficients and theta are estimated by a single call to
glmGamPoi::glm_gp.
vst_out <- vst(pbmc)