| varsel {projpred} | R Documentation |
Perform the projection predictive variable selection for GLMs, GLMMs, GAMs, and GAMMs. This variable selection consists of a search part and an evaluation part. The search part determines the solution path, i.e., the best submodel for each submodel size (number of predictor terms). The evaluation part determines the predictive performance of the submodels along the solution path.
varsel(object, ...)

## Default S3 method:
varsel(object, ...)

## S3 method for class 'refmodel'
varsel(
  object,
  d_test = NULL,
  method = NULL,
  ndraws = NULL,
  nclusters = 20,
  ndraws_pred = 400,
  nclusters_pred = NULL,
  refit_prj = !inherits(object, "datafit"),
  nterms_max = NULL,
  verbose = TRUE,
  lambda_min_ratio = 1e-05,
  nlambda = 150,
  thresh = 1e-06,
  regul = 1e-04,
  penalty = NULL,
  search_terms = NULL,
  seed = sample.int(.Machine$integer.max, 1),
  ...
)
object |
An object of class |
... |
Arguments passed to |
d_test |
For internal use only. A |
method |
The method for the search part. Possible options are |
ndraws |
Number of posterior draws used in the search part. Ignored if
|
nclusters |
Number of clusters of posterior draws used in the search
part. Ignored in case of L1 search (because L1 search always uses a single
cluster). For the meaning of |
ndraws_pred |
Only relevant if |
nclusters_pred |
Only relevant if |
refit_prj |
A single logical value indicating whether to fit the
submodels along the solution path again ( |
nterms_max |
Maximum number of predictor terms until which the search is
continued. If |
verbose |
A single logical value indicating whether to print out additional information during the computations. |
lambda_min_ratio |
Only relevant for L1 search. Ratio between the smallest and largest lambda in the L1-penalized search. This parameter essentially determines how long the search is carried out, i.e., how large the explored submodels can become. There is no need to change this unless the program gives a warning about it. |
nlambda |
Only relevant for L1 search. Number of values in the lambda grid for L1-penalized search. No need to change this unless the program gives a warning about this. |
thresh |
Only relevant for L1 search. Convergence threshold when computing the L1 path. Usually, there is no need to change this. |
regul |
A number giving the amount of ridge regularization when projecting onto (i.e., fitting) submodels which are GLMs. Usually there is no need for regularization, but sometimes we need to add some regularization to avoid numerical problems. |
penalty |
Only relevant for L1 search. A numeric vector determining the
relative penalties or costs for the predictors. A value of |
search_terms |
Only relevant for forward search. A custom character
vector of predictor term blocks to consider for the search. Section
"Details" below describes more precisely what "predictor term block" means.
The intercept ( |
seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. If |
Arguments ndraws, nclusters, nclusters_pred, and ndraws_pred
are automatically truncated at the number of posterior draws in the
reference model (which is 1 for datafits). Using fewer draws or clusters
in ndraws, nclusters, nclusters_pred, or ndraws_pred than there are
posterior draws in the reference model may result in slightly inaccurate
projection performance. Increasing these arguments increases the
computation time linearly.
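The truncation rule described above can be illustrated with a minimal sketch in base R. The helper truncate_at_ref_draws() is hypothetical and not part of projpred; it only mirrors the stated behavior (the requested number is capped at the number of posterior draws in the reference model).

```r
# Hypothetical helper (NOT a projpred function): cap a requested number of
# draws or clusters at the number of posterior draws in the reference model.
truncate_at_ref_draws <- function(requested, n_draws_ref) {
  min(requested, n_draws_ref)
}

# Request below the number of reference model draws: left unchanged.
truncate_at_ref_draws(400, n_draws_ref = 4000)
# Request above the number of reference model draws: truncated.
truncate_at_ref_draws(400, n_draws_ref = 250)
# A datafit has a single "draw", so any request is truncated to 1.
truncate_at_ref_draws(20, n_draws_ref = 1)
```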
For argument method, there are some restrictions: For a reference model
with multilevel or additive formula terms, only the forward search is
available. Furthermore, argument search_terms requires a forward search
to take effect.
L1 search is faster than forward search, but forward search may be more accurate. Furthermore, forward search may find a sparser model with comparable performance to that found by L1 search, but it may also start overfitting when more predictors are added.
An L1 search may select interaction terms before the corresponding main terms are selected. If this is undesired, choose the forward search instead.
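To see how the two search methods differ on a given problem, the solution paths can be compared directly. The following is a sketch only: it assumes that the method argument accepts the strings "L1" and "forward", that the rstanarm and projpred packages are installed, and that solution_terms() returns the solution path of a vsel object. The small values for chains, iter, nterms_max, nclusters, and nclusters_pred are chosen for speed and are not recommended in general.

```r
# Search methods to compare (assumed values of argument `method`):
search_methods <- c("L1", "forward")

if (requireNamespace("rstanarm", quietly = TRUE) &&
    requireNamespace("projpred", quietly = TRUE)) {
  # Example data shipped with projpred:
  data("df_gaussian", package = "projpred")
  dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
  # Reference model fit (small `chains` and `iter` only for speed):
  fit <- rstanarm::stan_glm(
    y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
    QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
  )
  # Run the variable selection once per search method, with the same seed:
  vs_by_method <- lapply(search_methods, function(method_crr) {
    projpred::varsel(fit, method = method_crr, nterms_max = 3,
                     nclusters = 5, nclusters_pred = 10, seed = 5555,
                     verbose = FALSE)
  })
  names(vs_by_method) <- search_methods
  # Solution paths side by side:
  print(lapply(vs_by_method, projpred::solution_terms))
}
```

For a main-effects-only model like this one, the two paths often agree; differences are more likely when predictors are correlated or when interaction terms are present.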
The elements of the search_terms character vector don't need to be
individual predictor terms. Instead, they can be building blocks consisting
of several predictor terms connected by the + symbol. To understand how
these building blocks work, it is important to know how projpred's
forward search works: It starts with an empty vector chosen which will
later contain already selected predictor terms. Then, the search iterates
over model sizes j = 1, ..., J. The candidate
models at model size j are constructed from those elements from
search_terms which yield model size j when combined with the
chosen predictor terms. Note that sometimes, there may be no candidate
models for model size j. Also note that internally, search_terms is
expanded to include the intercept ("1"), so the first step of the search
(model size 1) always consists of the intercept-only model as the only
candidate.
As a search_terms example, consider a reference model with formula y ~ x1 + x2 + x3. Then, to ensure that x1 is always included in the
candidate models, specify search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3"). This search would start with y ~ 1 as the only
candidate at model size 1. At model size 2, y ~ x1 would be the only
candidate. At model size 3, y ~ x1 + x2 and y ~ x1 + x3 would be the
two candidates. At the last model size of 4, y ~ x1 + x2 + x3 would be
the only candidate. As another example, to exclude x1 from the search,
specify search_terms = c("x2", "x3", "x2 + x3").
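The candidate construction described above can be sketched in base R. This is an illustration under simplifying assumptions, not projpred's internal code: the helpers split_block() and candidates_at_size() are hypothetical, the model size is taken to count the intercept plus the distinct predictor terms (as in the example above), and only plain main-effect terms joined by + are handled (no interactions, multilevel, or additive terms).

```r
# Hypothetical helper: split a block such as "x1 + x2" into c("x1", "x2").
split_block <- function(block) {
  trimws(strsplit(block, "+", fixed = TRUE)[[1]])
}

# Hypothetical helper: candidate models at model size `size`, given the
# already chosen predictor terms `chosen`.
candidates_at_size <- function(search_terms, chosen, size) {
  # Expand `search_terms` to include the intercept:
  blocks <- union("1", search_terms)
  # Combine each block with the chosen terms (intercept handled separately):
  cands <- lapply(blocks, function(block) {
    setdiff(sort(union(chosen, split_block(block))), "1")
  })
  # Keep the combinations which yield the requested model size:
  keep <- vapply(cands, function(trms) length(trms) + 1L == size, logical(1))
  # Different blocks may yield the same model, so remove duplicates:
  unique(vapply(cands[keep], function(trms) {
    if (length(trms) == 0) "1" else paste(trms, collapse = " + ")
  }, character(1)))
}

# First example: x1 is always included.
st <- c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
candidates_at_size(st, chosen = character(0), size = 1)  # "1"
candidates_at_size(st, chosen = character(0), size = 2)  # "x1"
candidates_at_size(st, chosen = "x1", size = 3)  # "x1 + x2" "x1 + x3"
candidates_at_size(st, chosen = c("x1", "x2"), size = 4)  # "x1 + x2 + x3"

# Second example: x1 is excluded from the search.
st_excl <- c("x2", "x3", "x2 + x3")
candidates_at_size(st_excl, chosen = character(0), size = 2)  # "x2" "x3"
```

In the real search, the candidate that is added to chosen at each model size is determined by the projection, which this sketch does not attempt to reproduce; here, chosen is supplied by hand.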
An object of class vsel. The elements of this object are not meant
to be accessed directly but instead via helper functions (see the vignette
or type ?projpred).
if (requireNamespace("rstanarm", quietly = TRUE)) {
  # Data:
  dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

  # The "stanreg" fit which will be used as the reference model (with small
  # values for `chains` and `iter`, but only for technical reasons in this
  # example; this is not recommended in general):
  fit <- rstanarm::stan_glm(
    y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
    QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
  )

  # Variable selection (here without cross-validation and with small values
  # for `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the
  # sake of speed in this example; this is not recommended in general):
  vs <- varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
               seed = 5555)
  # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`,
  # and `?solution_terms.vsel` for possible post-processing functions.
}