| holdoutvimp {randomForestSRC} | R Documentation |
Hold out VIMP is calculated from the error rate for trees grown with and without a variable. Applies to all families.
## S3 method for class 'rfsrc'
holdoutvimp(formula, data, ntree = 1000 * ncol(data) / vtry, ntree.max = 2000, nsplit = 10, ntime = 50, mtry = NULL, vtry = 1, fast = FALSE, verbose = TRUE, ...)
formula |
A symbolic description of the model to be fit. |
data |
Data frame containing the y-outcome and x-variables. |
ntree |
Number of trees used for growing the forest. |
ntree.max |
Maximum number of trees used when calculating prediction error for determining hold out VIMP. |
nsplit |
Non-negative integer value specifying number of random split points used to split a node (deterministic splitting corresponds to the value zero and is much slower). |
ntime |
Integer value used for survival to constrain ensemble calculations to a grid of |
mtry |
Number of variables randomly selected as candidates for splitting a node. |
vtry |
Number of variables randomly selected to be held out when growing a tree. |
fast |
Use fast random forests, |
verbose |
Provide verbose output? |
... |
Further arguments to be passed to |
Prior to growing a tree, a random set of vtry features is held
out. Tree growing proceeds as usual with the remaining features.
Once the forest is grown, hold out VIMP for a given variable v is
calculated as follows. Gather all trees where v was held out and
calculate OOB prediction error. Next gather all trees where v was not
held out and calculate OOB prediction error. Hold out VIMP for v is
the difference between these two values. Thus hold out VIMP measures
the importance of a variable when that variable is truly removed from
tree growing.
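The calculation described above can be written compactly. The notation here is introduced only for illustration and does not come from the package: \mathcal{B} is the set of all trees in the forest, \mathcal{B}_v^{out} the trees where v was held out, \mathcal{B}_v^{in} the trees where it was not, and Err_OOB the OOB prediction error of the corresponding sub-forest. The sign convention (held out minus not held out) is an assumption based on the order in which the two errors are described.

```latex
\[
\mathrm{VIMP}_{\text{holdout}}(v)
  = \mathrm{Err}_{\mathrm{OOB}}\!\left(\mathcal{B}_v^{\text{out}}\right)
  - \mathrm{Err}_{\mathrm{OOB}}\!\left(\mathcal{B}_v^{\text{in}}\right),
\qquad
\mathcal{B}_v^{\text{out}} \cup \mathcal{B}_v^{\text{in}} = \mathcal{B},
\quad
\mathcal{B}_v^{\text{out}} \cap \mathcal{B}_v^{\text{in}} = \emptyset .
\]
```

Under this convention, a positive value means prediction error is higher when v is unavailable during tree growing, which suggests that v carries predictive information.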
Accuracy of hold out VIMP depends heavily on the size of the forest.
If the number of trees is too small, then the number of trees where v is
held out will be small, and the resulting OOB error will have high
variance. Thus, ntree should be set fairly high; we recommend
using 1000 times the number of features. Increasing vtry is
another way to increase the number of hold out trees. In particular,
the number of trees needed scales inversely with vtry (the default is 1000 * ncol(data) / vtry).
Keep in mind, however, that the interpretation of hold out VIMP is altered
when vtry is different from 1. This is likely to be more of a
concern in low dimensional settings.
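To make the trade-off between ntree and vtry concrete, the following base R sketch evaluates the default from the usage line, ntree = 1000 * ncol(data) / vtry, on mtcars. The helper names default_ntree and expected_holdout are hypothetical, and the expected count assumes each tree holds out vtry of the p features uniformly at random, which is an assumption about the sampling scheme rather than something stated above.

```r
## Default forest size taken from the usage line: 1000 * ncol(data) / vtry.
## Note that ncol() counts every column of the data frame, outcome included.
default_ntree <- function(data, vtry = 1) 1000 * ncol(data) / vtry

ntree_1 <- default_ntree(mtcars, vtry = 1)   # 1000 * 11 / 1
ntree_5 <- default_ntree(mtcars, vtry = 5)   # 1000 * 11 / 5

## Expected number of trees in which one given feature is held out,
## ASSUMING vtry of p features are chosen uniformly at random per tree.
expected_holdout <- function(ntree, vtry, p) ntree * vtry / p

p <- ncol(mtcars) - 1                        # 10 features when mpg is the outcome

print(c(ntree_1 = ntree_1, ntree_5 = ntree_5))
print(expected_holdout(ntree_1, vtry = 1, p = p))
print(expected_holdout(ntree_5, vtry = 5, p = p))
```

Under that assumption, both settings give the same expected number of hold out trees per feature, which is why a larger vtry allows a proportionally smaller forest.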
The function uses the new get.tree option in predict to extract
specific trees from a forest and the hidden option vtry in
rfsrc. The latter creates a hidden array, holdout.array,
of zeroes and ones indicating which variables to hold out in each tree;
the number of rows equals the number of features and the number of columns
equals the number of trees. The array can also be passed as a hidden
option, but it is not checked for coherence, so users should be careful
when doing so.
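The layout of such an array can be illustrated with a small base R sketch. This is not the package's own construction: the feature names, the dimensions, and the use of sample.int are assumptions chosen only to show the features-by-trees shape and how trees would be grouped for one variable.

```r
set.seed(1)
p <- 4; ntree <- 6; vtry <- 1
features <- paste0("x", seq_len(p))          # hypothetical feature names

## One column per tree: a 1 marks a feature held out of that tree.
holdout <- sapply(seq_len(ntree), function(b) {
  col <- integer(p)
  col[sample.int(p, vtry)] <- 1L
  col
})
dimnames(holdout) <- list(features, paste0("tree", seq_len(ntree)))
print(holdout)

## Indices of trees where a given feature was, or was not, held out.
v <- "x2"
trees_out <- which(holdout[v, ] == 1)
trees_in  <- which(holdout[v, ] == 0)

## Minimal coherence checks a user might run before supplying such an array:
## entries are 0/1 and every tree holds out exactly vtry features.
stopifnot(all(holdout %in% c(0L, 1L)), all(colSums(holdout) == vtry))
```

Checks like the last line are one way to guard against an incoherent array, since no such validation is performed when the array is passed in.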
Hold out VIMP for each variable. For multivariate forests, hold out VIMP is calculated for each of the target outcomes.
Hemant Ishwaran and Udaya B. Kogalur
Lu M. and Ishwaran H. (2018). Expert Opinion: A prediction-based alternative to p-values in regression models. J. Thoracic and Cardiovascular Surgery, 155(3), 1130–1136.
## ------------------------------------------------------------
## Boston housing example
## ------------------------------------------------------------
if (library("mlbench", logical.return = TRUE)) {
  data(BostonHousing)
  hv <- holdoutvimp(medv ~ ., BostonHousing)
  print(hv)
}
## ------------------------------------------------------------
## Multivariate regression analysis
## ------------------------------------------------------------
hv <- holdoutvimp(cbind(mpg, cyl) ~., mtcars)
print(hv)
## ------------------------------------------------------------
## White wine classification example
## ------------------------------------------------------------
data(wine, package = "randomForestSRC")
wine$quality <- factor(wine$quality)
hv <- holdoutvimp(quality ~ ., wine, vtry = 5)
print(100 * hv)
## ------------------------------------------------------------
## pbc survival example
## ------------------------------------------------------------
data(pbc, package = "randomForestSRC")
hv <- holdoutvimp(Surv(days, status) ~ ., pbc, splitrule = "random")
print(100 * hv)
## ------------------------------------------------------------
## WIHS competing risk example
## ------------------------------------------------------------
data(wihs, package = "randomForestSRC")
hv <- holdoutvimp(Surv(time, status) ~ ., wihs, ntree = 1000)
print(100 * hv)