| holdout.vimp.rfsrc {randomForestSRC} | R Documentation |
Hold out VIMP is calculated from the error rate of mini ensembles of trees (blocks of trees) grown with and without a variable. Applies to all families.
## S3 method for class 'rfsrc'
holdout.vimp(formula, data,
ntree = function(p, vtry){1000 * p / vtry},
nsplit = 10,
ntime = 50,
sampsize = function(x){x * .632},
samptype = "swor",
block.size = 10,
vtry = 1,
...)
formula |
A symbolic description of the model to be fit. |
data |
Data frame containing the y-outcome and x-variables. |
ntree |
Function specifying requested number of trees used for growing the forest. Inputs are dimension and number of holdout variables. The requested number of trees can also be a number. |
nsplit |
Non-negative integer value specifying number of random split points used to split a node (deterministic splitting corresponds to the value zero and is much slower). |
ntime |
Integer value used for survival to
constrain ensemble calculations to a grid of |
sampsize |
Function specifying size of subsampled data. Can also be a number. |
samptype |
Type of bootstrap used. |
vtry |
Number of variables randomly selected to be held out when growning a tree. |
block.size |
Specifies number of trees in a block when calculating holdout variable importance. |
... |
Further arguments to be passed to |
Holdout variable importance (holdout VIMP) is based on comparing the
results of two mini forests of trees (blocks of trees), the first in
which a random set of vtry features are held out (the holdout
forest), and the second in which no features are held out (the
baseline forest). The mini forests are constructed from blocks of
trees, where if the block size is one, then this is similar to the
idea of Breiman-Cutler which calculates permuation VIMP using one tree
at a time.
If a feature is held out in a block of trees, we refer to this as the
(feature, block) pair. The bootstrap for the trees in a (feature,
block) pair are identical in both forests. That is, the holdout block
is grow by holding out the feature, and the baseline block is grown
over the same trees, with the same bootstrap, but without holding out
any features. vtry controls how many features are held out in
every tree. If set to one (default), only one variable is held out in
every tree. Once a (feature, block) of trees has been grown, holdout
VIMP for a given variable v is calculated as follows. Gather the
block of trees where the feature was held out (from the holdout
forest) and calculate OOB prediction error. Next gather the
corresponding block of trees where v was not held out (from the
baseline forest) and calculate OOB prediction error. Holdout VIMP for
the (feature, block) pair is the difference between these two values.
The final holdout VIMP estimate for a feature v is obtained by averaging
holdout VIMP for (feature=v, block) over all blocks.
To summarize, holdout VIMP measures the importance of a variable when that variable is truly removed from growing the tree.
Accuracy of hold out VIMP depends critically on total number of trees.
If total number of trees is too small, then number of times a variable
is held out will be small and OOB error can suffer from high variance.
Therefore, ntree should be set fairly high - we recommend using
1000 times the number of features. Increasing vtry is another
way to increase number of times a variable is held out and therefore
reduces the burden of growing a large number of trees. In particular,
total number of trees needed decreases linearly with vtry. The
default ntree equals 1000 trees for each feature divided by
vtry. Keep in mind intrepretation of holdout VIMP is altered
when vtry is different than one. Thus this option should be
used with caution.
Invisibly a list with the following components (which themselves can be lists):
importance |
Holdout VIMP. |
baseline |
Prediction error for the baseline forest. |
holdout |
Prediction error for the holdout forest. |
Hemant Ishwaran and Udaya B. Kogalur
Lu M. and Ishwaran H. (2018). Expert Opinion: A prediction-based alternative to p-values in regression models. J. Thoracic and Cardiovascular Surgery, 155(3), 1130–1136.
## ------------------------------------------------------------
## regression analysis
## ------------------------------------------------------------
## new York air quality measurements
airq.obj <- holdout.vimp(Ozone ~ ., data = airquality, na.action = "na.impute")
print(airq.obj$importance)
## ------------------------------------------------------------
## classification analysis
## ------------------------------------------------------------
## iris data
iris.obj <- holdout.vimp(Species ~., data = iris)
print(iris.obj$importance)
## ------------------------------------------------------------
## imbalanced classification (using RFQ)
## ------------------------------------------------------------
if (library("caret", logical.return = TRUE)) {
## experimental settings
n <- 400
q <- 20
ir <- 6
f <- as.formula(Class ~ .)
## simulate the data, create minority class data
d <- twoClassSim(n, linearVars = 15, noiseVars = q)
d$Class <- factor(as.numeric(d$Class) - 1)
idx.0 <- which(d$Class == 0)
idx.1 <- sample(which(d$Class == 1), sum(d$Class == 1) / ir , replace = FALSE)
d <- d[c(idx.0,idx.1),, drop = FALSE]
## VIMP for RFQ with and without blocking
vmp1 <- imbalanced(f, d, importance = TRUE, block.size = 1)$importance[, 1]
vmp10 <- imbalanced(f, d, importance = TRUE, block.size = 10)$importance[, 1]
## holdout VIMP for RFQ with and without blocking
hvmp1 <- holdout.vimp(f, d, rfq = TRUE,
perf.type = "g.mean", block.size = 1)$importance[, 1]
hvmp10 <- holdout.vimp(f, d, rfq = TRUE,
perf.type = "g.mean", block.size = 10)$importance[, 1]
## compare VIMP values
imp <- 100 * cbind(vmp1, vmp10, hvmp1, hvmp10)
legn <- c("vimp-1", "vimp-10","hvimp-1", "hvimp-10")
colr <- rep(4,20+q)
colr[1:20] <- 2
ylim <- range(c(imp))
nms <- 1:(20+q)
par(mfrow=c(2,2))
barplot(imp[,1],col=colr,las=2,main=legn[1],ylim=ylim,names.arg=nms)
barplot(imp[,2],col=colr,las=2,main=legn[2],ylim=ylim,names.arg=nms)
barplot(imp[,3],col=colr,las=2,main=legn[3],ylim=ylim,names.arg=nms)
barplot(imp[,4],col=colr,las=2,main=legn[4],ylim=ylim,names.arg=nms)
}
## ------------------------------------------------------------
## multivariate regression analysis
## ------------------------------------------------------------
mtcars.mreg <- holdout.vimp(Multivar(mpg, cyl) ~., data = mtcars,
vtry = 3,
block.size = 1,
samptype = "swr",
sampsize = dim(mtcars)[1])
print(mtcars.mreg$importance)
## ------------------------------------------------------------
## mixed outcomes analysis
## ------------------------------------------------------------
mtcars.new <- mtcars
mtcars.new$cyl <- factor(mtcars.new$cyl)
mtcars.new$carb <- factor(mtcars.new$carb, ordered = TRUE)
mtcars.mix <- holdout.vimp(cbind(carb, mpg, cyl) ~., data = mtcars.new,
ntree = 100,
block.size = 2,
vtry = 1)
print(mtcars.mix$importance)
##------------------------------------------------------------
## survival analysis
##------------------------------------------------------------
## Primary biliary cirrhosis (PBC) of the liver
data(pbc, package = "randomForestSRC")
pbc.obj <- holdout.vimp(Surv(days, status) ~ ., pbc,
nsplit = 10,
ntree = 1000,
na.action = "na.impute")
print(pbc.obj$importance)
##------------------------------------------------------------
## competing risks
##------------------------------------------------------------
## WIHS analysis
## cumulative incidence function (CIF) for HAART and AIDS stratified by IDU
data(wihs, package = "randomForestSRC")
wihs.obj <- holdout.vimp(Surv(time, status) ~ ., wihs,
nsplit = 3,
ntree = 100)
print(wihs.obj$importance)