| causal_forest {grf} | R Documentation |
Trains a causal forest that can be used to estimate conditional average treatment effects tau(x). When the treatment assignment W is binary and unconfounded, we have tau(x) = E[Y(1) - Y(0) | X = x], where Y(0) and Y(1) are potential outcomes corresponding to the two possible treatment states. When W is continuous, we effectively estimate an average partial effect Cov[Y, W | X = x] / Var[W | X = x], and interpret it as a treatment effect given unconfoundedness.
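The continuous-treatment estimand can be checked by hand in the simplest case. The sketch below uses base R only and does not call grf. It drops the conditioning on X, an assumption made purely for illustration, and shows that Cov[Y, W] / Var[W] coincides with the least-squares slope of Y on W. The simulated data and the names partial.effect and ols.slope are illustrative.

```r
# Illustrative simulation, not part of grf: continuous treatment W,
# no covariates, true effect of W on Y equal to 2.
set.seed(1)
n = 1000
W = rnorm(n)
Y = 2 * W + rnorm(n)

# The ratio from the description, without conditioning on X.
partial.effect = cov(Y, W) / var(W)

# The ordinary least-squares slope of Y on W.
ols.slope = unname(coef(lm(Y ~ W))[2])

# The two quantities agree up to floating-point error.
all.equal(partial.effect, ols.slope)
```

With covariates, the description above applies the same ratio conditionally on X = x rather than over the whole sample.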
causal_forest(X, Y, W, Y.hat = NULL, W.hat = NULL, sample.fraction = 0.5, mtry = NULL, num.trees = 2000, num.threads = NULL, min.node.size = NULL, honesty = TRUE, honesty.fraction = NULL, ci.group.size = 2, alpha = NULL, imbalance.penalty = NULL, stabilize.splits = TRUE, compute.oob.predictions = TRUE, seed = NULL, clusters = NULL, samples_per_cluster = NULL, tune.parameters = FALSE, num.fit.trees = 200, num.fit.reps = 50, num.optimize.reps = 1000)
X |
The covariates used in the causal regression. |
Y |
The outcome. |
W |
The treatment assignment (may be binary or real-valued). |
Y.hat |
Estimates of the expected responses E[Y | Xi], marginalizing over treatment. If Y.hat = NULL, these are estimated using a separate regression forest. See section 6.1.1 of the GRF paper for further discussion of this quantity. |
W.hat |
Estimates of the treatment propensities E[W | Xi]. If W.hat = NULL, these are estimated using a separate regression forest. |
sample.fraction |
Fraction of the data used to build each tree. Note: If honesty = TRUE, these subsamples will further be cut by a factor of honesty.fraction. |
mtry |
Number of variables tried for each split. |
num.trees |
Number of trees grown in the forest. Note: Getting accurate confidence intervals generally requires more trees than getting accurate predictions. |
num.threads |
Number of threads used in training. If set to NULL, the software automatically selects an appropriate number. |
min.node.size |
A target for the minimum number of observations in each tree leaf. Note that nodes with size smaller than min.node.size can occur, as in the original randomForest package. |
honesty |
Whether to use honest splitting (i.e., sub-sample splitting). |
honesty.fraction |
The fraction of data that will be used for determining splits if honesty = TRUE. Corresponds to set J1 in the notation of the paper. When using the defaults (honesty = TRUE and honesty.fraction = NULL), half of the data will be used for determining splits. |
ci.group.size |
The forest will grow ci.group.size trees on each subsample. In order to provide confidence intervals, ci.group.size must be at least 2. |
alpha |
A tuning parameter that controls the maximum imbalance of a split. |
imbalance.penalty |
A tuning parameter that controls how harshly imbalanced splits are penalized. |
stabilize.splits |
Whether or not the treatment should be taken into account when determining the imbalance of a split (experimental). |
compute.oob.predictions |
Whether out-of-bag (OOB) predictions on the training set should be precomputed. |
seed |
The seed of the C++ random number generator. |
clusters |
Vector of integers or factors specifying which cluster each observation corresponds to. |
samples_per_cluster |
If sampling by cluster, the number of observations to be sampled from each cluster when training a tree. If NULL, we set samples_per_cluster to the size of the smallest cluster. If some clusters are smaller than samples_per_cluster, the whole cluster is used every time the cluster is drawn. Note that clusters with fewer than samples_per_cluster observations get relatively smaller weight than others in training the forest, i.e., the contribution of a given cluster to the final forest scales with the minimum of the number of observations in the cluster and samples_per_cluster. |
tune.parameters |
If TRUE, parameters left as NULL are tuned by cross-validation; if FALSE, they are set to their defaults. |
num.fit.trees |
The number of trees in each 'mini forest' used to fit the tuning model. |
num.fit.reps |
The number of forests used to fit the tuning model. |
num.optimize.reps |
The number of random parameter values considered when using the model to select the optimal parameters. |
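As a rough illustration of how the sampling arguments combine, the following base R arithmetic follows the wording of the sample.fraction, honesty.fraction and samples_per_cluster entries. It does not call grf. The values of n and cluster.sizes are made up, and the exact rounding used internally is not specified here, so treat the numbers as approximate.

```r
# Hypothetical data size; not taken from the documentation.
n = 1000
sample.fraction = 0.5
# Stated default behaviour: half of the subsample determines splits.
honesty.fraction = 0.5

# Observations drawn for each tree.
subsample.size = n * sample.fraction
# With honesty = TRUE, the subsample is cut further by honesty.fraction.
split.size = subsample.size * honesty.fraction
# Assumption: the rest of the subsample is held out from split selection.
holdout.size = subsample.size - split.size

# Hypothetical cluster sizes.
cluster.sizes = c(a = 5, b = 20, c = 50)
samples_per_cluster = 20
# A cluster's contribution scales with min(cluster size, samples_per_cluster).
contribution = pmin(cluster.sizes, samples_per_cluster)
relative.weight = contribution / sum(contribution)

# With the NULL default, samples_per_cluster is the smallest cluster size,
# so every cluster contributes the same number of observations.
default.contribution = pmin(cluster.sizes, min(cluster.sizes))
```

In this sketch the small cluster a carries a quarter of the weight of b or c once samples_per_cluster exceeds its size, which is the imbalance the samples_per_cluster entry warns about.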
A trained causal forest object.
## Not run:
# Load the package, then train a causal forest.
library(grf)
n = 50; p = 10
X = matrix(rnorm(n*p), n, p)
W = rbinom(n, 1, 0.5)
Y = pmax(X[,1], 0) * W + X[,2] + pmin(X[,3], 0) + rnorm(n)
c.forest = causal_forest(X, Y, W)
# Predict using the forest.
X.test = matrix(0, 101, p)
X.test[,1] = seq(-2, 2, length.out = 101)
c.pred = predict(c.forest, X.test)
# Predict on out-of-bag training samples.
c.pred = predict(c.forest)
# Predict with confidence intervals; growing more trees is now recommended.
c.forest = causal_forest(X, Y, W, num.trees = 4000)
c.pred = predict(c.forest, X.test, estimate.variance = TRUE)
# In some examples, pre-fitting models for Y and W separately may
# be helpful (e.g., if different models use different covariates).
# In some applications, one may even want to get Y.hat and W.hat
# using a completely different method (e.g., boosting).
n = 2000; p = 20
X = matrix(rnorm(n * p), n, p)
TAU = 1 / (1 + exp(-X[, 3]))
W = rbinom(n, 1, 1 / (1 + exp(-X[, 1] - X[, 2])))
Y = pmax(X[, 2] + X[, 3], 0) + rowMeans(X[, 4:6]) / 2 + W * TAU + rnorm(n)
forest.W = regression_forest(X, W, tune.parameters = TRUE)
W.hat = predict(forest.W)$predictions
forest.Y = regression_forest(X, Y, tune.parameters = TRUE)
Y.hat = predict(forest.Y)$predictions
forest.Y.varimp = variable_importance(forest.Y)
# Note: Forests may have a hard time when trained on very few variables
# (e.g., ncol(X) = 1, 2, or 3). We recommend not being too aggressive
# in selection.
selected.vars = which(forest.Y.varimp / mean(forest.Y.varimp) > 0.2)
tau.forest = causal_forest(X[, selected.vars], Y, W,
                           W.hat = W.hat, Y.hat = Y.hat,
                           tune.parameters = TRUE)
tau.hat = predict(tau.forest)$predictions
## End(Not run)
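The selection rule in the last example keeps a variable when its importance exceeds 20% of the mean importance. A minimal base R sketch with made-up importance values shows which indices survive. The vector varimp is hypothetical and stands in for the output of variable_importance.

```r
# Hypothetical importance scores for five variables.
varimp = c(0.40, 0.30, 0.02, 0.25, 0.03)

# Importance relative to the mean across variables.
relative.importance = varimp / mean(varimp)

# Keep variables whose relative importance exceeds 0.2.
selected.vars = which(relative.importance > 0.2)
```

Here variables 3 and 5 fall below the threshold and are dropped, leaving three columns for the causal forest.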