| stratified {splitstackshape} | R Documentation |
The stratified function samples from a
data.frame or a data.table in which one or more columns can be used as a
"stratification" or "grouping" variable. The result is a new
data.table with the specified number of samples from each group.
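As a rough illustration of what sampling within strata means, the following base R sketch draws a fixed number of rows from each group. It is not the package's implementation; the `toy` data frame and the `per_group` value are made up for the example.

```r
set.seed(42)

# Hypothetical data: 12 rows in three groups of unequal size
toy <- data.frame(id = 1:12,
                  grp = rep(c("a", "b", "c"), times = c(6, 4, 2)))

per_group <- 2  # number of rows to draw from each group

# Split the row numbers by group, then sample within each group.
# Indexing with sample.int() avoids the surprise that sample(x, ...)
# treats a single number x as the range 1:x.
rows_by_group <- split(seq_len(nrow(toy)), toy$grp)
picked <- lapply(rows_by_group,
                 function(idx) idx[sample.int(length(idx), per_group)])

toy_sample <- toy[unlist(picked), ]
table(toy_sample$grp)
```

Every group contributes the same number of rows regardless of its size in the input, which is the property a fixed `size` is meant to provide.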
stratified(indt, group, size, select = NULL, replace = FALSE, keep.rownames = FALSE, bothSets = FALSE, ...)
indt |
The input |
group |
The column or columns that should be used to create the groups. Can be a character vector of column names (recommended) or a numeric vector of column positions. If you use more than one variable to create your "strata", list them in order from slowest varying to quickest varying. |
size |
The desired sample size.
|
select |
A named list containing levels from the "group" variables in which you are interested. The list names must be present as variable names for the input dataset. |
replace |
Logical. Should sampling be with replacement? Defaults to FALSE. |
keep.rownames |
Logical. If the input is a |
bothSets |
Logical. Should both the sampled and non-sampled sets be returned as a list? Defaults to FALSE. |
... |
Optional arguments to |
If bothSets = TRUE, a list of two data.tables; otherwise, a data.table.
Slightly different sizes than requested: Because of how computers handle floating-point arithmetic, and because R uses a "round to even" approach, the size per stratum that results from a proportionate sample may be slightly higher or lower than you might have expected.
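The rounding behaviour mentioned here can be seen directly in base R. The sketch below assumes, for illustration only, that a proportionate per-stratum size is obtained by rounding `n * size`; the group counts are hypothetical.

```r
# R rounds exact halves to the nearest even integer
round(c(0.5, 1.5, 2.5, 3.5))
# → 0 2 2 4

# Hypothetical group counts (not taken from any real data set)
group_n <- c(g1 = 5, g2 = 15, g3 = 25, g4 = 35)
size <- 0.1

# Assumed calculation of the per-stratum sample size
per_group <- round(group_n * size)
per_group
```

Groups whose exact share falls on or near a half can therefore be rounded in different directions, so the per-stratum counts may not match a hand calculation.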
Ananda Mahto
strata from the "sampling" package; sample_n and sample_frac from "dplyr".
# Generate a sample data.frame to play with
set.seed(1)
dat1 <- data.frame(ID = 1:100,
                   A = sample(c("AA", "BB", "CC", "DD", "EE"),
                              100, replace = TRUE),
                   B = rnorm(100), C = abs(round(rnorm(100), digits = 1)),
                   D = sample(c("CA", "NY", "TX"), 100, replace = TRUE),
                   E = sample(c("M", "F"), 100, replace = TRUE))
# Let's take a 10% sample from all -A- groups in dat1
stratified(dat1, "A", .1)
# Let's take a 10% sample from only "AA" and "BB" groups from -A- in dat1
stratified(dat1, "A", .1, select = list(A = c("AA", "BB")))
# Let's take 5 samples from all -D- groups in dat1,
# specified by column number
stratified(dat1, group = 5, size = 5)
# Use a two-column strata: -E- and -D-
# -E- varies more slowly, so it is better to put that first
stratified(dat1, c("E", "D"), size = .15)
# Use a two-column strata (-E- and -D-) but only interested in
# cases where -E- == "M"
stratified(dat1, c("E", "D"), .15, select = list(E = "M"))
# As above, but where -E- == "M" and -D- == "CA" or "TX"
stratified(dat1, c("E", "D"), .15,
           select = list(E = "M", D = c("CA", "TX")))
# Use a three-column strata: -E-, -D-, and -A-
s.out <- stratified(dat1, c("E", "D", "A"), size = 2)
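
To keep the rows that were not sampled as well, `bothSets = TRUE` can be used. The sketch below needs the splitstackshape package but is otherwise self-contained: it builds a reduced version of the example data (`dat2` is a name introduced here) and accesses the result by position, on the assumption that the first list element holds the sampled rows and the second holds the remainder.

```r
library(splitstackshape)

set.seed(1)
# Reduced version of the example data; dat2 is a hypothetical name
dat2 <- data.frame(ID = 1:100,
                   A = sample(c("AA", "BB", "CC", "DD", "EE"),
                              100, replace = TRUE),
                   D = sample(c("CA", "NY", "TX"), 100, replace = TRUE))

# Take 2 rows from every -A- group and keep the non-sampled rows too
both <- stratified(dat2, "A", size = 2, bothSets = TRUE)

sampled <- both[[1]]    # assumed: the sampled rows
remainder <- both[[2]]  # assumed: everything that was not sampled

# Check the number of rows drawn per group, and that no rows were lost
table(sampled$A)
nrow(sampled) + nrow(remainder)
```

Splitting the input this way is one option for producing, for example, a stratified training set and a matching hold-out set from a single call.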