| hard_group_by {disk.frame} | R Documentation |
A hard_group_by is a group_by that also reorganizes the chunks to ensure that every unique combination of the 'by' variables is in the same chunk. In other words, every row that shares the same 'by' values will end up in the same chunk.
hard_group_by(df, ..., .add = FALSE, .drop = FALSE)
## S3 method for class 'data.frame'
hard_group_by(df, ..., .add = FALSE, .drop = FALSE)
## S3 method for class 'disk.frame'
hard_group_by(
df,
...,
outdir = tempfile("tmp_disk_frame_hard_group_by"),
nchunks = disk.frame::nchunks(df),
overwrite = TRUE,
shardby_function = "hash",
sort_splits = NULL,
desc_vars = NULL,
sort_split_sample_size = 100
)
df |
a disk.frame |
... |
grouping variables |
.add |
same as dplyr::group_by |
.drop |
same as dplyr::group_by |
outdir |
the output directory |
nchunks |
The number of chunks in the output. Defaults to nchunks.disk.frame(df) |
overwrite |
whether to overwrite the output directory |
shardby_function |
splitting of chunks: "hash" for hash function or "sort" for semi-sorted chunks |
sort_splits |
for the "sort" shardby function, a dataframe with the split values. |
desc_vars |
for the "sort" shardby function, the variables to sort descending. |
sort_split_sample_size |
for the "sort" shardby function, if sort_splits is null, the number of rows to sample per chunk for random splits. |
iris.df = as.disk.frame(iris, nchunks = 2)
# group_by iris.df by Species and ensure rows with the same Species are in the same chunk
iris_hard.df = hard_group_by(iris.df, Species)
get_chunk(iris_hard.df, 1)
get_chunk(iris_hard.df, 2)
# clean up
delete(iris.df)
delete(iris_hard.df)