hard_group_by {disk.frame}	R Documentation

Perform a hard group by

Description

A hard_group_by is a group by that also reorganizes the chunks to ensure that every unique combination of the 'by' variables is in the same chunk. In other words, every row that shares the same 'by' values will end up in the same chunk.

Usage

hard_group_by(df, ..., .add = FALSE, .drop = FALSE)

## S3 method for class 'data.frame'
hard_group_by(df, ..., .add = FALSE, .drop = FALSE)

## S3 method for class 'disk.frame'
hard_group_by(
  df,
  ...,
  outdir = tempfile("tmp_disk_frame_hard_group_by"),
  nchunks = disk.frame::nchunks(df),
  overwrite = TRUE,
  shardby_function = "hash",
  sort_splits = NULL,
  desc_vars = NULL,
  sort_split_sample_size = 100
)

Arguments

df

a disk.frame

...

grouping variables

.add

same as dplyr::group_by

.drop

same as dplyr::group_by

outdir

the output directory

nchunks

the number of chunks in the output. Defaults to nchunks.disk.frame(df)

overwrite

whether to overwrite the output directory

shardby_function

how chunks are split: "hash" uses a hash function, while "sort" produces semi-sorted chunks

sort_splits

for the "sort" shardby function, a dataframe with the split values.

desc_vars

for the "sort" shardby function, the variables to sort descending.

sort_split_sample_size

for the "sort" shardby function, if sort_splits is null, the number of rows to sample per chunk for random splits.

Examples

iris.df = as.disk.frame(iris, nchunks = 2)

# group_by iris.df by Species and ensure rows with the same Species are in the same chunk
iris_hard.df = hard_group_by(iris.df, Species)

get_chunk(iris_hard.df, 1)
get_chunk(iris_hard.df, 2)

# clean up iris.df and iris_hard.df
delete(iris.df)
delete(iris_hard.df)
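
# A hedged sketch of sort-based sharding, assuming the arguments behave as
# documented above: shardby_function = "sort" places rows into semi-sorted
# chunks, sampling rows per chunk to pick split points when sort_splits is NULL
iris.df = as.disk.frame(iris, nchunks = 2)
iris_sorted.df = hard_group_by(
  iris.df,
  Species,
  shardby_function = "sort",
  sort_split_sample_size = 50
)
get_chunk(iris_sorted.df, 1)

# clean up
delete(iris.df)
delete(iris_sorted.df)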

[Package disk.frame version 0.3.7 Index]