catboost.load_pool {catboost}    R Documentation

Create a dataset

Description

Create a dataset from the given file, matrix or data.frame.

Usage

catboost.load_pool(data, label = NULL, cat_features = NULL,
  column_description = NULL, pairs = NULL, delimiter = "\t",
  has_header = FALSE, weight = NULL, group_id = NULL,
  group_weight = NULL, subgroup_id = NULL, pairs_weight = NULL,
  baseline = NULL, feature_names = NULL, thread_count = -1)

Arguments

data

A file path, matrix or data.frame with features. The following column types are supported:

  • double

  • factor. Columns of this type are assumed to hold categorical features. The standard CatBoost processing procedure is applied to them:

    1. The values are converted to strings.

    2. The ConvertCatFeatureToFloat function is applied to each resulting string.

Default value: Required argument
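
Because only double and factor columns are listed as supported, it may help to coerce other column types before building the pool. The following is a minimal sketch in base R, not part of the package: the data frame `raw` and its column names are hypothetical, and it assumes that character columns are meant to be categorical and integer columns numeric.

```r
# Hypothetical raw data: an integer column and a character column.
raw <- data.frame(
  age = c(25L, 40L, 31L),
  city = c("Oslo", "Lima", "Oslo"),
  stringsAsFactors = FALSE
)

# Coerce every column to one of the two listed types.
prepared <- raw
for (name in names(prepared)) {
  column <- prepared[[name]]
  if (is.character(column)) {
    prepared[[name]] <- factor(column)      # treated as a categorical feature
  } else if (is.numeric(column)) {
    prepared[[name]] <- as.double(column)   # treated as a numeric feature
  }
}

str(prepared)
```

The resulting `prepared` data frame could then be passed as the data argument.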

label

The label vector.

cat_features

A vector of categorical feature indices. The indices are zero-based and can differ from those given in the column descriptions file.
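
Since R positions are one-based while these indices are zero-based, deriving them from column names can avoid off-by-one mistakes. A minimal base R sketch follows; the data frame `features` and the vector `cat_names` are hypothetical names introduced only for illustration.

```r
# Hypothetical feature table; only the column order matters here.
features <- data.frame(
  value = c(0.1, 0.7, 0.3),
  category = factor(c("A", "B", "A")),
  score = c(1.5, 2.5, 3.5)
)

# Names of the columns to treat as categorical (an assumption of this sketch).
cat_names <- c("category")

# match() returns one-based positions; subtract 1 to get zero-based indices.
cat_idx <- match(cat_names, names(features)) - 1L

print(cat_idx)
```

Here `cat_idx` is the kind of vector that could be supplied as cat_features.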

column_description

The path to the input file that contains the column descriptions.

pairs

A file path, matrix or data.frame that contains the pair descriptions. The shape should be Nx2, where N is the number of pairs. The first element of each pair is the index of the winner document in the training set; the second element is the index of the loser document in the training set.
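
As an illustration of the Nx2 shape, the sketch below builds a pairs matrix in base R from hypothetical relevance scores, pairing each document with every less relevant one. The vector `relevance` and the rule for choosing winners are assumptions made for this example. The text above does not state whether document indices are zero- or one-based, so the sketch keeps R's one-based positions and marks where a shift would go.

```r
# Hypothetical relevance scores for four documents in one group.
relevance <- c(3, 1, 2, 1)

# Enumerate every ordered combination of document positions.
candidates <- expand.grid(winner = seq_along(relevance),
                          loser = seq_along(relevance))

# Keep only combinations where the winner is strictly more relevant.
keep <- relevance[candidates$winner] > relevance[candidates$loser]
pairs_matrix <- as.matrix(candidates[keep, ])
rownames(pairs_matrix) <- NULL  # drop the leftover data.frame row names

# Positions are one-based here, as is usual in R; use pairs_matrix - 1
# instead if zero-based indices are expected.
print(dim(pairs_matrix))
```

Each row holds one pair, with the winner in the first column and the loser in the second.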

delimiter

Delimiter character to use to separate features in a file.

has_header

If set to TRUE, the column names are read from the first line of the file.

weight

The weights of the objects.

group_id

The group ids of the objects.

group_weight

The group weight of the objects.

subgroup_id

The subgroup ids of the objects.

pairs_weight

The weights of the pairs.

baseline

Vector of initial (raw) values of the objective function. Used in the calculation of final values of trees.

feature_names

A list of names for each feature in the dataset.

thread_count

The number of threads to use while reading the data. Optimizes reading time. This parameter doesn't affect results. If -1, then the number of threads is set to the number of CPU cores.

Value

catboost.Pool

Examples

# From file
pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")
pool <- catboost.load_pool(pool_path, column_description = cd_path)
head(pool)

# From matrix
target <- 1
data_matrix <- matrix(runif(18), 6, 3)
pool <- catboost.load_pool(data_matrix[, -target], label = data_matrix[, target])
head(pool)

# From data.frame
nonsense <- c('A', 'B', 'C')
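# Make the categorical column an explicit factor: since R 4.0, data.frame()
# no longer converts strings to factors by default.
data_frame <- data.frame(value = runif(10),
                         category = factor(nonsense[(1:10) %% 3 + 1]))
label <- (1:10) %% 2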
# The indices are zero-based, so the second column (category) has index 1.
pool <- catboost.load_pool(data_frame, label = label, cat_features = c(1))
head(pool)


[Package catboost version 0.17.2 Index]