| catboost.train {catboost} | R Documentation |
Train the model using a CatBoost dataset.
catboost.train(learn_pool, test_pool = NULL, params = list())
learn_pool |
The dataset used for training the model. Default value: Required argument |
test_pool |
The dataset used for testing the quality of the model. Default value: NULL (not used) |
params |
The list of parameters to start training with. If omitted, default values are used (see The list of parameters). If set, the passed list of parameters overrides the default values. Default value: list() (use default values) |
The list of parameters
Common parameters
fold_permutation_block
Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation.
Default value:
Depends on the dataset size and ranges from 1 to 256 inclusive
ignored_features
Identifiers of features to exclude from training. Non-negative indices that do not match any feature are silently ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to 42, the nonexistent feature is ignored without an error.
The identifier corresponds to the feature's index.
Feature indices used in training and in feature importance are numbered from 0 to featureCount-1.
If a file is used as input data then any non-feature column types are ignored when calculating these
indices. For example, each row in the input file contains data in the following order:
"categorical feature<\t>label<\t>numerical feature". So for the row "rock<\t>0<\t>42",
the identifier for the "rock" feature is 0, and for the "42" feature it is 1.
The identifiers of features to exclude should be listed in a vector.
For example, if training should exclude features with the identifiers 1, 2, 7, 42, 43, 44, 45, the value of this parameter should be set to c(1,2,7,42,43,44,45).
Default value:
None (use all features)
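Because R vectors are 1-based while the feature identifiers above are 0-based, translating feature names into identifiers is an easy place to slip. The following is a minimal sketch in base R; the data frame and the helper `feature_ids` are hypothetical and assume the table holds feature columns only (no label column).

```r
# Hypothetical feature table; the label column is assumed to be stored elsewhere.
features <- data.frame(
  age = c(25, 40),
  city = c("a", "b"),
  income = c(1.5, 2.5),
  zip = c("1", "2")
)

# Hypothetical helper: map feature names to 0-based feature identifiers.
feature_ids <- function(data, feature_names) {
  pos <- match(feature_names, colnames(data))
  if (anyNA(pos)) {
    stop("unknown feature: ", paste(feature_names[is.na(pos)], collapse = ", "))
  }
  pos - 1L  # R positions start at 1, feature identifiers start at 0
}

fit_params <- list(ignored_features = feature_ids(features, c("city", "zip")))
print(fit_params$ignored_features)
```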
use_best_model
If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:
Build the number of trees defined by the training parameters.
Identify the iteration with the optimal loss function value.
No trees are saved after this iteration.
This option requires a test dataset to be provided.
Default value:
FALSE (not used)
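The selection rule described for use_best_model can be illustrated with plain R. This is only a sketch of the idea, not the library's implementation; the loss values are made up and stand for the test-set loss after each tree.

```r
# Made-up test-set loss after each built tree (lower is better).
test_loss <- c(0.69, 0.52, 0.41, 0.38, 0.39, 0.43)

# Iteration with the optimal loss function value.
best_iteration <- which.min(test_loss)

# Trees built after the best iteration are not kept in the resulting model.
trees_kept <- seq_len(best_iteration)
print(best_iteration)
print(length(trees_kept))
```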
loss_function
The loss function (see https://catboost.ai/docs/concepts/loss-functions.html#loss-functions) to use in training. The specified value also determines the machine learning problem to solve.
Format:
<Loss function 1>[:<parameter 1>=<value>:..<parameter N>=<value>:]
Supported loss functions:
'Logloss'
'CrossEntropy'
'MultiClass'
'MultiClassOneVsAll'
'RMSE'
'MAE'
'Quantile'
'LogLinQuantile'
'MAPE'
'Poisson'
'Lq'
'PairLogit'
'PairLogitPairwise'
'YetiRank'
'YetiRankPairwise'
'QueryCrossEntropy'
'QueryRMSE'
'QuerySoftMax'
Supported parameters:
alpha - The coefficient used in quantile-based losses ('Quantile' and 'LogLinQuantile'). The default value is 0.5.
For example, if you need to calculate the value of Quantile with the coefficient α = 0.1, use the following construction:
'Quantile:alpha=0.1'
Default value:
'RMSE'
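The loss function string can also be assembled programmatically. The helper below, `loss_with_params`, is a hypothetical convenience written for this illustration and is not part of the catboost package; it simply follows the format shown above.

```r
# Hypothetical helper: build "<Loss function>:<parameter>=<value>" strings.
loss_with_params <- function(name, ...) {
  args <- list(...)
  if (length(args) == 0) {
    return(name)
  }
  paste0(name, ":", paste0(names(args), "=", unlist(args), collapse = ":"))
}

fit_params <- list(loss_function = loss_with_params("Quantile", alpha = 0.1))
print(fit_params$loss_function)
```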
custom_loss
Loss function (see https://catboost.ai/docs/concepts/loss-functions.html#loss-functions) values to output during training. These functions are not used for optimization and are displayed for informational purposes only.
Format:
c(<Loss function 1>[:<parameter>=<value>],<Loss function 2>[:<parameter>=<value>],...,<Loss function N>[:<parameter>=<value>])
Supported loss functions:
'Logloss'
'CrossEntropy'
'Precision'
'Recall'
'F1'
'BalancedAccuracy'
'BalancedErrorRate'
'MCC'
'Accuracy'
'CtrFactor'
'AUC'
'BrierScore'
'HingeLoss'
'HammingLoss'
'ZeroOneLoss'
'Kappa'
'WKappa'
'LogLikelihoodOfPrediction'
'MultiClass'
'MultiClassOneVsAll'
'TotalF1'
'MAE'
'MAPE'
'Poisson'
'Quantile'
'RMSE'
'LogLinQuantile'
'Lq'
'NumErrors'
'SMAPE'
'R2'
'MSLE'
'MedianAbsoluteError'
'PairLogit'
'PairLogitPairwise'
'PairAccuracy'
'QueryCrossEntropy'
'QueryRMSE'
'QuerySoftMax'
'PFound'
'NDCG'
'AverageGain'
'PrecisionAt'
'RecallAt'
'MAP'
Supported parameters:
alpha - The coefficient used in quantile-based losses ('Quantile' and 'LogLinQuantile'). The default value is 0.5.
For example, if you need to calculate the value of CrossEntropy and Quantile with the coefficient α = 0.1, use the following construction:
c('CrossEntropy', 'Quantile:alpha=0.1')
To output a single function, use c('CrossEntropy') or simply 'CrossEntropy'.
Values of all custom loss functions for learning and test datasets are saved to the Loss function (see https://catboost.ai/docs/concepts/output-data_loss-function.html#output-data_loss-function) output files (learn_error.tsv and test_error.tsv respectively). The directory for these files is specified in the train_dir parameter.
Default value:
None (use one of the loss functions supported by the library)
eval_metric
The metric used for overfitting detection (if enabled) and best model selection (if enabled).
Supported loss functions:
'Logloss'
'CrossEntropy'
'Precision'
'Recall'
'F1'
'BalancedAccuracy'
'BalancedErrorRate'
'MCC'
'Accuracy'
'CtrFactor'
'AUC'
'BrierScore'
'HingeLoss'
'HammingLoss'
'ZeroOneLoss'
'Kappa'
'WKappa'
'LogLikelihoodOfPrediction'
'MultiClass'
'MultiClassOneVsAll'
'TotalF1'
'MAE'
'MAPE'
'Poisson'
'Quantile'
'RMSE'
'LogLinQuantile'
'Lq'
'NumErrors'
'SMAPE'
'R2'
'MSLE'
'MedianAbsoluteError'
'PairLogit'
'PairLogitPairwise'
'PairAccuracy'
'QueryCrossEntropy'
'QueryRMSE'
'QuerySoftMax'
'PFound'
'NDCG'
'AverageGain'
'PrecisionAt'
'RecallAt'
'MAP'
Format:
metric_name:param=Value
Examples:
'R2'
'Quantile:alpha=0.3'
Default value:
Optimized objective is used
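The three metric-related parameters are easy to confuse, so here is a small sketch of how they might sit together in one parameter list. The particular combination of metrics is only an illustration for a regression task, not a recommendation.

```r
fit_params <- list(
  # Optimized during training; also determines the problem type (regression here).
  loss_function = 'RMSE',
  # Reported during training for information only, never optimized.
  custom_loss = c('MAE', 'Quantile:alpha=0.3'),
  # Drives overfitting detection and best model selection, if those are enabled.
  eval_metric = 'R2'
)
str(fit_params)
```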
iterations
The maximum number of trees that can be built when solving machine learning problems.
When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter.
Default value:
1000
border
The target border. If the target value is strictly greater than this threshold, the object is assigned to the positive class. Otherwise it is assigned to the negative class.
The parameter is required if the Logloss function is used, since it uses the border to transform any given target to a binary target.
Used in binary classification.
Default value:
0.5
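The effect of the border on a numeric target can be reproduced in base R. This sketch only mirrors the rule stated above (strictly greater means positive); the target values are made up.

```r
target <- c(0.2, 0.5, 0.51, 0.9)  # made-up target values
border <- 0.5

# Strictly greater than the border -> positive class (1), otherwise negative (0).
binary_target <- as.integer(target > border)
print(binary_target)
```

Note that a target exactly equal to the border falls into the negative class.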
leaf_estimation_iterations
The number of gradient steps when calculating the values in leaves.
Default value:
1
depth
Depth of the tree.
The value can be any integer up to 16. It is recommended to use values in the range [1; 10].
Default value:
6
learning_rate
The learning rate.
Used for reducing the gradient step.
Default value:
0.03
rsm
Random subspace method. The fraction of features to use at each iteration of building trees. At each iteration, features are selected again at random.
The value must be in the range [0;1].
Default value:
1
random_seed
The random seed used for training.
Default value:
0
nan_mode
Way to process missing values.
Possible values:
'Min'
'Max'
'Forbidden'
Default value:
'Min'
od_pval
Use the Overfitting detector (see https://catboost.ai/docs/concepts/overfitting-detector.html#overfitting-detector) to stop training when the threshold is reached. Requires a test dataset.
For best results, it is recommended to set a value in the range [10^-10; 10^-2].
The larger the value, the earlier overfitting is detected.
Default value:
The overfitting detection is turned off
od_type
The type of the overfitting detector to use.
Possible values:
IncToDec
Iter
Restriction. Do not specify the overfitting detector threshold (od_pval) when using the Iter type.
Default value:
'IncToDec'
od_wait
The number of iterations to continue the training after the iteration with the optimal loss function value. The purpose of this parameter differs depending on the selected overfitting detector type:
IncToDec - Ignore the overfitting detector when the threshold is reached and continue learning for the specified number of iterations after the iteration with the optimal loss function value.
Iter - Consider the model overfitted and stop training after the specified number of iterations since the iteration with the optimal loss function value.
Default value:
20
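The stopping rule of the Iter detector type can be sketched in a few lines of base R. This is an illustration of the rule as described above, not the library's code; `iter_detector_stop` is a hypothetical name and the loss values are made up.

```r
# Hypothetical sketch: return the iteration at which training would stop.
iter_detector_stop <- function(test_loss, od_wait) {
  best_value <- Inf
  best_iter <- 0L
  for (i in seq_along(test_loss)) {
    if (test_loss[i] < best_value) {
      # New optimum: remember it and keep training.
      best_value <- test_loss[i]
      best_iter <- i
    } else if (i - best_iter >= od_wait) {
      # od_wait iterations have passed since the optimum: stop.
      return(i)
    }
  }
  length(test_loss)  # the detector never fired
}

test_loss <- c(0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50)
print(iter_detector_stop(test_loss, od_wait = 3))

# Matching training parameters (od_pval is deliberately left out for Iter).
fit_params <- list(od_type = 'Iter', od_wait = 3)
```

With a larger od_wait the detector would have survived until the later improvement at iteration 7, which is the trade-off this parameter controls.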
leaf_estimation_method
The method used to calculate the values in leaves.
Possible values:
Newton
Gradient
Default value:
Default value depends on the selected loss function
grow_policy
GPU only. The tree growing policy. It describes how to perform greedy tree construction.
Possible values:
SymmetricTree
Lossguide
Depthwise
Default value:
SymmetricTree
min_data_in_leaf
GPU only. The minimum number of training samples in a leaf. CatBoost does not search for new splits in leaves with fewer than min_data_in_leaf samples. This parameter is used only for the Depthwise and Lossguide growing policies.
Default value:
1
max_leaves
GPU only. The maximum number of leaves in the resulting tree. This parameter is used only for the Lossguide growing policy.
Default value:
31
score_function
GPU only. The score that is used during tree construction to select the next tree split.
Possible values:
L2
Cosine
NewtonL2
NewtonCosine
Default value:
Cosine
For the Lossguide growing policy, the default is NewtonL2.
l2_leaf_reg
L2 regularization coefficient. Used for leaf value calculation.
Any positive values are allowed.
Default value:
3
model_size_reg
Model size regularization coefficient. Controls how strongly the model size influences the choice of tree structure. Increase this coefficient to get a smaller model.
Any positive values are allowed.
Default value:
0.5
has_time
Use the order of objects in the input data (do not perform a random permutation of the dataset at the preprocessing stage)
Default value:
FALSE (not used; permute input dataset)
allow_const_label
Allow a constant label value in the dataset.
Default value:
FALSE
name
The experiment name to display in visualization tools (see https://catboost.ai/docs/features/visualization.html#visualization).
Default value:
experiment
prediction_type
The format for displaying approximated values in output data.
Possible values:
'Probability'
'Class'
'RawFormulaVal'
Default value:
'RawFormulaVal'
fold_len_multiplier
Coefficient for changing the length of folds.
The value must be greater than 1. The best validation result is achieved with minimum values.
With values close to 1 (for example, 1 + ε), each iteration takes a quadratic amount of memory and time for the number of objects in the iteration. Thus, low values are possible only when there is a small number of objects.
Default value:
2
class_weights
Class weights. The values are used as multipliers for the object weights.
For example, for 3 class classification you could use:
c(0.85, 1.2, 1)
Default value:
None (the weight for all classes is set to 1)
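One common way to pick class weights is to weight each class inversely to its frequency. That heuristic is an assumption of this sketch, not something this page prescribes, and the labels are made up.

```r
labels <- c(0, 0, 0, 0, 1, 1, 2, 2)  # made-up class labels

# Inverse-frequency heuristic: the most frequent class gets weight 1.
counts <- table(labels)
class_weights <- as.numeric(max(counts) / counts)
print(class_weights)

fit_params <- list(loss_function = 'MultiClass', class_weights = class_weights)
```

The weights are ordered by class label, matching the order of `names(counts)`.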
classes_count
The upper limit for the numeric class label. Defines the number of classes for multiclassification.
Only non-negative integers can be specified. The given integer should be greater than any of the target values.
If this parameter is specified the labels for all classes in the input dataset should be smaller than the given value.
Default value:
maximum class label + 1
one_hot_max_size
Use one-hot encoding for a categorical feature if the number of different values that it takes does not exceed the specified value. Ctrs are not calculated for such features.
The one-vs.-all delimiter is used for the resulting float features.
Default value:
FALSE (do not convert features to float based on the number of different values)
random_strength
Score standard deviation multiplier.
Default value:
1
bootstrap_type
Bootstrap type. Defines the method for sampling the weights of documents.
Possible values:
'Bayesian'
'Bernoulli'
'Poisson'
'MVS'
'No'
Poisson bootstrap is supported only on GPU.
Default value:
'Bayesian'
bagging_temperature
Controls intensity of Bayesian bagging. The higher the temperature the more aggressive bagging is.
Typical values are in the range [0, 1] (0 is for no bagging).
Possible values are in the range [0, +∞).
Default value:
1
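To build intuition for the temperature, the sketch below assumes that Bayesian bootstrap weights are drawn as (-log(u))^t with u uniform on (0, 1) and t the bagging temperature. That formula is an assumption of this illustration and is not stated on this page; `bayesian_weights` is a hypothetical helper.

```r
set.seed(42)

# Hypothetical helper, assuming weights of the form (-log(u))^t.
bayesian_weights <- function(n, bagging_temperature) {
  (-log(runif(n)))^bagging_temperature
}

w_none <- bayesian_weights(5, 0)  # temperature 0: every weight equals 1
w_unit <- bayesian_weights(5, 1)  # temperature 1: exponentially distributed
w_hot  <- bayesian_weights(5, 3)  # higher temperature: far more spread out

print(w_none)
print(round(w_unit, 3))
print(round(w_hot, 3))
```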
subsample
Sample rate for bagging. This parameter can be used if one of the following bootstrap types is defined:
'Bernoulli'
Default value:
0.66
sampling_unit
The sampling scheme: sample weights for each object individually or for an entire group of objects together.
Possible values:
'Object'
'Group'
Default value:
'Object'
sampling_frequency
Frequency to sample weights and objects when building trees.
Possible values:
'PerTree'
'PerTreeLevel'
Default value:
'PerTreeLevel'
model_shrink_rate
At the start of the i-th iteration (for i > 0), the model is multiplied by (1 - model_shrink_rate / i).
Possible values: [0, 1).
Default value: 0
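The multiplier from the formula above can be tabulated directly. The rate of 0.1 is an arbitrary illustrative value.

```r
model_shrink_rate <- 0.1  # illustrative value in [0, 1)
i <- 1:5                  # iteration numbers, i > 0

# Factor applied to the model at the start of each iteration.
multiplier <- 1 - model_shrink_rate / i
print(round(multiplier, 4))
```

The factor approaches 1 as i grows, so the shrinkage is strongest in the first iterations.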
CTR settings
simple_ctr
Binarization settings for categorical features (see https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html).
Format:
c(CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N])
Components:
CtrType - The CTR type. CTR types for training on CPU:
'Borders'
'Buckets'
'BinarizedTargetMeanValue'
'Counter'
CTR types for training on GPU:
'Borders'
'Buckets'
'FeatureFreq'
'FloatTargetMeanValue'
TargetBorderCount - The number of borders for label value binarization (see https://catboost.ai/docs/concepts/quantization.html). Only used for regression problems. Allowed values are integers from 1 to 255 inclusive. The default value is 1. This option is available for training on CPU only.
TargetBorderType - The binarization type (see https://catboost.ai/docs/concepts/quantization.html) for the label value. Only used for regression problems.
Possible values:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
By default, 'MinEntropy'
This option is available for training on CPU only.
CtrBorderCount - The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusive.
CtrBorderType - The binarization type for categorical features. Supported values for training on CPU:
'Uniform'
Supported values for training on GPU:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
Prior - Priors to use during training (several values can be specified). Possible formats:
One number - Adds the value to the numerator.
Two slash-delimited numbers (for GPU only) - Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.
combinations_ctr
Binarization settings for combinations of categorical features (see https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html).
Format:
c(CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N])
Components:
CtrType - The CTR type. CTR types for training on CPU:
'Borders'
'Buckets'
'BinarizedTargetMeanValue'
'Counter'
CTR types for training on GPU:
'Borders'
'Buckets'
'FeatureFreq'
'FloatTargetMeanValue'
TargetBorderCount - The number of borders for target binarization (see https://catboost.ai/docs/concepts/quantization.html). Only used for regression problems. Allowed values are integers from 1 to 255 inclusive. The default value is 1. This option is available for training on CPU only.
TargetBorderType - The binarization type (see https://catboost.ai/docs/concepts/quantization.html) for the target. Only used for regression problems.
Possible values:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
By default, 'MinEntropy'
This option is available for training on CPU only.
CtrBorderCount - The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusive.
CtrBorderType - The binarization type for categorical features. Supported values for training on CPU:
'Uniform'
Supported values for training on GPU:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
Prior - Priors to use during training (several values can be specified). Possible formats:
One number - Adds the value to the numerator.
Two slash-delimited numbers (for GPU only) - Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.
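Since simple_ctr and combinations_ctr share the same description format, a short sketch of how such strings might be written may help. The component values below are arbitrary illustrations for CPU training, not tuned or recommended settings.

```r
fit_params <- list(
  # CtrType 'Borders', 15 splits for the CTR value, one-number prior of 0.5.
  simple_ctr = c('Borders:CtrBorderCount=15:Prior=0.5'),
  # Two descriptions for feature combinations; 'Counter' with two priors.
  combinations_ctr = c(
    'Borders:TargetBorderCount=2:Prior=1',
    'Counter:Prior=0.5:Prior=1'
  )
)

# Split one description back into its colon-separated components.
components <- strsplit(fit_params$simple_ctr[1], ":", fixed = TRUE)[[1]]
print(components)
```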
ctr_target_border_count
Maximum number of borders used in target binarization for categorical features that need it. If TargetBorderCount is specified in 'simple_ctr', 'combinations_ctr' or 'per_feature_ctr' option it overrides this value.
Default value:
1
counter_calc_method
The method for calculating the Counter CTR type for the test dataset.
Possible values:
'Full'
'FullTest'
'PrefixTest'
'SkipTest'
Default value: 'PrefixTest'
max_ctr_complexity
The maximum number of categorical features that can be combined.
Default value:
4
ctr_leaf_count_limit
The maximum number of leaves with categorical features.
If the number of leaves exceeds the specified limit, some leaves are discarded.
The value must be positive (for zero limit use ignored_features parameter).
The leaves to be discarded are selected as follows:
The leaves are sorted by the frequency of the values.
The top N leaves are selected, where N is the value specified in the parameter.
All leaves starting from N+1 are discarded.
This option reduces the resulting model size and the amount of memory required for training. Note that the resulting quality of the model can be affected.
Default value:
None (The number of leaves with categorical features is not limited)
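The discarding procedure above (sort by frequency, keep the top N) can be mimicked in base R. The leaf names and frequencies are invented for the illustration, and the sketch says nothing about how the library stores leaves internally.

```r
# Made-up frequencies of values for five leaves with categorical features.
leaf_frequency <- c(a = 120, b = 5, c = 48, d = 300, e = 17)
ctr_leaf_count_limit <- 3

# Sort by frequency, keep the top N, discard everything from N+1 onward.
ranked <- names(sort(leaf_frequency, decreasing = TRUE))
kept <- ranked[seq_len(ctr_leaf_count_limit)]
discarded <- setdiff(ranked, kept)

print(kept)
print(discarded)
```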
store_all_simple_ctr
Ignore categorical features, which are not used in feature combinations, when choosing candidates for exclusion.
Use this parameter with ctr_leaf_count_limit only.
Default value:
FALSE (both simple features and feature combinations are taken into account when limiting the number of leaves with categorical features)
Binarization settings
border_count
The number of splits for numerical features. Allowed values are integers from 1 to 255 inclusively.
Default value:
254 for training on CPU or 128 for training on GPU
feature_border_type
The binarization mode (see https://catboost.ai/docs/concepts/quantization.html) for numerical features.
Possible values:
'Median'
'Uniform'
'UniformAndQuantiles'
'MaxLogSum'
'MinEntropy'
'GreedyLogSum'
Default value:
'MinEntropy'
Performance settings
thread_count
The number of threads to use during training.
Allows you to optimize the speed of execution. This parameter doesn't affect results.
Default value:
The number of CPU cores.
Output settings
logging_level
The logging level to output to stdout.
Possible values:
'Silent'
'Verbose'
'Info'
'Debug'
Default value:
'Silent'
metric_period
The frequency of iterations to print the information to stdout. The value should be a positive integer.
Default value:
1
train_dir
The directory for storing the files generated during training.
Default value:
None (current directory)
save_snapshot
Enable snapshotting for restoring the training progress after an interruption.
Default value:
None
snapshot_file
Settings for recovering training after an interruption (see https://catboost.ai/docs/features/snapshots.html).
Depending on whether the file specified exists in the file system:
Missing - write information about training progress to the specified file.
Exists - load data from the specified file and continue training from where it left off.
Default value:
File can't be generated or read. If the value is omitted, the file name is experiment.cbsnapshot.
snapshot_interval
The interval between saving snapshots, in seconds.
Default value:
600
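The three snapshot parameters are typically set together. The sketch below shows one possible combination; the file location is a hypothetical choice, and the final check merely mirrors the Missing/Exists rule described for snapshot_file.

```r
# Hypothetical location; any writable path would do.
snapshot_path <- file.path(tempdir(), "experiment.cbsnapshot")

fit_params <- list(
  save_snapshot = TRUE,
  snapshot_file = snapshot_path,
  snapshot_interval = 600  # seconds between snapshot writes (the default)
)

# An existing file means training resumes; a missing file means a fresh start.
mode <- if (file.exists(snapshot_path)) "resume" else "start"
print(mode)
```

Snapshots are written to disk, so this combination only makes sense while allow_writing_files is left at TRUE.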
allow_writing_files
If this flag is set to FALSE, no files with diagnostic information are created during training. In that case snapshotting is not possible, and visualization does not work either, because it relies on files that are created and updated during training.
Default value:
TRUE
approx_on_full_history
If this flag is set to TRUE, each approximated value is calculated using all the preceding rows in the fold (slower, more accurate). If this flag is set to FALSE, each approximated value is calculated using only the beginning 1/fold_len_multiplier fraction of the fold (faster, slightly less accurate).
Default value:
FALSE
boosting_type
Boosting scheme.
Possible values:
'Ordered' - Gives better quality, but may slow down the training.
'Plain' - The classic gradient boosting scheme. May result in quality degradation, but does not slow down the training.
Default value:
Depends on object count and feature count in train dataset and on learning mode.
dev_score_calc_obj_block_size
CPU only. The size of the block of samples in score calculation. Should be > 0. Used only for learning speed tuning. Changing this parameter can affect results in pairwise scoring mode due to numerical accuracy differences.
Default value:
5000000
dev_efb_max_buckets
CPU only. Maximum bucket count in an exclusive features bundle. Should be an integer between 0 and 65536. Used only for learning speed tuning.
Default value:
1024
sparse_features_conflict_fraction
CPU only. Maximum allowed fraction of conflicting non-default values for features in exclusive features bundle. Should be a real value in [0, 1) interval.
Default value:
0.0
leaf_estimation_backtracking
Type of backtracking during gradient descent.
Possible values:
'No' - Never backtrack. Supported on CPU and GPU.
'AnyImprovement' - Reduce the descent step until the value of the loss function is less than before the step. Supported on CPU and GPU.
'Armijo' - Reduce the descent step until the Armijo condition is satisfied. Supported on GPU only.
Default value:
'AnyImprovement'
https://catboost.ai/docs/concepts/r-reference_catboost-train.html
library(catboost)

# Locate the sample datasets and the column description shipped with the package.
train_pool_path <- system.file("extdata", "adult_train.1000", package = "catboost")
test_pool_path <- system.file("extdata", "adult_test.1000", package = "catboost")
cd_path <- system.file("extdata", "adult.cd", package = "catboost")

# Load the training and test pools.
train_pool <- catboost.load_pool(train_pool_path, column_description = cd_path)
test_pool <- catboost.load_pool(test_pool_path, column_description = cd_path)

# Training parameters; anything not listed keeps its default value.
fit_params <- list(
  iterations = 100,
  loss_function = 'Logloss',
  ignored_features = c(4, 9),
  border_count = 32,
  depth = 5,
  learning_rate = 0.03,
  l2_leaf_reg = 3.5,
  train_dir = 'train_dir')

# Train the model; the test pool is used to evaluate its quality.
model <- catboost.train(train_pool, test_pool, fit_params)