
【ML】How to Use CatBoost

Published on 2024/08/10

1. CatBoost

CatBoost is a gradient boosting (GBDT) algorithm built on decision trees.

1.1 Advantages

Here are the advantages of CatBoost. In summary, it can handle categorical features as-is and is more robust than other GBDT implementations (e.g., LightGBM, XGBoost).

  1. Native Handling of Categorical Features
    CatBoost can process categorical data directly without preprocessing such as one-hot encoding, making it highly efficient for datasets with many categorical variables (see the sketch after this list).

  2. Reduced Overfitting
    CatBoost’s ordered boosting technique helps in reducing overfitting, leading to better generalization, especially on small datasets.

  3. Less Need for Hyperparameter Tuning
    CatBoost is more robust to hyperparameter settings and often performs well out-of-the-box with minimal tuning.
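
For example, advantages 1 and 2 can be used explicitly as in the sketch below. This is only a minimal illustration with a made-up toy DataFrame (the column names and values are placeholders I chose): categorical columns are declared via cat_features, and ordered boosting is enabled via boosting_type='Ordered'.

import pandas as pd
from catboost import CatBoostClassifier

# Toy data (made up for illustration): one numerical and one categorical column
df_toy = pd.DataFrame({
    'price': [100, 150, 80, 120, 95, 130, 70, 110],
    'color': ['red', 'blue', 'blue', 'green', 'red', 'green', 'blue', 'red'],  # kept as strings
    'label': [1, 0, 0, 1, 1, 0, 0, 1],
})

model = CatBoostClassifier(
    iterations=50,
    boosting_type='Ordered',   # ordered boosting, which helps reduce overfitting
    logging_level='Silent',
)
# Just declare which columns are categorical; no one-hot encoding is needed
model.fit(df_toy[['price', 'color']], df_toy['label'], cat_features=['color'])

Internally, CatBoost encodes the categorical values with (ordered) target statistics, so string columns can be passed as-is.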

1.2 Disadvantages

Conversely, here are the disadvantages of CatBoost.

  1. Slower Training on Large Datasets
    CatBoost can be slower to train on very large datasets compared to LightGBM and XGBoost, especially when the dataset has a large number of numerical features.

  2. Less Mature Ecosystem
    CatBoost's ecosystem is less mature than XGBoost and LightGBM, which means it has fewer integrations, third-party tools, and community support.

  3. Higher Memory Usage
    CatBoost tends to use more memory during training compared to LightGBM and XGBoost, which can be a concern when working with large datasets or on machines with limited resources.

Additionally, all GBDT methods carry a risk of overfitting.
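
As a concrete illustration of reducing that risk, the sketch below uses early stopping on a validation set (the full example later uses the same idea via early_stopping_rounds in the parameters); the dataset and split here are just for demonstration.

from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Stop adding trees once the validation metric stops improving
model = CatBoostClassifier(iterations=1000, logging_level='Silent')
model.fit(
    X_train, y_train,
    eval_set=(X_valid, y_valid),
    early_stopping_rounds=50,  # stop if no improvement for 50 consecutive rounds
    use_best_model=True,       # keep the iteration with the best validation score
)
print(model.get_best_iteration())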

2. Code

From here on, I'll explain with code.

2.1 Steps

  1. Create Dataset
    Prepare the train and validation datasets (+ test dataset).
  2. Set Parameters
    Set the training parameters and the configuration of the model and evaluation.
  3. Train
    Train the model with some hyperparameters. Please try optimizing the hyperparameters with Optuna (automatic) or wandb (manual); a minimal Optuna sketch is shown right after this list.
  4. Predict and Postprocess
    Predict with model.predict_proba(X_test). This time, the same dataset is used for validation and testing, but typically they should be separated (e.g., via a fold split).
    After prediction, calculate the score.
  5. (Option) Show the Importance
    GBDT models can show feature importance. This indicates which features are important, and we can use this information for other models such as NNs. This is very useful, so I explain the settings in more detail below.
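
Before the main example, here is a minimal reference sketch for the hyperparameter optimization mentioned in step 3, using Optuna. The tuned parameters, search ranges, and number of trials are arbitrary choices for illustration, not recommended values.

import optuna
from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

def objective(trial):
    # Search space: arbitrary illustrative ranges
    params = {
        'iterations': 200,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'depth': trial.suggest_int('depth', 4, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1.0, 10.0),
        'random_seed': 42,
        'logging_level': 'Silent',
    }
    model = CatBoostClassifier(**params)
    model.fit(X_train, y_train, eval_set=(X_valid, y_valid), early_stopping_rounds=20)
    return accuracy_score(y_valid, model.predict(X_valid))

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print(study.best_params)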

Now, let's look at the full example code.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool, cv

# Load the breast cancer dataset
data = load_breast_cancer()

X = data.data
y = data.target

# Feature names (optional, useful for plotting and interpretation)
feature_names = data.feature_names

df = pd.DataFrame(X, columns=feature_names)
df['target'] = y
display(df.head())

# Split data
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['target']), df['target'], test_size=0.2, random_state=42)

# Set parameters for CatBoost
params = {
    'iterations': 100,             # Number of boosting rounds
    'learning_rate': 0.1,          # Learning rate
    'depth': 6,                    # Depth of each tree
    'loss_function': 'Logloss',    # Loss function for binary classification
    'eval_metric': 'Accuracy',     # Evaluation metric
    'random_seed': 42,             # Random seed for reproducibility
    'logging_level': 'Silent',     # Logging level
    'early_stopping_rounds': 10,   # Early stopping rounds
}

# Convert the data to CatBoost Pool format
train_pool = Pool(X_train, y_train)
valid_pool = Pool(X_test, y_test)

# Initialize and train the model
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=valid_pool, verbose=False, plot=True)

# Predict probabilities
y_pred_prob = model.predict_proba(X_test)[:, 1]
# Convert probabilities to binary outputs
y_pred = [1 if prob > 0.5 else 0 for prob in y_pred_prob]

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy on test set: {accuracy * 100:.2f}%')


# Plot feature importance
pvc_importances = model.get_feature_importance(type='PredictionValuesChange') # default type
sorted_indices = np.argsort(pvc_importances)
plt.figure(figsize=(10, 6))
plt.barh(np.array(feature_names)[sorted_indices], pvc_importances[sorted_indices])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Sorted PredictionValuesChange (PVC) Feature Importance')
plt.show()


# shap_values will have a shape (n_samples, n_features + 1), where the last column is the bias term
shap_values = model.get_feature_importance(type='ShapValues')
shap_importance = np.mean(np.abs(shap_values[:, :-1]), axis=0)
sorted_indices = np.argsort(shap_importance)
plt.figure(figsize=(10, 6))
plt.barh(np.array(feature_names)[sorted_indices], shap_importance[sorted_indices])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Sorted Shapley Values Feature Importance')
plt.show()


# Optionally, perform cross-validation
cv_data = cv(
    params=params,
    pool=train_pool,
    fold_count=5,
    plot=True
)

# Display cross-validation results
print(f"CV mean accuracy: {cv_data['test-Accuracy-mean'].max():.2f}%")

・Result

  • table
  • log
Accuracy on test set: 96.49%
  • Importance
    CatBoost has two options for displaying feature importance: "PVC" (PredictionValuesChange) and "Shapley". A small sketch of reusing these importance results for feature selection follows this result section.

    • PVC calculates the change in the model's prediction when a feature is excluded. It essentially measures how much a particular feature contributes to the model's output in terms of the difference in predictions with and without that feature. It is computed over all instances in the dataset, and the importance is averaged to give a global view of feature importance.
      Higher PVC values indicate that the feature significantly changes the model's prediction when included, suggesting it's an important feature.
    • Shapley values provide a detailed and fair attribution of the model's output to each feature, considering all possible combinations of features. They take into account the interaction between features by assessing the marginal contribution of each feature across all possible subsets. Shapley values are computationally expensive because they require evaluating the model’s predictions over all possible feature subsets. This complexity makes them slower to calculate, especially for large datasets or models with many features.


  • Logloss/Accuracy

  • Logloss/Accuracy (CV)
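
As mentioned in step 5, one simple way to reuse the importance results is feature selection, for example keeping only the top-k features as input for another model. The sketch below assumes model, feature_names, and df from the example code above; top_k = 10 is an arbitrary choice.

import numpy as np

# Rank features by PredictionValuesChange importance and keep the top 10
importances = model.get_feature_importance(type='PredictionValuesChange')
top_k = 10  # arbitrary choice for illustration
top_indices = np.argsort(importances)[::-1][:top_k]
top_features = np.array(feature_names)[top_indices]
print(top_features)

# Reduced feature set that can be fed into another model (e.g., an NN)
X_selected = df[top_features]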

3. Summary

This time, I explained CatBoost. I'm planning to write articles about other GBDT methods such as LightGBM and XGBoost. I'd be happy if you read those too.
