🐆

【ML Method】Gradient Boosting (GB) Explained

Published on 2024/07/31

This time, I'll explain gradient boosting. I also explained AdaBoost in a previous article, so please check it out if you haven't read it yet.

1. Gradient Boosting

・Gradient Boosting
Gradient Boosting is an ensemble machine learning technique used for regression and classification tasks. It builds a model in a stage-wise fashion from weak learners, typically decision trees, and combines them to form a strong predictor.

2. How Gradient Boosting Works

2.1 Initialization

The workflow starts with initialization, because a gradient boosting model always builds on the previous model's predictions.
For a regression problem, the initial prediction could be the mean value of the target variable. For a classification problem, it could be the log odds (the logarithm of the probability divided by one minus the probability).
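As a small illustration, here is a minimal sketch of the initialization step in Python (my own example values, not a specific library's API):

import numpy as np

# Initialization: F_0 is just a constant prediction
y = np.array([3.0, 5.0, 4.0, 8.0])        # regression targets
F0_regression = np.mean(y)                 # 5.0: predict the mean everywhere

y_cls = np.array([1, 0, 1, 1])             # binary classification targets
p = np.mean(y_cls)                         # positive-class ratio (0.75)
F0_classification = np.log(p / (1 - p))    # log odds (about 1.10)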

2.2 Iterative Process

Gradient Boosting proceeds iteratively by adding one weak learner (e.g., a decision tree) at each step. Here’s a breakdown of each iteration.

Please note that there are two models involved:

  1. Main model: constructed from any number of weak learners; this makes the predictions (\hat{y}).
  2. Weak learner: typically a decision tree; it aims to predict the error of the previous model.
Step 1: Compute Residuals (Negative Gradient)

The residual represents the error of the current model.

・Formula
\text{Residuals} = y - \hat{y}
y: true values
\hat{y}: predicted values

More generally, the residuals are the negative gradient of the loss function with respect to the current predictions; for squared-error loss, this reduces to y - \hat{y}.
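For example, with squared-error loss the residuals are simply the element-wise difference between the targets and the current predictions (the values below are my own illustration):

import numpy as np

y = np.array([3.0, 5.0, 4.0, 8.0])     # true values
y_hat = np.full(4, 5.0)                 # current predictions (e.g., F_0 = mean)
residuals = y - y_hat                   # [-2.,  0., -1.,  3.]

# For L = 0.5 * (y - y_hat)**2, the gradient w.r.t. y_hat is -(y - y_hat),
# so the negative gradient equals the residuals y - y_hat.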

Step 2: Train a New Weak Learner

Train a new weak learner to predict the residuals. This new model tries to capture the errors made by the previous model.

# pseudo
h_m = weak_learner.fit(inputs=inputs, targets=residuals)

h_m is the new weak learner.
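As a concrete sketch, assuming scikit-learn's DecisionTreeRegressor as the weak learner (the data values are made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

inputs = np.array([[1.0], [2.0], [3.0], [4.0]])    # feature matrix
residuals = np.array([-2.0, 0.0, -1.0, 3.0])       # y - y_hat from Step 1

# Fit a shallow regression tree to the residuals of the current model
h_m = DecisionTreeRegressor(max_depth=2)
h_m.fit(inputs, residuals)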

Step 3: Update the Model

Update the model by adding the new weak learner. The learning rate \alpha (a small positive value) is used to scale the contribution of the new model to control overfitting.

・Formula
F_m = F_{m-1} + \alpha h_m
F_m: new model
F_{m-1}: previous model
h_m: new weak learner
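In practice, the update is applied to the predictions rather than to a symbolic formula; here is a tiny sketch with made-up numbers:

import numpy as np

alpha = 0.1                                    # learning rate
y_hat_prev = np.full(4, 5.0)                   # F_{m-1}(inputs)
h_m_pred = np.array([-2.0, 0.0, -1.0, 3.0])    # h_m(inputs), trained on residuals

# F_m(inputs) = F_{m-1}(inputs) + alpha * h_m(inputs)
y_hat_new = y_hat_prev + alpha * h_m_pred      # [4.8, 5. , 4.9, 5.3]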

Gradient Boosting will repeat these three steps for a specified number of iterations or until the error reduces sufficiently.

2.3 Learning Rate and Countermeasures to Overfitting

・Learning rate
A smaller learning rate requires more iterations but can lead to better generalization.

・Countermeasures to overfitting
Regularization techniques, such as limiting the depth of the trees, adding randomness (like subsampling the data or features), and using different loss functions, are used to prevent overfitting.

The choice of the loss function depends on the problem:
・Regression: Mean Squared Error (MSE) or Mean Absolute Error (MAE)
・Classification: Log-loss or exponential loss
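As a hedged example of how these knobs typically look in practice, using scikit-learn's GradientBoostingRegressor (the parameter values are arbitrary choices for illustration):

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=200,      # number of boosting iterations M
    learning_rate=0.05,    # alpha: smaller values need more iterations
    max_depth=3,           # limit tree depth (regularization)
    subsample=0.8,         # row subsampling adds randomness
    max_features=0.8,      # feature subsampling adds randomness
    loss="squared_error",  # regression loss; "absolute_error" is also available
)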

2.4 Looking Back at the Whole Process

Let's look back at the whole process with pseudocode.

# pseudo
# Define the initial model F_0 (e.g., always returns mean(y))
F = initial_model

for m in range(1, M + 1):
    # Step 1: compute residuals of the current model
    residuals = y - F(inputs)
    # Step 2: train a weak learner on the residuals
    h = weak_learner.fit(inputs=inputs, targets=residuals)
    # Step 3: update the model with the learning rate alpha
    F = F + alpha * h

final_model = F
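Putting everything together, here is a minimal runnable sketch of the same loop, assuming scikit-learn's DecisionTreeRegressor as the weak learner and squared-error loss (the function names are my own, not a library API):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gb(X, y, M=100, alpha=0.1, max_depth=3):
    # Initialization: F_0 always predicts mean(y)
    f0 = float(np.mean(y))
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        # Step 1: residuals (negative gradient of squared-error loss)
        residuals = y - pred
        # Step 2: train a weak learner on the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Step 3: update the predictions with learning rate alpha
        pred = pred + alpha * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_gb(f0, trees, X, alpha=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + alpha * tree.predict(X)
    return pred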

3. Features

3.1 Advantages

・Accuracy: often yields high performance
・Handles Missing Data: Many implementations can handle missing values without requiring imputation.

3.2 Disadvantages

・Overfitting: If not properly regularized, it can overfit the training data.

3.3 Comparison with AdaBoost

AdaBoost

Advantages:
・Robustness to Overfitting
Disadvantages:
・Sensitive to Noisy Data and Outliers
・Computationally Intensive

AdaBoost builds an ensemble of weak learners whose sample weights are adapted based on the previous model's errors; on the other hand, Gradient Boosting stacks weak learners that are each trained to predict the previous model's errors (residuals).

Impression:
Based on these mechanisms, it seems to me that Gradient Boosting optimizes a model for a single task, while AdaBoost improves performance in a way that stays compatible with a variety of tasks.

These differences in mechanism seem to be reflected in the characteristics of each model.

4. Summary

This time, I explained the mechanism of Gradient Boosting. Gradient Boosting can achieve high accuracy without preprocessing missing values, but it may overfit if it is not properly regularized.

Use case:

  • Gradient Boosting
    When you want high accuracy, your dataset has enough diversity, or it contains many missing values.
  • AdaBoost
    When the dataset is clean (few noisy samples and outliers), you have sufficient computational resources, and you want to prevent overfitting.

Each model has its advantages and disadvantages, so it is important to use them appropriately.

That's all for this article. Thank you for reading.
