
Thinking about LightGBM from First Principles

Published on 2024/05/26

1. LightGBM

LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework that uses tree-based learning algorithms.
It provides efficient, high-performance prediction and is often used in data analysis competitions such as Kaggle.

2. Flow

2.1 Overview

LightGBM adopts gradient boosting, an ensemble learning technique in which multiple models (weak learners) are trained sequentially, each new model aiming to correct the errors of the previous one.
In other words, the final model is a weighted ensemble of all previous models (weak learners), where the weights are determined only by the learning rate.
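
Writing this out, if h_m denotes the m-th weak learner and \alpha the learning rate, the final model after M rounds is

F_M(x) = F_0(x) + \sum_{m=1}^{M} \alpha h_m(x)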

2.2 Initialization

The process starts with an initial model, often a simple one, such as predicting the mean of the target variable for regression tasks or the log-odds for binary classification.
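
As a minimal numpy sketch (the target values here are made up for illustration), the initial prediction can be computed directly from the training targets:

import numpy as np

# Regression: the initial model predicts the mean of the target
y_reg = np.array([3.0, 7.0, 2.0, 6.0])
F0_regression = y_reg.mean()

# Binary classification: the initial model predicts the log-odds of the positive class
y_clf = np.array([0, 1, 1, 1])
p = y_clf.mean()
F0_classification = np.log(p / (1 - p))

print(F0_regression, F0_classification)   # 4.5 and log(3) ≈ 1.0986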

2.3 Iterative Process

In each iteration, a new model (typically a decision tree) is trained to minimize the residual errors.

For example, suppose the initial model is F_0(x), the input x is [4, 4, 4, 4], and the target y is [3, 7, 2, 6].
First, if F_0(x) outputs the input as-is, i.e. [4, 4, 4, 4], the residual errors are [-1, 3, -2, 2]. A new tree model h(x), called a weak learner, is then created and trained to predict these errors using x as input, so that ideally h(x) = [-1, 3, -2, 2] (the exact error targets depend on the loss function).
Second, the new (updated) model is defined as F_1(x) = F_0(x) + \alpha h(x), where \alpha is the learning rate that scales each weak learner's contribution.
Finally, LightGBM obtains the final model by iterating this process.
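
A minimal numpy/scikit-learn sketch of one such iteration with the numbers above (the feature values in X and the learning rate of 0.1 are invented purely so a small tree has something to split on; they are not part of the original example):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Targets and initial predictions from the example above
y  = np.array([3.0, 7.0, 2.0, 6.0])
F0 = np.array([4.0, 4.0, 4.0, 4.0])
residuals = y - F0                      # [-1, 3, -2, 2]

# Toy feature matrix so the weak learner can distinguish the samples
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Weak learner h(x): a small regression tree fit to the residuals
# (with only four distinct inputs it reproduces the residuals exactly)
h = DecisionTreeRegressor().fit(X, residuals)

alpha = 0.1                             # learning rate (example value)
F1 = F0 + alpha * h.predict(X)          # F_1(x) = F_0(x) + alpha * h(x)
print(F1)                               # [3.9 4.3 3.8 4.2]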

2.4 Leaf-wise Growth

LightGBM grows trees leaf-wise, meaning it splits the leaf with the highest loss reduction first. This approach can lead to more complex trees that better fit the data, improving accuracy but potentially increasing the risk of overfitting.
As a result, a single branch can grow much deeper than the rest of the tree.
Max Depth: To control overfitting, LightGBM allows setting a maximum depth for trees.
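
Because trees grow leaf-wise, num_leaves (rather than depth) is the main parameter controlling tree complexity, with max_depth as an additional safeguard; a minimal parameter sketch (the values are arbitrary examples, not recommendations):

params = {
    'num_leaves': 31,   # maximum number of leaves per tree (the main knob for leaf-wise growth)
    'max_depth': 8,     # maximum tree depth, used to limit overfitting
}
# `params` would be passed to lgb.train(params, train_data), as in the full example later in this article.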

2.5 Gradient-based One-Side Sampling (GOSS)

The main idea behind GOSS is to reduce the computational burden of training by selecting a subset of data points that are most informative for the model. This is done by leveraging the gradients, which are indicators of the prediction error for each instance.

  1. Gradients and Error: In gradient boosting, the gradient of the loss function with respect to the model's predictions indicates how much the prediction should be adjusted. Instances with larger gradients are those where the model is making larger errors and thus need more attention.

  2. Two-Step Sampling:
    Retain High-Gradient Instances: GOSS retains all instances with large gradients. These instances are more difficult for the model to predict correctly and thus provide valuable information for improving the model.
    Randomly Sample Low-Gradient Instances: For instances with small gradients (where the model is already performing well), GOSS randomly samples a subset. This reduces the number of easy-to-predict instances, thereby decreasing the dataset size without losing too much information.

  3. Reweighting: After sampling, the instances with low gradients are reweighted to compensate for their reduced representation in the training data. This ensures that the model remains balanced and does not overfit to the high-gradient instances.

  • Example
    Suppose you have a dataset with 10,000 instances. Using GOSS, you might decide to retain 20% of the instances with the highest gradients (2,000 instances) and randomly sample 10% of the remaining instances with low gradients (800 out of 8,000 instances). This would reduce your training set to 2,800 instances, significantly cutting down the computational cost while still maintaining the critical information needed for effective learning.
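
A minimal numpy sketch of this procedure (the ratios match the example above; the simple 1/b reweighting here is illustrative, while LightGBM itself follows the weighting in the original GOSS paper):

import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Pick instance indices and sample weights following the GOSS idea."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))        # sort indices by |gradient|, descending
    top_idx = order[:int(a * n)]                  # 1. keep every high-gradient instance
    rest_idx = order[int(a * n):]                 # low-gradient instances
    sampled_idx = rng.choice(rest_idx, size=int(b * len(rest_idx)), replace=False)  # 2. random sample
    weights = np.ones(len(top_idx) + len(sampled_idx))
    weights[len(top_idx):] = 1.0 / b              # 3. reweight so the sample still represents the rest
    return np.concatenate([top_idx, sampled_idx]), weights

# 10,000 instances -> 2,000 high-gradient + 800 sampled low-gradient = 2,800 used
g = np.random.default_rng(0).normal(size=10_000)
idx, w = goss_sample(g)
print(len(idx))   # 2800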

2.6 Exclusive Feature Bundling (EFB)

The primary goal of EFB is to reduce the number of features in datasets that are high-dimensional and sparse. Sparse data typically contains many features with a majority of zero values, which can be computationally expensive to process. EFB addresses this by bundling mutually exclusive features into a single feature.

  1. High-Dimensional Sparse Data: In datasets with many features, most features might be zero for any given instance. This sparsity can slow down training and increase memory usage.

  2. Mutually Exclusive Features: Features are considered mutually exclusive if they do not take non-zero values simultaneously for any instance. For example, in a dataset of categorical variables, one-hot encoded features representing different categories of the same variable are mutually exclusive because only one category can be active (non-zero) at a time.

  3. Bundling Process:
    Identify Mutually Exclusive Features: EFB first identifies sets of features that are mutually exclusive.
    Bundle Features: These mutually exclusive features are then bundled into a single composite feature. This means that instead of representing each mutually exclusive feature separately, they are combined into one feature. This reduces the total number of features in the dataset.

  4. Improving Computational Efficiency:
    By bundling features, EFB reduces the dimensionality of the dataset, leading to faster computation and lower memory usage during training.
    The bundling process ensures that the information carried by the original features is not lost, as only mutually exclusive features (which do not overlap in their non-zero values) are bundled.

  • Example
    Original Table
Customer ID  Credit Card  PayPal  Bank Transfer
1            1            0       0
2            0            1       0
3            0            0       1
4            1            0       0
5            0            1       0

Bundled Table

Customer ID  Payment Method
1            1
2            2
3            3
4            1
5            2

Consider a dataset with 10,000 features, many of which are sparse and mutually exclusive. Using EFB, you might be able to bundle these into 2,000 composite features. This reduction in the number of features makes the dataset easier to work with, speeding up the training process and reducing the computational resources required.
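
A toy numpy illustration of the bundling above (this only shows why mutually exclusive columns can be merged without losing information; it is not LightGBM's internal EFB algorithm, which works on histogram bins and tolerates a small rate of conflicts):

import numpy as np

# One-hot payment columns from the original table: Credit Card, PayPal, Bank Transfer
onehot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
])

# At most one column is non-zero in each row, so the three columns can be
# bundled into a single feature holding the index (+1) of the active column.
bundled = onehot.argmax(axis=1) + 1
print(bundled)   # [1 2 3 1 2] -> the "Payment Method" column of the bundled table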

2.7 Regularization

LightGBM provides L1 and L2 regularization to avoid overfitting.

2.8 Other Features

  • Histogram-based Decision Tree Learning: LightGBM uses histograms to speed up training.
  • Categorical Feature Handling: It can handle categorical features without the need for one-hot encoding (see the sketch after this list).
  • Advantages of LightGBM:
    Efficiency: It is designed to be fast and memory-efficient.
    Accuracy: It often achieves higher accuracy due to its leaf-wise growth strategy.
    Scalability: It can handle large datasets and high-dimensional data.
    Flexibility: It offers various hyperparameters to tune for different datasets and tasks.
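
For the categorical-feature point above, a minimal sketch with a made-up toy DataFrame (the column names and data are invented for illustration): columns with the pandas category dtype, or those listed in categorical_feature, are split on natively without one-hot encoding.

import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'amount':  rng.random(200),
    'payment': pd.Categorical(rng.choice(['card', 'paypal', 'bank'], size=200)),
})
y = rng.integers(0, 2, size=200)

# 'payment' is used directly as a categorical split candidate, no one-hot encoding needed
train_set = lgb.Dataset(df, label=y, categorical_feature=['payment'])
bst = lgb.train({'objective': 'binary', 'verbose': -1}, train_set, num_boost_round=10)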

3. Example

LightGBM is easy to use because its Python library provides extensive support for training and prediction, so we can implement it in just a few lines of code.

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Prepare the datasets for LightGBM
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Configure the parameters for the LightGBM model
params = {
    'objective': 'binary',              # Objective: binary classification
    'metric': 'binary_logloss',         # Evaluation metric: binary logarithmic loss
    'boosting_type': 'gbdt',            # Boosting type: gradient boosting decision tree
    'num_leaves': 31,                   # Maximum number of leaves per tree
    'learning_rate': 0.05,              # Learning rate for weight adjustment
    'feature_fraction': 0.9,            # Fraction of features used during training
    'bagging_fraction': 0.8,            # Fraction of data to use for each iteration
    'bagging_freq': 5,                  # Bagging frequency
    'lambda_l1': 0.1,                   # L1 regularization term
    'lambda_l2': 0.2,                   # L2 regularization term
    'max_depth': 10                     # Maximum tree depth
}

# Train the LightGBM model with early stopping on the validation set
bst = lgb.train(
    params, 
    train_data, 
    valid_sets=[test_data], 
    num_boost_round=1000,  # Define a high number for training iterations
    callbacks=[lgb.early_stopping(stopping_rounds=10)]
)


# Make predictions on the test set
y_pred = bst.predict(X_test, num_iteration=bst.best_iteration)
y_pred_binary = [1 if x > 0.5 else 0 for x in y_pred]  # Convert probabilities to binary output

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred_binary)
print(f'Accuracy: {accuracy}')

4. Summary

In this article, I explained LightGBM from first principles.
Thank you for reading.
