🌺

# 【XGB】XGBoost Tutorial with the Iris Dataset

Published on 2024/06/26

XGBoost, short for eXtreme Gradient Boosting, is a popular and powerful machine learning algorithm used for both regression and classification tasks.

# 1. How to use

### 1.1 Prepare

If XGBoost is not installed yet:

``````pip install xgboost
``````

・Import the libraries

``````import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
import numpy as np
``````

・Prepare the data

``````# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
``````

Check the data:

``````display(X_train,y_train)
display(X_test,y_test)

# output
# array([[4.6, 3.6, 1. , 0.2],
#        [5.7, 4.4, 1.5, 0.4],
#        [6.7, 3.1, 4.4, 1.4],
#        [4.8, 3.4, 1.6, 0.2],
#        ...
# array([0, 0, 1, 0, 0, 2, 1, ...
``````

### 1.2 Create a DMatrix

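A `DMatrix` is required by the native `xgb.train` API used in this tutorial. It is optional only with the scikit-learn wrapper (for example `xgb.XGBClassifier`), which accepts NumPy arrays directly.

XGBoost uses its own data structure called `DMatrix` for optimized performance, so create a `DMatrix` from X and y: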

``````dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
``````

Benefits of DMatrix:
・It is designed to handle large datasets more efficiently than plain NumPy arrays or pandas DataFrames.
・It prepares the data before training, for example by sorting, binning, and quantizing feature values (depending on the tree method).
・It supports missing values.
・It is optimized for the gradient boosting algorithm, such as calculating gradients and updating weights, and it also supports distributed training.

### 1.3 Parameter Setting

Define the parameters for the XGBoost model:

``````params = {
'max_depth': 3,
'eta': 0.3,
'objective': 'multi:softprob',
'num_class': 3
}
``````

I set them explicitly here, but every parameter has a default value.

The main parameters and their defaults are:

1. eta (learning_rate): Controls the learning rate. Lower values require more boosting rounds. Default is 0.3.
2. max_depth: Maximum depth of a tree. Increasing this value makes the model more complex. Default is 6.
3. min_child_weight: Minimum sum of instance weight (hessian) needed in a child. Default is 1.
4. gamma: Minimum loss reduction required to make a further partition on a leaf node. Default is 0.
5. subsample: Fraction of the training data to be used for each tree. Default is 1.
6. colsample_bytree: Fraction of features to be used for each tree. Default is 1.
7. lambda (reg_lambda): L2 regularization term on weights. Default is 1.
8. alpha (reg_alpha): L1 regularization term on weights. Default is 0.
9. objective: Specifies the learning task. Default is 'reg:squarederror' (regression), so classification tasks must set it explicitly.

So I recommend training with the default values first, and then tuning them manually or with an optimization library such as Optuna once you have a baseline result.

### 1.4 Training

Train the model with `xgb.train`:

``````num_round = 100
bst = xgb.train(params, dtrain, num_round)
``````

### 1.5 Prediction

Predict from the test data:

``````preds = bst.predict(dtest)
print(f"raw_predictions:\n{preds[:5]}...")
best_preds = np.argmax(preds, axis=1)  # index of the most probable class per row
print(f"Extract index of prediction that have highest prediction:\n{best_preds[:5]}...")

# output
# raw_predictions:
# [[2.75319722e-03 9.83841240e-01 1.34055475e-02]
#  [9.86347795e-01 1.29841845e-02 6.67976623e-04]
#  [4.21463046e-04 1.04176207e-03 9.98536825e-01]
#  [2.38073454e-03 9.91847038e-01 5.77215198e-03]
#  [3.01494286e-03 9.57898378e-01 3.90867107e-02]]...
# Extract index of prediction that have highest prediction:
# [1 0 2 1 1]...
``````

### 1.6 Evaluation

Check the accuracy of the model:

``````accuracy = accuracy_score(y_test, best_preds)
print(f"y_test: {y_test}")
print(f"best_preds: {best_preds}")
print(f"Accuracy: {accuracy * 100:.2f}%")

# output
# y_test: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
# best_preds: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
# Accuracy: 100.00%
``````
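Accuracy alone hides which classes get confused with each other. As a complementary check, the sketch below builds a confusion matrix and a per-class report with scikit-learn. To stay self-contained it reuses the `y_test` values printed in the output above and copies them as the predictions, since the two arrays were identical in this run.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Values copied from the output above
y_test = np.array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0,
                   1, 2, 1, 1, 2, 0, 2, 0, 2, 2, 2, 2, 2, 0, 0])
best_preds = y_test.copy()  # predictions matched the labels exactly in this run

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, best_preds)
print(cm)

# Precision, recall, and F1 for each Iris species
print(classification_report(
    y_test, best_preds,
    target_names=["setosa", "versicolor", "virginica"]))
```

All counts fall on the diagonal here (10, 9, and 11 samples per class). On a harder dataset, the off-diagonal cells show which classes the model mixes up.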

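The model classified all 30 test samples correctly, so the accuracy is 100%. This is a great result, although Iris is a small and easy dataset, so expect lower scores on harder problems.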

# 2. Summary

In this article, we walked through a basic XGBoost tutorial. Please try using XGBoost on another task or dataset.