🐎

PyTorch's optimizer explained

Published 2024/05/06

1. What is an optimizer?

PyTorch's optimizer is an object that configures how the model's parameters are updated during training.
From the gradients computed by backpropagating the loss, it determines how much each parameter should change.

2. Definition

An optimizer can be used simply by passing the appropriate arguments.

・Example: SGD (Stochastic Gradient Descent)

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

model.parameters(): all learnable parameters of the model
lr: learning rate
momentum: momentum factor; the previous update is reused to accelerate the current one

This connects the model and the optimizer; a learning-rate scheduler (lr_scheduler) can also be attached to the optimizer.
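
As a minimal sketch, a scheduler such as StepLR can be attached to the optimizer defined above; step_size and gamma here are example values, not recommendations:

from torch.optim import lr_scheduler

scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # multiply lr by 0.1 every 30 epochs

# call scheduler.step() once per epoch, after optimizer.step()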

3. Optimizers

1. torch.optim.SGD (Stochastic Gradient Descent):

This is an optimizer that implements stochastic gradient descent.
Parameter update formula: param = param - learning_rate * gradient
Adding a momentum term smooths the movement of the gradient and helps escape from local minima.
Setting the learning rate is important, and you need to choose an appropriate value depending on the problem.
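
The update formula can be reproduced by hand in a few lines. This is only an illustrative sketch with a toy parameter and loss; in practice you would use optim.SGD as shown above:

import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # a toy parameter
loss = (w ** 2).sum()                             # a toy loss
loss.backward()                                   # w.grad now holds 2 * w

lr = 0.01
with torch.no_grad():
    w -= lr * w.grad                              # param = param - learning_rate * gradient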

2. torch.optim.Adam (Adaptive Moment Estimation):

Optimizer with adaptive learning rate.
It adaptively adjusts the learning rate for each parameter individually.
The first moment (mean) and second moment (variance) of the gradient are used to perform parameter updates.
It is robust to learning rate settings and performs well on many problems.
Hyperparameters include betas (decay rates of the first and second moment estimates) and eps (a small value to prevent division by zero).
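
A minimal sketch, reusing model and optim from the SGD example; the values shown are PyTorch's defaults:

optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)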

3. torch.optim.RMSprop (Root Mean Square Propagation):

An optimizer that uses the second moments (variance) of the gradient to compute an adaptive learning rate.
Performs parameter updates using an exponential moving average of past squared gradients.
It is robust to learning rate settings and is suitable for problems with sparse gradients.
Hyperparameters include lr (learning rate), alpha (smoothing constant of the moving average of squared gradients), and eps (a small value to prevent division by zero).
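
A minimal sketch with PyTorch's default values, again reusing model and optim from above:

optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, eps=1e-8)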

4. torch.optim.Adagrad (Adaptive Gradient):

An optimizer with a per-parameter adaptive learning rate.
It performs parameter updates using the accumulated sum of squared past gradients.
As learning progresses, the learning rate decreases monotonically.
It is suitable for datasets with sparse features and problems with high dimensionality of features.
The hyperparameters are lr (learning rate) and eps (minimal value to prevent division by zero).
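
A minimal sketch with PyTorch's default values:

optimizer = optim.Adagrad(model.parameters(), lr=0.01, eps=1e-10)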

5. torch.optim.Adadelta (Adaptive Delta):

This is an optimizer that solves the problem of monotonically decreasing learning rate in Adagrad.
Instead of accumulating the sum of squares of all past gradients, it uses exponentially decaying averages of squared gradients and of past updates.
In its original formulation no learning rate needs to be set, so fewer hyperparameters have to be tuned.
Hyperparameters include rho (decay rate of the running averages) and eps (a small value to prevent division by zero).
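
A minimal sketch with PyTorch's default values (note that PyTorch's Adadelta does expose an lr argument, default 1.0, which scales the update):

optimizer = optim.Adadelta(model.parameters(), rho=0.9, eps=1e-6)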

6. torch.optim.AdamW (Adam with Weight Decay):

A variant of Adam that applies weight decay decoupled from the gradient-based update (rather than the L2 penalty folded into plain Adam).
Weight decay regularization controls model complexity and suppresses overfitting.
The hyperparameters are lr (learning rate), betas (decay rate of first and second moments), eps (minimal value to prevent division by zero), and weight_decay (weight decay rate).
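
A minimal sketch with PyTorch's default values:

optimizer = optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)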

7. torch.optim.ASGD (Averaged Stochastic Gradient Descent):

A variant of stochastic gradient descent that additionally maintains an average of the parameters.
Alongside the usual SGD updates, it keeps a running average of the parameters, which can be used as the final solution.
It may show good generalization performance in the later stages of learning.
The hyperparameters are lr (learning rate), lambd (decay term), and alpha (power used in the learning-rate update).
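
A minimal sketch with PyTorch's default values (t0 is the step at which averaging starts):

optimizer = optim.ASGD(model.parameters(), lr=0.01, lambd=1e-4, alpha=0.75, t0=1e6)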

8. torch.optim.LBFGS (Limited-memory BFGS):

A quasi-Newton (second-order) optimization method that works with an approximation of the Hessian matrix.
It keeps only a limited memory of past updates rather than computing the full Hessian, so it can be applied to larger problems.
The learning rate rarely needs tuning, and fast convergence can be expected on smooth, deterministic problems.
Hyperparameters include max_iter (maximum number of iterations), max_eval (maximum number of function evaluations), tolerance_grad (threshold for determining convergence of gradient), and tolerance_change (threshold for determining convergence of change in function value).
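
Unlike the other optimizers, LBFGS re-evaluates the loss several times per step, so step() must be given a closure. A minimal sketch, reusing model, loss_fn, input, and target as in the training loop of the next section:

optimizer = optim.LBFGS(model.parameters(), max_iter=20, tolerance_grad=1e-7, tolerance_change=1e-9)

def closure():
    optimizer.zero_grad()             # reset gradients
    output = model(input)             # forward pass
    loss = loss_fn(output, target)    # compute loss
    loss.backward()                   # compute gradients
    return loss

optimizer.step(closure)               # LBFGS may call closure() multiple times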

I have introduced various optimizers, but I think it is a good idea to try Adam first, as it performs well on many problems.

4. How to use

After the model's forward and backward passes, call optimizer.step() to update the parameters.

for input, target in dataset:
     optimizer.zero_grad() # reset gradient to zero
     output = model(input) # Perform forward propagation on the model
     loss = loss_fn(output, target) # Calculate loss
     loss.backward() # calculate gradient
     optimizer.step() # update model parameters

It can be written very simply.

Summary

There are various optimizers, but in conclusion I recommend trying Adam first.
Thank you for reading.
