
【Method for ML】Machine Learning Pipeline

Published on 2024/09/16

Introduction

In this article, I walk through the flow of building a model from scratch with PyTorch.

Please feel free to use this as a reference when building your pipeline.

0. Check goals/data

Check the goals and metrics you want to achieve, examine the data, and consider which method to use.

In particular, look at the data from both a macro perspective (overall distributions) and a micro perspective (individual samples), and work out which preprocessing and data augmentations are needed to reach the target metrics.
This is arguably the most important step.
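As a rough illustration of the macro and micro views, here is a minimal sketch on toy data. The arrays, the planted missing value, and the planted outlier are all made up for the demo; on a real project you would run the same kind of checks on your own data.

```python
from collections import Counter

import numpy as np

# Toy stand-in for a real dataset: 200 samples, 3 features, imbalanced binary labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
labels = np.array([0] * 180 + [1] * 20)
features[5, 1] = np.nan   # a missing value planted for the demo
features[42, 0] = 25.0    # an outlier planted for the demo

# Macro view: class balance, missing values, per-feature statistics.
class_counts = Counter(labels.tolist())
n_missing = int(np.isnan(features).sum())
feature_mean = np.nanmean(features, axis=0)
feature_std = np.nanstd(features, axis=0)
print("class counts:", dict(class_counts))
print("missing values:", n_missing)

# Micro view: pull out the individual samples that deviate most and read them.
z = np.abs((features - feature_mean) / feature_std)
max_z_per_row = np.nanmax(z, axis=1)
suspect_rows = np.argsort(max_z_per_row)[::-1][:3]
for i in suspect_rows:
    print(i, labels[i], features[i])
```

The macro numbers suggest what to handle globally (class imbalance, missing values), while reading the suspect rows one by one tells you whether they are errors to fix or rare cases the model must handle.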

1. Data preprocessing/storage

Transform your data into a shape the model can learn from and save it in a format that is easy to read back.
Each preprocessing step should have a logical justification (although trying things out is sometimes valuable too). Where possible, run focused experiments to confirm that the justification actually holds.
Anyone doing data analysis should always check whether their assumptions were correct, and when unexpected results appear, investigate why, in order to arrive at better methods.
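The idea of justifying a preprocessing step and then verifying it can be sketched as follows. This is only an example under assumptions: the feature is synthetic, the log transform is one possible choice, and the file name is hypothetical.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # toy right-skewed feature


def skewness(x):
    """Sample skewness: the third standardized moment."""
    centered = x - x.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)


# Hypothesis: the feature is heavy-tailed, so a log transform should make it more symmetric.
processed = np.log1p(raw).astype(np.float32)

# Verify the hypothesis instead of assuming it.
skew_before, skew_after = skewness(raw), skewness(processed)
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# Save in a format that loads quickly during training (hypothetical file name).
out_path = os.path.join(tempfile.mkdtemp(), "feature.npy")
np.save(out_path, processed)
restored = np.load(out_path)
```

If the measured skewness had not dropped, that would be the kind of unexpected result worth investigating before moving on.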

2. Fold division

Split the data into folds.

GroupKFold, StratifiedKFold, and StratifiedGroupKFold are useful.
When splitting the data, replicate as closely as possible the conditions under which the model will actually be used.
If you cannot build a reliable CV here, all of your later experiments may be in vain. Use every means available to make the distribution, quantity, variety, and quality of the validation data as similar as possible to the real environment.

3. Define dataset/data loader

So that PyTorch can feed data efficiently during training, we define a Dataset and a DataLoader.
When defining your datasets and data loaders, choose a batch size that fits within memory. In general, a larger batch covers a more diverse set of samples, which makes the loss and gradient estimates less noisy and training more stable; because the effect on generalization varies by task, treat batch size as a hyperparameter to tune.
If you set things up at this point so that data augmentation can be applied, later work will be easier.
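A minimal sketch of such a setup is shown below. The class name `ArrayDataset`, the `add_gaussian_noise` augmentation, and the batch sizes are illustrative assumptions, not something prescribed by PyTorch; the point is only that the Dataset takes an optional `transform` so augmentation can be switched on for training and off for validation.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ArrayDataset(Dataset):
    """Wraps feature/label tensors; `transform` is an optional augmentation hook."""

    def __init__(self, features, labels, transform=None):
        self.features = features
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x = self.features[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[idx]


def add_gaussian_noise(x, std=0.1):
    """A toy augmentation: jitter the input with Gaussian noise."""
    return x + std * torch.randn_like(x)


features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))

train_ds = ArrayDataset(features, labels, transform=add_gaussian_noise)
valid_ds = ArrayDataset(features, labels)  # no augmentation for validation

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, drop_last=True)
valid_loader = DataLoader(valid_ds, batch_size=64, shuffle=False)

xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)
```

In a real pipeline the two datasets would be built from the training and validation indices of the current fold rather than from the same tensors.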

4. Model definition

Define the model.

You can build it yourself from individual layers, or you can use an external library such as timm.

There is plenty of room for experiments here, such as trying multiple models or using the outputs of simple models as features.
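For reference, here is a small hand-written model as a sketch. The name `SmallMLP` and the layer sizes are arbitrary choices for illustration. With timm, the usual entry point is `timm.create_model`, shown only as a comment so the example runs without that library.

```python
import torch
from torch import nn


class SmallMLP(nn.Module):
    """A tiny classifier: a feature backbone followed by a classification head."""

    def __init__(self, in_features, hidden, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))


model = SmallMLP(in_features=8, hidden=32, num_classes=2)
logits = model(torch.randn(4, 8))
n_params = sum(p.numel() for p in model.parameters())
print(logits.shape, n_params)

# With timm installed, an image model could be created along these lines instead:
# model = timm.create_model("resnet18", pretrained=True, num_classes=2)
```

Keeping the backbone and the head as separate attributes makes it easier to swap the backbone or reuse its output as features in another model.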

5. Learning rate scheduler definition

Define a learning rate scheduler for more stable learning.
Cosine annealing with warmup is a strong choice, but try different techniques to find one that suits the task.
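One way to get cosine annealing with warmup is to write the schedule as a plain function and pass it to `LambdaLR`. The sketch below assumes a linear warmup and a decay all the way to zero; the function name `warmup_cosine` and the step counts are illustrative.

```python
import math

import torch


def warmup_cosine(step, warmup_steps, total_steps):
    """LR multiplier: linear ramp up to 1.0, then cosine decay down to 0.0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
warmup_steps, total_steps = 10, 100
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda s: warmup_cosine(s, warmup_steps, total_steps)
)

lrs = []
for _ in range(total_steps):
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()   # a real loop would compute the loss and call backward() first
    scheduler.step()   # here the scheduler is stepped once per batch

print(lrs[0], max(lrs), lrs[-1])
```

Plotting `lrs` before a long run is a cheap way to confirm the schedule looks the way you intended.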

6. Optimizer definition

Define an optimizer suitable for your model.

Adam is useful for many tasks.
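As a small sketch, this is how Adam can be defined and used for a single update step. The learning rate and the toy model and data are assumptions for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))
before = model.weight.detach().clone()

optimizer.zero_grad()                             # clear gradients from the previous step
loss = nn.functional.cross_entropy(model(x), y)   # forward pass and loss
loss.backward()                                   # compute gradients
optimizer.step()                                  # update the parameters

changed = not torch.equal(before, model.weight.detach())
print(loss.item(), changed)
```

The same four calls (`zero_grad`, forward, `backward`, `step`) form the core of the training loop regardless of which optimizer you choose.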

7. Loss definition

Define the loss function.

The loss measures how far the model output is from the ground-truth data.

Add an auxiliary loss if necessary.
The loss is the signal the model is optimized against, so take care to implement it correctly.
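The sketch below combines a main loss with a down-weighted auxiliary loss and adds a simple sanity check. The helper `total_loss` and the weight of 0.3 are assumptions for illustration; the check relies on the fact that cross-entropy on uniform logits over C classes equals ln(C).

```python
import math

import torch
from torch import nn

criterion = nn.CrossEntropyLoss()  # expects raw logits and integer class indices


def total_loss(main_logits, aux_logits, targets, aux_weight=0.3):
    """Main loss plus a down-weighted auxiliary loss (aux_weight is an assumed value)."""
    return criterion(main_logits, targets) + aux_weight * criterion(aux_logits, targets)


# Sanity check: uniform logits over C classes should give a loss of ln(C).
num_classes = 4
uniform_logits = torch.zeros(8, num_classes)
targets = torch.randint(0, num_classes, (8,))
uniform_loss = criterion(uniform_logits, targets).item()
print(uniform_loss, math.log(num_classes))

combined = total_loss(uniform_logits, uniform_logits, targets).item()
print(combined)
```

Checks like this catch common mistakes early, for example passing probabilities instead of logits or mixing up the target format.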

8. CV evaluation

Evaluate model performance.

Evaluation with CV (cross-validation) is common, but before working to improve your CV score, make sure that it is correlated with the score on external data.
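A simple way to check that link is to record both scores for several experiments and look at their correlation. The helpers below are a sketch, and every number in the example is a placeholder for illustration only, not a real result.

```python
import numpy as np


def summarize_cv(fold_scores):
    """Mean and standard deviation of the per-fold validation scores."""
    scores = np.asarray(fold_scores, dtype=float)
    return float(scores.mean()), float(scores.std())


def cv_external_correlation(cv_scores, external_scores):
    """Pearson correlation between CV and external scores across experiments."""
    return float(np.corrcoef(cv_scores, external_scores)[0, 1])


# Placeholder numbers for illustration only, not real results.
fold_scores = [0.81, 0.79, 0.83, 0.80, 0.82]
mean_score, std_score = summarize_cv(fold_scores)
print(f"CV: {mean_score:.3f} +/- {std_score:.3f}")

cv_per_experiment = [0.70, 0.72, 0.75, 0.78]
external_per_experiment = [0.68, 0.71, 0.73, 0.77]
r = cv_external_correlation(cv_per_experiment, external_per_experiment)
print(f"correlation between CV and external score: {r:.2f}")
```

If the correlation is weak, it is usually better to revisit the fold split than to keep optimizing a CV score that does not track real performance.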

Summary

In this article, I summarized the flow of building a model with PyTorch.
