【Method for ML】Machine Learning Pipeline
Introduction
In this article, I will explain the flow of building a model from scratch with PyTorch.
Feel free to use it as a reference when building your own pipeline.
0. Check goals/data
Check the goal, the evaluation metric, and the data, and then consider your approach.
In particular, look at the data from both a macro perspective (distributions, class balance) and a micro perspective (individual samples), and work out what preprocessing and data augmentation are needed to improve the target metric.
This is arguably the most important step.
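As a quick illustration of the macro/micro check, here is a minimal EDA sketch for a hypothetical tabular dataset (the file name and column names are assumptions):

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("train.csv")

# Macro view: overall shape, missing values, and label balance.
print(df.shape)
print(df.isna().mean())
print(df["label"].value_counts(normalize=True))

# Micro view: inspect a few individual samples by hand.
print(df.sample(5, random_state=0))
```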
1. Data preprocessing/storage
Transform the data into a shape the model can learn from, and save it in a format that can be loaded quickly during training.
Each preprocessing step should have a logical justification (though trying things out is sometimes valuable too), and if possible, run experiments to verify that the justification actually holds.
As a data analyst, always check whether your assumptions were correct, and when unexpected results appear, investigate why; this is how you arrive at better methods.
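As a minimal sketch of this step, assuming the same hypothetical tabular file and columns as above, here is one way to standardize features and cache them for fast loading:

```python
import numpy as np
import pandas as pd

# Hypothetical example: standardize numeric features and cache them
# as .npy files so training can load them quickly.
df = pd.read_csv("train.csv")  # assumed file name
features = df[["feat_a", "feat_b"]].to_numpy(dtype=np.float32)  # assumed columns

mean, std = features.mean(axis=0), features.std(axis=0)
features = (features - mean) / (std + 1e-8)  # guard against zero variance

np.save("features.npy", features)
np.save("labels.npy", df["label"].to_numpy())
```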
2. Fold split
Split the data into folds.
GroupKFold, StratifiedKFold, and StratifiedGroupKFold from scikit-learn are useful here.
When splitting the data, try to replicate as closely as possible the conditions in which the model will actually be used.
If you cannot build a reliable CV here, all your subsequent experiments may be in vain. Use every available means to make the distribution, quantity, variety, and quality of the validation data as similar as possible to the real environment.
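Here is a minimal sketch of a grouped, stratified split with scikit-learn's StratifiedGroupKFold; the data and group IDs are randomly generated for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# Hypothetical arrays: X features, y labels, groups (e.g., patient IDs).
X = np.random.rand(100, 8)
y = np.random.randint(0, 2, size=100)
groups = np.random.randint(0, 20, size=100)

# Stratify on the label while keeping each group in a single fold,
# which mimics deployment on unseen groups.
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(sgkf.split(X, y, groups)):
    print(f"fold {fold}: train={len(train_idx)} valid={len(valid_idx)}")
```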
3. Define dataset/data loader
In order for PyTorch to use data efficiently in training, we define a Dataset and a DataLoader.
When defining your datasets and data loaders, choose an appropriate batch size while keeping memory usage in mind. In general, a larger batch size gives each update a more diverse sample of the data, which stabilizes the loss signal and can help against overfitting.
If you set things up so that data augmentation can be applied at this stage, it will make your life easier later.
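A minimal Dataset/DataLoader sketch, assuming the .npy files cached in step 1 and a per-sample transform hook for augmentation:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """Minimal dataset sketch; file names and transform are assumptions."""

    def __init__(self, features, labels, transform=None):
        self.features = features
        self.labels = labels
        self.transform = transform  # hook for data augmentation

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x = self.features[idx]
        if self.transform is not None:
            x = self.transform(x)  # augmentation applied per sample
        return torch.as_tensor(x, dtype=torch.float32), int(self.labels[idx])

features = np.load("features.npy")  # cached in step 1
labels = np.load("labels.npy")
loader = DataLoader(MyDataset(features, labels), batch_size=64, shuffle=True)
```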
4. Model definition
Define the model.
You can define it yourself layer by layer, or use an external library such as timm.
Many experiments are possible, such as trying multiple architectures or feeding the predictions of simple models into another model as features.
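A sketch of both options; the MLP sizes, backbone name, and class count are assumptions:

```python
import timm
import torch.nn as nn

# Option 1: build it yourself layer by layer (tiny MLP sketch).
class MLP(nn.Module):
    def __init__(self, in_dim=2, hidden=64, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)

# Option 2: pull a pretrained backbone from timm.
model = timm.create_model("resnet18", pretrained=True, num_classes=2)
```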
5. Learning rate scheduler definition
Define a learning rate scheduler for more stable training.
Cosine annealing with warmup is a strong choice, but try different techniques to find one that suits the task.
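One way to get cosine annealing with warmup from stock PyTorch schedulers is to chain LinearLR and CosineAnnealingLR with SequentialLR; the warmup length and epoch counts below are assumptions:

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(8, 2)  # placeholder model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Linear warmup for 5 epochs, then cosine decay for the remaining 45.
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=45)
scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[5])

for epoch in range(50):
    # ... train one epoch ...
    scheduler.step()
```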
6. Optimizer definition
Define an optimizer suitable for your model.
Adam is a solid default for many tasks.
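A minimal sketch; the learning rate and weight decay values are assumptions to tune per task:

```python
import torch

model = torch.nn.Linear(8, 2)  # placeholder model for illustration

# Adam with typical starting hyperparameters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
```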
7. Loss definition
Define the loss function.
It measures model performance by comparing the model's output with the ground-truth labels.
Use auxiliary losses and the like if necessary.
The loss is the signal the model optimizes, so a bug here silently ruins training; double-check it.
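A hypothetical sketch of a main loss combined with a weighted auxiliary loss (e.g., from an intermediate head); the 0.4 weight and the auxiliary head are assumptions:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def compute_loss(main_logits, aux_logits, targets):
    # Main classification loss plus a down-weighted auxiliary loss.
    main_loss = criterion(main_logits, targets)
    aux_loss = criterion(aux_logits, targets)
    return main_loss + 0.4 * aux_loss

logits = torch.randn(4, 2)          # fake batch for illustration
aux = torch.randn(4, 2)
targets = torch.randint(0, 2, (4,))
print(compute_loss(logits, aux, targets))
```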
8. CV evaluation
Evaluate model performance.
Evaluation using CV (cross-validation) is common, but confirm that the CV score correlates with the score on external data (e.g., a leaderboard or production metric) before you start optimizing your CV.
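A minimal out-of-fold (OOF) evaluation sketch with a stand-in model, showing how each sample is scored only by a model that never saw it during training; the data is randomly generated for illustration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(100, 8)
y = np.random.randint(0, 2, size=100)
oof = np.zeros_like(y)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X, y):
    model = DummyClassifier(strategy="most_frequent")  # stand-in model
    model.fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict(X[valid_idx])  # predictions from unseen folds

print("OOF accuracy:", accuracy_score(y, oof))
```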
Summary
In this article, I summarized the flow of building a model with PyTorch.