【Kaggle】ISIC2020 2nd Solution explained
This time, I'll explain the 2nd place solution of the Kaggle ISIC2020 (SIIM-ISIC Melanoma Classification) competition.
1. Batch Size
From his experiments, a large batch size was effective for the score, so he set up 4x Quadro RTX 6000 24GB and used a batch size of 64 (and when that would not fit, he settled for batch size 32 with gradient accumulation of 2).
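As a side note, gradient accumulation in PyTorch can look like the minimal sketch below; the tiny model and data here are dummies for illustration, not the author's actual code.

```python
import torch
import torch.nn as nn

# Dummy model and data for illustration; the actual solution trained EfficientNets on images
model = nn.Linear(10, 3)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(32, 10), torch.randint(0, 3, (32,))) for _ in range(4)]

accumulation_steps = 2  # effective batch size = 32 x 2 = 64
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets) / accumulation_steps  # scale so gradients match BS64
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```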
2. Backbones & Image Resolution
After trying several backbones like EfficientNet, SE-ResNeXt, and ResNeSt, he decided to tackle the remainder of the competition with EfficientNet.
He also found that smaller EfficientNets at higher resolutions were not as good. The best models for him were EfficientNet-B6 (initialized with noisy-student weights) and EfficientNet-B7, at image resolutions of 512x512 and 640x640 respectively, so he only used these moving forward.
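Loading these backbones might look like the sketch below with the timm library (whether he actually used timm is my assumption; in timm's model zoo the "_ns" suffix selects noisy-student weights).

```python
import timm

# 3 output classes, matching the 3-class setup described later in this post
model_b6 = timm.create_model("tf_efficientnet_b6_ns", pretrained=True, num_classes=3)
model_b7 = timm.create_model("tf_efficientnet_b7", pretrained=True, num_classes=3)
```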
3. Pooling
He used a standard CNN model with generalized mean (GeM) pooling with a trainable parameter p (he is not sure whether GeM is better than average or max pooling, as he just used it from the beginning).
He also used multisample dropout.
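For reference, here is a minimal sketch of the standard GeM formulation (not necessarily his exact implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized mean pooling with a trainable exponent p.
    p=1 is average pooling; p -> infinity approaches max pooling."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)
        self.eps = eps

    def forward(self, x):
        # x: (B, C, H, W) -> (B, C, 1, 1)
        return F.avg_pool2d(x.clamp(min=self.eps).pow(self.p), x.shape[-2:]).pow(1.0 / self.p)

pooled = GeM()(torch.randn(2, 2304, 16, 16))  # e.g. EfficientNet-B6 feature maps
```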
4. Increasing Classes from 2 to 3
He felt that more granular classes would result in better feature representations that could help improve performance.
The 2019 data all had auxiliary diagnoses, including nevi, whereas a large fraction of the 2020 data was unknown.
So he trained a model on the 2019 data only, using the diagnosis as the target, and applied that model to the 2020 data.
He then searched for the threshold at which he would label an image as a nevus; finally, he used the 5th percentile of the 2019 model's predictions on the 2020 data that had a known label of nevus.
With that, all of the 2019 and 2020 data had a label of other, benign nevus, or melanoma, and he trained his model on these 3 classes using vanilla cross-entropy loss.
He did not try label smoothing.
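A minimal sketch of that thresholding step, assuming pandas DataFrames with hypothetical column names (not his actual code):

```python
import numpy as np
import pandas as pd

# Hypothetical 2020 frame; "nevus_prob" holds the 2019 model's predicted nevus probability
df_2020 = pd.DataFrame({
    "diagnosis": ["nevus", "nevus", "unknown", "unknown"],
    "nevus_prob": [0.95, 0.80, 0.90, 0.10],
})

# 5th percentile of the model's predictions on 2020 images with a known nevus label
threshold = np.percentile(df_2020.loc[df_2020["diagnosis"] == "nevus", "nevus_prob"], 5)

# Relabel unknown images: above the threshold -> benign nevus, otherwise -> other
unknown = df_2020["diagnosis"] == "unknown"
df_2020.loc[unknown, "label"] = np.where(
    df_2020.loc[unknown, "nevus_prob"] >= threshold, "nevus", "other"
)
```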
5. Upsampling
There were many discussions over whether or not to upsample malignant images.
He upsampled the 2020 melanoma images 7 times so that the 2019 melanoma data would not overwhelm the 2020 data.
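In pandas, this kind of upsampling is just row duplication (a sketch with made-up column names):

```python
import pandas as pd

# Hypothetical training frame; in practice the melanoma rows here would be the 2020 ones
df_train = pd.DataFrame({"image": ["a", "b", "c"], "target": ["melanoma", "nevus", "other"]})

melanoma = df_train[df_train["target"] == "melanoma"]
df_train = pd.concat([df_train] + [melanoma] * 6, ignore_index=True)  # 7 copies in total
```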
6. Training
Optimizer: AdamW
LR Scheduler: Cosine annealing with warm restarts
Initial learning rate: 3.0e-4
snapshots: 3
epochs: 2 per snapshot for EfficientNet-B6 and 3 per snapshot for EfficientNet-B7
folds: 5 in every experiment, triple stratified splits (single validation folds were not stable for him)
validation data: 2020 only
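A minimal sketch of this optimizer/scheduler setup; mapping "3 snapshots, 2 epochs each" (the B6 setting) to T_0=2 is my assumption.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Linear(10, 3)  # stand-in for the EfficientNet backbone
optimizer = AdamW(model.parameters(), lr=3.0e-4)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=2)  # restart every 2 epochs

for epoch in range(6):  # 3 snapshots x 2 epochs each
    # ... run one training epoch here ...
    scheduler.step()
    if (epoch + 1) % 2 == 0:  # save a snapshot at each restart
        torch.save(model.state_dict(), f"snapshot_{(epoch + 1) // 2}.pth")
```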
Metadata Utilization
In the beginning, he embedded age, sex, and location as features and concatenated them to the image model's output before the final linear layer (missing values were imputed with the mean/mode).
He didn't spend a lot of time on this because he was afraid of overfitting from it.
In the last several days of the competition, he tried training models without metadata for the "without context" special prize. It turned out that these models were actually his highest scoring private solutions.
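A minimal sketch of this kind of late-fusion head (layer sizes and names are illustrative, not his exact code):

```python
import torch
import torch.nn as nn

class MetaModel(nn.Module):
    """Concatenate metadata features to the image features before the classifier."""
    def __init__(self, backbone, n_image_features, n_meta_features, n_classes=3):
        super().__init__()
        self.backbone = backbone  # CNN mapping images to (B, n_image_features)
        self.classifier = nn.Linear(n_image_features + n_meta_features, n_classes)

    def forward(self, image, meta):
        x = self.backbone(image)         # (B, n_image_features)
        x = torch.cat([x, meta], dim=1)  # append embedded age/sex/location features
        return self.classifier(x)
```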
7. Augmentations
Augmentations are important because there are few melanoma (malignant) images in the 2020 data.
He used the RandAugment strategy [2] with N=3 and magnitude=M/30, where M was sampled from a Poisson distribution with mean 12 for extra stochasticity.
N is the number of augmentation transformations to apply sequentially, and
M is the magnitude for all the transformations.
For augmentations like flips, M is not relevant. He tried other augmentations such as mixup, cutmix, and GridMask, but those did not help.
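A minimal sketch of the magnitude sampling; rand_augment below is a hypothetical helper standing in for an actual RandAugment implementation.

```python
import numpy as np

N = 3  # number of transforms applied sequentially

def sample_magnitude(mean=12, scale=30.0):
    M = np.random.poisson(lam=mean)  # Poisson-distributed magnitude for extra stochasticity
    return M / scale                 # normalized magnitude; expected value 12/30 = 0.4

# image = rand_augment(image, n=N, magnitude=sample_magnitude())  # hypothetical helper
```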
He also used square cropping during training and inference.
Training: cropped to a square matching the smaller side (for example, 768x512 would become 512x512)
Inference: cropped to a square matching the smaller side 10 times, and the predictions were averaged as TTA (for images that were already square, no TTA was applied)
He found that this gave him better results than rectangular crops or using the whole image.
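A minimal sketch of the square cropping and the crop-based TTA, assuming numpy HWC images (predict is a hypothetical function):

```python
import numpy as np

def random_square_crop(image):
    """Randomly crop a square whose side equals the smaller image dimension."""
    h, w = image.shape[:2]
    side = min(h, w)
    top = np.random.randint(0, h - side + 1)
    left = np.random.randint(0, w - side + 1)
    return image[top:top + side, left:left + side]

image = np.zeros((512, 768, 3), dtype=np.uint8)  # dummy 768x512 image
crop = random_square_crop(image)                 # -> 512x512x3

# Inference: average predictions over 10 random square crops
# preds = np.mean([predict(random_square_crop(image)) for _ in range(10)], axis=0)
```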
8. Pseudolabeling
Pseudolabeling is the key to his solution. Given the limited number of 2020 melanomas, he felt that pseudolabeling would help increase performance. The 2019 melanomas were helpful but still different from the 2020 melanomas.
He took his 5-fold EfficientNet-B6 model, trained without metadata, and obtained soft pseudolabels (3 classes) for the test set.
When combining the test data with the training data (2019+2020), he upsampled images with a melanoma prediction > 0.5 seven times (the same factor as for the 2020 training data).
He used @cpmpml's implementation (https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/173733) of cross entropy in PyTorch (without label smoothing) so he could use soft pseudolabels.
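For reference, cross entropy with soft targets can be written as the minimal sketch below (in the spirit of the linked implementation, not a verbatim copy).

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    """logits: (B, C) raw scores; soft_targets: (B, C) probabilities summing to 1."""
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

loss = soft_cross_entropy(torch.randn(4, 3), torch.softmax(torch.randn(4, 3), dim=1))
```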
9. CV vs LB
He knew early on that it would be easy to fit the public LB, given the small number of melanomas that would be in the public test set. At the same time, the CV for his different experiments was much tighter than the LB, so he was nervous to trust CV, as there may have been differences between the training and test data.
With that in mind, he favored solutions that had reasonably high CV and LB.
There is a fair amount of luck that goes into picking the right solution, but you should be able to justify to yourself why you are picking a certain solution over another (going by CV score, LB score, some combination of CV/LB, or some hypotheses about the private test set that would favor one solution over another).
His 2nd place solution was an ensemble of 3 models, each trained with 5 folds.
・EfficientNet-B6, 512x512, BS64, no metadata (CV 0.9336 / public 0.9534)
・EfficientNet-B7, 640x640, BS32, gradient accumulation 2, no metadata (CV 0.9389 / public 0.9525)
・Model 1, trained on combined training and pseudolabeled test data (CV 0.9438 / public 0.9493)
It seems the bigger models achieved higher scores.
Note that the CV score does not account for the 5-fold blend effect.
His highest scoring private LB solution was actually model 3 alone, which he did not select.
He trained other models with metadata, but the best private LB score among them was 0.945 (public LB 0.959), with similar CV.
10. Environment
PyTorch 1.6
AMP enabled
4x NVIDIA GTX 1080 Ti 11GB
4x Quadro RTX 6000 24GB
140GB of VRAM in total
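Native AMP in PyTorch 1.6 looks like the following sketch (model, loader, criterion, and optimizer are placeholders for the actual training objects).

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for images, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # forward pass runs in mixed precision
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()    # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)           # unscales gradients, then runs the optimizer step
    scaler.update()
```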
11. Summary
This time, I explained the 2nd place solution of the ISIC2020 competition.
This write-up is so insightful and taught me so much; the most impressive content is the CV/LB selection discussion in chapter 9.
"You should be able to justify to yourself why you are picking a certain solution over another."
These are powerful words.
If you are interested in the solution, please check out the original write-up.
References
[1] Kaggle, "[2nd place] Solution Overview"
[2] Papers with Code, "RandAugment"