【Kaggle】HMS 3rd Solution part3 -Great Biases-
This is a continuation of part 1 and part 2.
In this article, I'll explain the models and the rest of this solution:
5. Models
The authors made two models:
・2DCNN model using augmented mel spectrograms
・1DCNN with Squeezeformer blocks
5.1 2DCNN
This was covered in the previous article.
5.2 1DCNN + Squeezeformer block
The architecture is as follows:
First, they stacked the EEG channels and applied a Butterworth filter (to remove high-frequency components), the same preprocessing as for the 2DCNN input; the resulting input size is (batch size, 1, 16, 10000).
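As a side note, here is a minimal sketch of this kind of preprocessing with SciPy; the cutoff frequency, sampling rate, and filter order below are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def butter_lowpass(x, cutoff=20, fs=200, order=2, axis=-1):
    """Butterworth low-pass filter along the time axis.
    cutoff/fs/order are placeholder values, not the authors' settings."""
    b, a = butter(order, cutoff, btype="low", fs=fs)
    return filtfilt(b, a, x, axis=axis)

# Stacked EEG shaped like the 1DCNN input: (batch, 1, 16 nodes, 10000 time steps)
eeg = np.random.randn(4, 1, 16, 10000)
eeg = butter_lowpass(eeg)
print(eeg.shape)  # (4, 1, 16, 10000)
```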
Next, features are extracted through four convolution phases, each with normalization and activation:
- Extract time-wise features with a depthwise 2D conv (1x16) (which works like a 1D conv), followed by LayerNorm + Swish activation + pooling (1, 8).
- A point-wise convolution first mixes the feature maps that the depth-wise convolution expanded along the filter dimension. After LN + Swish, deeper time-wise features are extracted with another depthwise conv (1x16), this time followed immediately by a point-wise convolution (the pooling is for stability / efficient computation).
※ A depth-wise conv followed by a point-wise conv works like an efficient (separable) convolution layer.
- A depth-wise conv (3x1) sandwiched by point-wise convs. This handles interactions between nearby EEG nodes only. By stacking multiple point-wise layers, the model can learn the mutual relationships of the filters used in the depth-wise conv more deeply.
- A depth-wise conv (1x8) sandwiched by point-wise convs, which combines the time-direction features with the EEG-node-direction features produced by the previous (3x1) conv filter.
Through these phases, features are extracted from the raw signal.
Finally, they are fed into a 3-layer Squeezeformer to recognize what is happening; the Squeezeformer (encoder) output is then passed to a linear layer (head) to produce the logits (the values that go into the softmax).
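To make the depth-wise / point-wise pattern concrete, here is a minimal PyTorch sketch of one such phase. This is my own illustration under assumptions (channel counts, padding, pooling, and GroupNorm standing in for the LayerNorm mentioned above), not the authors' code.

```python
import torch
import torch.nn as nn

class SeparableConvPhase(nn.Module):
    """Point-wise -> depth-wise -> point-wise convolution with norm and Swish,
    in the spirit of the phases described above (details are assumptions)."""
    def __init__(self, channels, dw_kernel=(1, 16), pool=None):
        super().__init__()
        pad = (dw_kernel[0] // 2, dw_kernel[1] // 2)
        self.pw_in = nn.Conv2d(channels, channels, kernel_size=1)   # mix filters
        self.dw = nn.Conv2d(channels, channels, kernel_size=dw_kernel,
                            padding=pad, groups=channels)           # depth-wise
        self.pw_out = nn.Conv2d(channels, channels, kernel_size=1)  # mix filters again
        self.norm = nn.GroupNorm(1, channels)  # stand-in for the LayerNorm in the writeup
        self.act = nn.SiLU()                   # Swish
        self.pool = nn.AvgPool2d(pool) if pool else nn.Identity()

    def forward(self, x):
        x = self.pw_in(x)
        x = self.act(self.norm(self.dw(x)))
        x = self.pw_out(x)
        return self.pool(x)

# Toy input after an initial time-wise stem: (batch, filters, 16 nodes, time)
x = torch.randn(2, 32, 16, 1250)
x = SeparableConvPhase(32, dw_kernel=(3, 1))(x)  # interactions between nearby nodes
x = SeparableConvPhase(32, dw_kernel=(1, 8))(x)  # combine time and node features
print(x.shape)
```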
6. Training procedure
6.1 Melspec + 2DCNN backbones
Config
- epochs: 12 to 16
- lr: cosine schedule with 0.0012 initial value
- batch size: 32
- drop path: 0.2 (quite effective)
They used mixnet_l or mixnet_xl backbones, together with the magnified (zoomed-in) 10-second window.
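As an illustration of this setup, the sketch below builds a MixNet backbone with timm and a cosine LR schedule; the optimizer choice, in_chans, and num_classes are my assumptions, not confirmed details of their pipeline.

```python
import timm
import torch

# MixNet backbone with stochastic depth (drop path); in_chans/num_classes are assumptions.
model = timm.create_model("mixnet_xl", pretrained=True,
                          drop_path_rate=0.2, in_chans=3, num_classes=6)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0012)
# Cosine schedule decaying from the 0.0012 initial LR over the training epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=16)
```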
6.2 1DCNN + squeezeformer
Config
- epochs: ~32
- lr: cosine schedule with 0.001 init
- batch size: 64 (256 when using low-vote samples)
- hidden dimension: 128
- dropout (attention, ff, head, conv): all 0.1
- A larger hidden dimension or more layers did not help.
They trained 3 different versions of the 1D model with respect to the data:
- Only 8+ vote samples.
- 3-8 vote samples plus 8+ vote samples. At each epoch, they sub-sampled the 3-8 vote data and used a different loss for each set. The weight on the 3-8 vote batches was decayed over time, starting at 0.8 and decaying to ~0.2 on a linear or cosine schedule. This damped weighted loss was added to the loss for the samples with more than 8 votes.
- Finally, they added the 1-2 vote samples as another damped weighted loss and trained them in the same way along with the 3-8 vote set and the 8+ vote set. The 1-2 vote set is very noisy, so it started with a weight of 0.5 and ended with a weight of 0 (a sketch of this damped weighting follows below).
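Here is a minimal sketch of that damped weighting, assuming the per-set losses have already been computed (e.g. as KL divergences); the exact schedule shape and epoch handling are assumptions.

```python
import math

def damped_weight(epoch, n_epochs, start, end, schedule="cosine"):
    """Weight for a low-vote loss term, decaying from `start` to `end` over training."""
    t = epoch / max(n_epochs - 1, 1)
    if schedule == "linear":
        return start + (end - start) * t
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * t))  # cosine decay

def total_loss(loss_8plus, loss_3to8, loss_1to2, epoch, n_epochs=32):
    w_mid = damped_weight(epoch, n_epochs, start=0.8, end=0.2)  # 3-8 vote set
    w_low = damped_weight(epoch, n_epochs, start=0.5, end=0.0)  # 1-2 vote set
    return loss_8plus + w_mid * loss_3to8 + w_low * loss_1to2
```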
For diversity, they trained some 1D models with Butterworth bandpass orders 0 and 1. These were not as strong, but added some diversity to the blend.
7. Ensembling + Postprocessing
They made 18 2DCNN models and 5 1DCNN models.
Fifteen hours before the deadline, with the help of GPT-4, they created a simple neural net to learn the best blend weights over the 23 base models: 5 of them were 1D CNNs with different training procedures, and 18 were MixNet 2D CNNs with different backbones and 10-second magnification windows.
This improved the CV score, but the gain on the public LB was not so large.
Earlier, they had suspected that the distribution of their predictions was different from the distribution seen in the 8+ votes portion of the training dataset.
So they added a bias term for each of the 6 classes to the nnet; this bias is added to the logits. The CV difference was small, but when the public LB score came through, it jumped them from 9th to 4th place.
The nnet is quite simple:
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, n_models=23):
        super(Net, self).__init__()
        # one blend weight per base model (no bias in the linear layer)
        self.fc = nn.Linear(n_models, 1, bias=False)
        # per-class bias added to the blended logits, initialized to zero
        self.fc_c = nn.Parameter(torch.zeros(6)[None, :, None])

    def forward(self, x):
        # x: (batch, 6 classes, n_models) -> (batch, 6, 1), plus the class bias
        return self.fc(x) + self.fc_c
The bias starts at zeros, which is common practice in neural network training. Zero acts as a neutral starting point: the parameter begins with no influence on the network's output, and training can then adjust it in whichever direction minimizes the loss, based on the gradients it receives during backpropagation.
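For concreteness, a hypothetical fitting loop for this blend net (using the Net class and imports above) might look like the following, with random placeholder data standing in for the out-of-fold base-model logits and vote-distribution targets; shapes and hyperparameters are assumptions.

```python
oof = torch.randn(1000, 6, 23)                       # placeholder: stacked base-model logits
target = torch.softmax(torch.randn(1000, 6), dim=1)  # placeholder: vote distributions
net = Net(n_models=23)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    logp = torch.log_softmax(net(oof).squeeze(-1), dim=1)
    loss = nn.KLDivLoss(reduction="batchmean")(logp, target)
    loss.backward()
    opt.step()
```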
8. Ablation study (roughly)
・Ordering of the 16 node signals: -0.02
・Including low-vote samples in the 1D model: -0.01
・Augmentations: -0.03 or more
・Magnifying the annotated window: -0.006, and more by blending different views
・Removing sample-wise and batch-wise normalisation: -0.02
・FC layer to learn blend weights: -0.003
・Postprocessing with bias term: -0.001
Extra
One of them used a 4090 GPU instance rented from runpod.io.
Neptune.ai was their MLOps stack to track, compare, and share models. It was heavily used and allowed an easy 4-fold grouped view of models and the mean of the validation scores across all folds, which made tracking a lot easier.
Highlights in my opinion
・General normalization with x = x.clip(-1024, 1024) / 32 (sketched below)
・Effective augmentations like masking, various narrower Butterworth bandpasses, and left-right flips
・EEG normalization by subtracting the other 15 channel nodes from each single node
・Zooming in on the 10-second window
・MixNet (which I didn't know about)
・Only length-3 convolutions along the node direction (in the Squeezeformer part)
・Drop path
・Weight decay on the low-trust (low-vote) data loss
・The nnet bias term
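A small sketch of the two normalization highlights above; the subtraction of the other 15 channels is interpreted here as subtracting their mean, which is my assumption about the exact form.

```python
import numpy as np

def normalize_eeg(x):
    """x: (16 nodes, n_samples) raw EEG (my interpretation of the highlights above)."""
    # General normalization: clip outliers and rescale
    x = np.clip(x, -1024, 1024) / 32
    # Per-node normalization: subtract the other 15 channels
    # (assumed here to mean subtracting their mean)
    other_mean = (x.sum(axis=0, keepdims=True) - x) / 15
    return x - other_mean

x = normalize_eeg(np.random.randn(16, 10000) * 100)
print(x.shape)  # (16, 10000)
```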
References
[1] DIETER, 3rd place solution, Kaggle, 2024
[2] darraghdog, kaggle-hms-3rd-place-solution, GitHub, 2024