🦚

# 【Kaggle】BirdCLEF2024 5th place Solution explained

Published on 2024/06/20

Hi, this time I will explain the 5th place solution of BirdCLEF2024 to understand it more deeply.

# 0. Competition explanation

This competition's task is to classify 182 species of bird calls.
The evaluation metric is ROC-AUC, so we had to output predictions for all bird classes.

The competition host provides over 240,000 variable-length audio recordings of bird calls as training data, plus unlabeled data that can be used for augmentation or other purposes.

The problems that had to be addressed were:
imbalanced data, probably noisy and mixed test data, segments where no bird is calling, etc.

An additional point to note is the submission limit. This competition only allows CPU inference within 120 minutes, so we have to both preprocess the data and run inference in time.

If you want to know the details, I recommend checking the competition overview and dataset.

### 0.1 Points to see

• HGNet: a CNN that also has a mechanism similar to a transformer
• Simple and strong architecture

# 1. Data

Only the BirdCLEF2024 dataset was used.

# 2. Models

| Model | Backbone | Weights | Public | Private |
| --- | --- | --- | --- | --- |
| Raw signal | HGNetB0 | 3/5 folds | 0.720354 | 0.666671 |
| Spectrum | EfficientNetB0 | 1/5 folds | 0.708198 | 0.672360 |
| Raw+Spectrum | HGNetB0+EffB0 | 1/5 folds | 0.698408 | 0.660503 |
| Ensemble | | 0.5+0.4+0.1 | 0.743960 | 0.687173 |

# 3. Model Inputs

#### 3.1 Raw signal model

```python
# x: bs x 160000 raw waveform
x = x.view(bs, -1, 2)         # bs x 80000 x 2
x = torch.transpose(x, 2, 1)  # bs x 2 x 80000
# view() requires contiguous memory after transpose, so use reshape()
x = x.reshape(bs, 2, -1, 80)  # bs x 2 x 1000 x 80
feature = self.backbone(x)
```

This code performs a series of reshape and transpose operations on the input tensor `x` to prepare the raw waveform for feature extraction by a 2D vision backbone:

• Reshaping the original tensor into a more structured format (`bs x 80000 x 2`).
• Transposing dimensions to move the 2-sample axis into the channel position (`bs x 2 x 80000`).
• Reshaping again to fit the image-like input the backbone expects (`bs x 2 x 1000 x 80`).
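The reshaping steps above can be sanity-checked with a minimal numpy sketch (`torch.view`/`torch.transpose` behave like `np.reshape`/`np.transpose` here; the batch size `bs=2` is arbitrary):

```python
import numpy as np

bs = 2
x = np.zeros((bs, 160000))      # bs x 160000 raw waveform
x = x.reshape(bs, -1, 2)        # bs x 80000 x 2
x = np.transpose(x, (0, 2, 1))  # bs x 2 x 80000
x = x.reshape(bs, 2, -1, 80)    # bs x 2 x 1000 x 80
print(x.shape)                  # (2, 2, 1000, 80)
```

The end result is an image-shaped tensor, which is why an off-the-shelf vision backbone can consume it directly.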

#### 3.2 Spectrum model

```python
torchaudio.transforms.MelSpectrogram(
    32000,
    n_mels=512,
    f_min=0,
    f_max=16000,
    n_fft=2048 * 2,
    hop_length=512,
    normalized=True,
),
torchaudio.transforms.AmplitudeToDB(top_db=80)
```

These transforms convert the raw signal into a log-mel spectrogram.
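With these parameters, a 5-second clip at 32 kHz (160,000 samples) yields about 313 time frames. A quick check of that arithmetic (assuming torchaudio's default `center=True` padding):

```python
# Rough output-shape check for the mel spectrogram parameters above
# (assumes torchaudio's default center=True padding).
sample_rate = 32000
clip_seconds = 5
hop_length = 512
n_mels = 512

n_samples = sample_rate * clip_seconds   # 160000
n_frames = n_samples // hop_length + 1   # 313
print((n_mels, n_frames))                # spectrogram is n_mels x n_frames
```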

#### 3.3 Mix model

```python
raw_f = self.raw_model(x)
spec_f = self.spec_model(x)
feature = torch.cat([raw_f, spec_f], dim=1)
x = self.fc(feature)
# The spectrogram is resized to 256x256 for speed;
# the other mel-spectrogram parameters are the same as the spectrum model.
```

This code ensembles the raw and spectrum models by concatenating their features and passing them through a fully connected layer.
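A minimal numpy sketch of this late-fusion head; the per-model feature dimension (512 each) is an assumption for illustration, not a value from the write-up:

```python
import numpy as np

rng = np.random.default_rng(0)
bs, raw_dim, spec_dim, n_classes = 4, 512, 512, 182

raw_f = rng.standard_normal((bs, raw_dim))    # features from the raw-signal model
spec_f = rng.standard_normal((bs, spec_dim))  # features from the spectrum model
feature = np.concatenate([raw_f, spec_f], axis=1)  # bs x (raw_dim + spec_dim)

# Fully connected layer standing in for self.fc: weight (in, out) + bias.
W = rng.standard_normal((raw_dim + spec_dim, n_classes)) * 0.01
b = np.zeros(n_classes)
logits = feature @ W + b
print(logits.shape)  # (4, 182)
```

Concatenation keeps both feature views intact and lets the linear head learn how to weight them per class.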

# 4. Preprocess

• Example code
```python
# Peak-normalize only when the waveform exceeds 1
if wave.max() > 1:
    wave = wave / wave.max()
```

# 5. Augmentation

• XY cutout for the spectrum model
• Mixup with p=1 (applied to every training sample)

# 6. Train

### 6.1 Train/Infer

Randomly sample 5 seconds of audio for training; use the first 5 seconds for validation.
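A minimal sketch of that cropping scheme, assuming 32 kHz audio and zero-padding for clips shorter than 5 seconds (the padding strategy is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
SR = 32000       # BirdCLEF2024 sample rate
CLIP = 5 * SR    # 5-second window

def train_crop(wave):
    # Random 5-second crop for training; zero-pad clips that are too short.
    if len(wave) <= CLIP:
        return np.pad(wave, (0, CLIP - len(wave)))
    start = rng.integers(0, len(wave) - CLIP)
    return wave[start:start + CLIP]

def valid_crop(wave):
    # Deterministic first 5 seconds for validation.
    return np.pad(wave, (0, max(0, CLIP - len(wave))))[:CLIP]

print(train_crop(np.zeros(400000)).shape, valid_crop(np.zeros(100000)).shape)
```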

### 6.2 Loss

BCEWithLogitsLoss
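`BCEWithLogitsLoss` combines a sigmoid with binary cross-entropy in a numerically stable form, which suits the multi-label, per-class setup here. A minimal numpy sketch of that formula:

```python
import numpy as np

def bce_with_logits(logits, targets):
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|)).
    z = np.asarray(logits, dtype=float)
    y = np.asarray(targets, dtype=float)
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

# A zero logit against a positive target gives log(2).
print(round(bce_with_logits([0.0], [1.0]), 4))  # 0.6931
```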

# 7. Inference

Used OpenVINO for CPU inference.

# 8. Extra

The author wrote a discussion post about the method of feeding raw signals into vision models. It is interesting, so please take a look.

Intro of discussion: