🦚

# 【Kaggle】BirdCLEF2024 5th place Solution explained

Published on 2024/06/20

Hi, this time I will explain the 5th place solution of BirdCLEF2024 to understand it more deeply.

# 0. Competition explanation

This competition's task is to classify 182 species of bird calls.
The evaluation metric is ROC-AUC, so we had to output predictions for all bird classes.

The competition host provides over 240,000 variable-length audio recordings of bird calls as training data, plus unlabeled data that can be used for augmentation or other purposes.

The problems that had to be addressed were:
imbalanced data, probably noisy and mixed test data, segments where no bird is calling, etc.

An additional point to note is the submission limit. This competition only allows CPU inference within 120 minutes, so we have to both preprocess the data and run inference in time.

If you want to know the details, I recommend checking the competition overview and dataset.

### 0.1 Points to see

• HGNet: a CNN that also has a mechanism similar to a transformer
• Simple and strong architecture

# 1. Data

Only the BirdCLEF2024 dataset was used.

# 2. Models

| Model | Backbone | Weights | Public | Private |
| --- | --- | --- | --- | --- |
| Raw signal | HGNetB0 | 3/5 folds | 0.720354 | 0.666671 |
| Spectrum | EfficientNetB0 | 1/5 folds | 0.708198 | 0.672360 |
| Raw+Spectrum | HGNetB0+EffB0 | 1/5 folds | 0.698408 | 0.660503 |
| Ensemble | | 0.5+0.4+0.1 | 0.743960 | 0.687173 |

# 3. Model Inputs

#### 3.1 Raw signal model

```python
# x: bs x 160000 raw waveform
x = x.view(bs, -1, 2)         # bs x 80000 x 2
x = torch.transpose(x, 2, 1)  # bs x 2 x 80000
# view() requires contiguous memory after transpose, so use reshape()
x = x.reshape(bs, 2, -1, 80)  # bs x 2 x 1000 x 80
feature = self.backbone(x)
```

This code performs a series of reshape and transpose operations on the input tensor `x` to prepare the raw waveform for feature extraction by a 2D vision backbone:

• Reshaping the original tensor into a more structured format (`bs x 80000 x 2`).
• Transposing dimensions to move the 2-sample axis into the channel position (`bs x 2 x 80000`).
• Reshaping again to fit the image-like input the backbone expects (`bs x 2 x 1000 x 80`).
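The reshaping steps above can be sanity-checked with a minimal numpy sketch (`torch.view`/`torch.transpose` behave like `np.reshape`/`np.transpose` here; the batch size `bs=2` is arbitrary):

```python
import numpy as np

bs = 2
x = np.zeros((bs, 160000))      # bs x 160000 raw waveform
x = x.reshape(bs, -1, 2)        # bs x 80000 x 2
x = np.transpose(x, (0, 2, 1))  # bs x 2 x 80000
x = x.reshape(bs, 2, -1, 80)    # bs x 2 x 1000 x 80
print(x.shape)                  # (2, 2, 1000, 80)
```

The end result is an image-shaped tensor, which is why an off-the-shelf vision backbone can consume it directly.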

#### 3.2 Spectrum model

```python
torchaudio.transforms.MelSpectrogram(
    32000,
    n_mels=512,
    f_min=0,
    f_max=16000,
    n_fft=2048 * 2,
    hop_length=512,
    normalized=True,
),
torchaudio.transforms.AmplitudeToDB(top_db=80)
```

These transforms convert the raw signal into a log-mel spectrogram.
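With these parameters, a 5-second clip at 32 kHz (160,000 samples) yields about 313 time frames. A quick check of that arithmetic (assuming torchaudio's default `center=True` padding):

```python
# Rough output-shape check for the mel spectrogram parameters above
# (assumes torchaudio's default center=True padding).
sample_rate = 32000
clip_seconds = 5
hop_length = 512
n_mels = 512

n_samples = sample_rate * clip_seconds   # 160000
n_frames = n_samples // hop_length + 1   # 313
print((n_mels, n_frames))                # spectrogram is n_mels x n_frames
```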

#### 3.3 Mix model

```python
raw_f = self.raw_model(x)
spec_f = self.spec_model(x)
feature = torch.cat([raw_f, spec_f], dim=1)
x = self.fc(feature)
# The spectrogram is resized to 256x256 for speed;
# the other mel-spectrogram parameters are the same as the spectrum model.
```

This code ensembles the raw and spectrum models by concatenating their features and passing them through a fully connected layer.
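A minimal numpy sketch of this late-fusion head; the per-model feature dimension (512 each) is an assumption for illustration, not a value from the write-up:

```python
import numpy as np

rng = np.random.default_rng(0)
bs, raw_dim, spec_dim, n_classes = 4, 512, 512, 182

raw_f = rng.standard_normal((bs, raw_dim))    # features from the raw-signal model
spec_f = rng.standard_normal((bs, spec_dim))  # features from the spectrum model
feature = np.concatenate([raw_f, spec_f], axis=1)  # bs x (raw_dim + spec_dim)

# Fully connected layer standing in for self.fc: weight (in, out) + bias.
W = rng.standard_normal((raw_dim + spec_dim, n_classes)) * 0.01
b = np.zeros(n_classes)
logits = feature @ W + b
print(logits.shape)  # (4, 182)
```

Concatenation keeps both feature views intact and lets the linear head learn how to weight them per class.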

# 4. Preprocess

• Example code
```python
# Peak-normalize only when the waveform exceeds 1
if wave.max() > 1:
    wave = wave / wave.max()
```

# 5. Augmentation

• XY cutout for the spectrum model
• Mixup with p=1 (applied to every training sample)

# 6. Train

### 6.1 Train/Infer

Randomly sample 5 seconds of audio for training; use the first 5 seconds for validation.
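A minimal sketch of that cropping scheme, assuming 32 kHz audio and zero-padding for clips shorter than 5 seconds (the padding strategy is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
SR = 32000       # BirdCLEF2024 sample rate
CLIP = 5 * SR    # 5-second window

def train_crop(wave):
    # Random 5-second crop for training; zero-pad clips that are too short.
    if len(wave) <= CLIP:
        return np.pad(wave, (0, CLIP - len(wave)))
    start = rng.integers(0, len(wave) - CLIP)
    return wave[start:start + CLIP]

def valid_crop(wave):
    # Deterministic first 5 seconds for validation.
    return np.pad(wave, (0, max(0, CLIP - len(wave))))[:CLIP]

print(train_crop(np.zeros(400000)).shape, valid_crop(np.zeros(100000)).shape)
```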

### 6.2 Loss

BCEWithLogitsLoss
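`BCEWithLogitsLoss` combines a sigmoid with binary cross-entropy in a numerically stable form, which suits the multi-label, per-class setup here. A minimal numpy sketch of that formula:

```python
import numpy as np

def bce_with_logits(logits, targets):
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|)).
    z = np.asarray(logits, dtype=float)
    y = np.asarray(targets, dtype=float)
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

# A zero logit against a positive target gives log(2).
print(round(bce_with_logits([0.0], [1.0]), 4))  # 0.6931
```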

# 7. Inference

Used OpenVINO for CPU inference.

# 8. Extra

The author wrote a discussion post about the method of feeding raw signals into vision models. It is interesting, so please take a look.

Intro of discussion: