🦚

# 【Kaggle】BirdCLEF2024 5th place Solution explained

Published on 2024/06/20

Hi, this time I will explain the 5th place solution of BirdCLEF2024 to understand it more deeply.

# 0. Competition explanation

This competition's task is to classify 182 species of bird calls.
The evaluation metric is ROC-AUC, so we had to output predictions for all bird classes.

The competition host provides over 240,000 variable-length audio recordings of bird calls as training data, plus unlabeled data that can be used for augmentation or other purposes.

The problems that had to be addressed were:
imbalanced data, probably noisy and mixed test data, segments where no bird is calling, etc.

An additional point to note is the submission limit. This competition only allows CPU inference within 120 minutes, so we have to both preprocess the data and run inference in time.

If you want to know the details, I recommend checking the competition overview and dataset.

### 0.1 Points to see

• HGNet: a CNN that also has a mechanism similar to a transformer
• Simple and strong architecture

# 1. Data

Only the BirdCLEF2024 dataset was used.

# 2. Models

| Model | Backbone | Weights | Public | Private |
| --- | --- | --- | --- | --- |
| Raw signal | HGNetB0 | 3/5 folds | 0.720354 | 0.666671 |
| Spectrum | EfficientNetB0 | 1/5 folds | 0.708198 | 0.672360 |
| Raw+Spectrum | HGNetB0+EffB0 | 1/5 folds | 0.698408 | 0.660503 |
| Ensemble | | 0.5+0.4+0.1 | 0.743960 | 0.687173 |

# 3. Model Inputs

#### 3.1 Raw signal model

```python
# x: bs x 160000 raw waveform
x = x.view(bs, -1, 2)         # bs x 80000 x 2
x = torch.transpose(x, 2, 1)  # bs x 2 x 80000
# view() requires contiguous memory after transpose, so use reshape()
x = x.reshape(bs, 2, -1, 80)  # bs x 2 x 1000 x 80
feature = self.backbone(x)
```

This code performs a series of reshape and transpose operations on the input tensor `x` to prepare the raw waveform for feature extraction by a 2D vision backbone:

• Reshaping the original tensor into a more structured format (`bs x 80000 x 2`).
• Transposing dimensions to move the 2-sample axis into the channel position (`bs x 2 x 80000`).
• Reshaping again to fit the image-like input the backbone expects (`bs x 2 x 1000 x 80`).
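The reshaping steps above can be sanity-checked with a minimal numpy sketch (`torch.view`/`torch.transpose` behave like `np.reshape`/`np.transpose` here; the batch size `bs=2` is arbitrary):

```python
import numpy as np

bs = 2
x = np.zeros((bs, 160000))      # bs x 160000 raw waveform
x = x.reshape(bs, -1, 2)        # bs x 80000 x 2
x = np.transpose(x, (0, 2, 1))  # bs x 2 x 80000
x = x.reshape(bs, 2, -1, 80)    # bs x 2 x 1000 x 80
print(x.shape)                  # (2, 2, 1000, 80)
```

The end result is an image-shaped tensor, which is why an off-the-shelf vision backbone can consume it directly.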

#### 3.2 Spectrum model

```python
torchaudio.transforms.MelSpectrogram(
    32000,
    n_mels=512,
    f_min=0,
    f_max=16000,
    n_fft=2048 * 2,
    hop_length=512,
    normalized=True,
),
torchaudio.transforms.AmplitudeToDB(top_db=80)
```

These transforms convert the raw signal into a log-mel spectrogram.
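With these parameters, a 5-second clip at 32 kHz (160,000 samples) yields about 313 time frames. A quick check of that arithmetic (assuming torchaudio's default `center=True` padding):

```python
# Rough output-shape check for the mel spectrogram parameters above
# (assumes torchaudio's default center=True padding).
sample_rate = 32000
clip_seconds = 5
hop_length = 512
n_mels = 512

n_samples = sample_rate * clip_seconds   # 160000
n_frames = n_samples // hop_length + 1   # 313
print((n_mels, n_frames))                # spectrogram is n_mels x n_frames
```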

#### 3.3 Mix model

```python
raw_f = self.raw_model(x)
spec_f = self.spec_model(x)
feature = torch.cat([raw_f, spec_f], dim=1)
x = self.fc(feature)
# The spectrogram is resized to 256x256 for speed;
# the other mel-spectrogram parameters are the same as the spectrum model.
```

This code ensembles the raw and spectrum models by concatenating their features and passing them through a fully connected layer.
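A minimal numpy sketch of this late-fusion head; the per-model feature dimension (512 each) is an assumption for illustration, not a value from the write-up:

```python
import numpy as np

rng = np.random.default_rng(0)
bs, raw_dim, spec_dim, n_classes = 4, 512, 512, 182

raw_f = rng.standard_normal((bs, raw_dim))    # features from the raw-signal model
spec_f = rng.standard_normal((bs, spec_dim))  # features from the spectrum model
feature = np.concatenate([raw_f, spec_f], axis=1)  # bs x (raw_dim + spec_dim)

# Fully connected layer standing in for self.fc: weight (in, out) + bias.
W = rng.standard_normal((raw_dim + spec_dim, n_classes)) * 0.01
b = np.zeros(n_classes)
logits = feature @ W + b
print(logits.shape)  # (4, 182)
```

Concatenation keeps both feature views intact and lets the linear head learn how to weight them per class.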

# 4. Preprocess

• Example code
```python
# Peak-normalize only when the waveform exceeds 1
if wave.max() > 1:
    wave = wave / wave.max()
```

# 5. Augmentation

• XY cutout for the spectrum model
• Mixup with p=1 (applied to every training sample)

# 6. Train

### 6.1 Train/Infer

Randomly sample 5 seconds of audio for training; use the first 5 seconds for validation.
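A minimal sketch of that cropping scheme, assuming 32 kHz audio and zero-padding for clips shorter than 5 seconds (the padding strategy is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
SR = 32000       # BirdCLEF2024 sample rate
CLIP = 5 * SR    # 5-second window

def train_crop(wave):
    # Random 5-second crop for training; zero-pad clips that are too short.
    if len(wave) <= CLIP:
        return np.pad(wave, (0, CLIP - len(wave)))
    start = rng.integers(0, len(wave) - CLIP)
    return wave[start:start + CLIP]

def valid_crop(wave):
    # Deterministic first 5 seconds for validation.
    return np.pad(wave, (0, max(0, CLIP - len(wave))))[:CLIP]

print(train_crop(np.zeros(400000)).shape, valid_crop(np.zeros(100000)).shape)
```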

### 6.2 Loss

BCEWithLogitsLoss
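`BCEWithLogitsLoss` combines a sigmoid with binary cross-entropy in a numerically stable form, which suits the multi-label, per-class setup here. A minimal numpy sketch of that formula:

```python
import numpy as np

def bce_with_logits(logits, targets):
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|)).
    z = np.asarray(logits, dtype=float)
    y = np.asarray(targets, dtype=float)
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

# A zero logit against a positive target gives log(2).
print(round(bce_with_logits([0.0], [1.0]), 4))  # 0.6931
```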

# 7. Inference

Used OpenVINO for CPU inference.

# 8. Extra

The author wrote a discussion post about the method of feeding raw signals into vision models. It is interesting, so please take a look.

Intro of discussion: