【Kaggle】BirdCLEF2024 2nd place Solution explained


Hi, this time I will explain the 2nd place solution of BirdCLEF2024 to understand it more deeply. Let's enjoy it together!


0. Competition explanation

The purpose of this competition is to classify 182 types of bird calls.
The evaluation metric is ROC-AUC, so we had to predict probabilities for all bird classes.

The competition host provided over 240,000 samples of random-length audio recordings of bird calls as training data, plus unlabeled data that could be utilized for augmentation or other purposes.

Problems that had to be addressed included:
imbalanced data, probably noisy and mixed test data, segments where no bird is calling, etc.

An additional point to note is the submission limitation. This competition only allowed CPU inference within 120 minutes, so we had to both process the data and run inference in time.

1. Summary

  • Only the first 5 seconds of each recording are used for training
  • EfficientNet B0 backbone
  • Performance boost by using pseudo labels from target domains
  • Ensemble of 6 models
  • Model diversity through different Mel parameters, data subsets, image size and probabilities to add pseudo labels

2. First experiments

Starting this competition was very difficult: some improvements in CV didn't contribute to the LB score, and he wasn't able to get past 0.64 AUC for single models or 0.69 for ensembles.

When he was almost thinking of giving up, he restarted from scratch with the public notebook from @salmanahmedtamu, following methods discussed by @lihaoweicvch.

With this it was possible to get a single model performance of 0.65/0.66 AUC.

The original notebook uses:
  • Data: Only the first 5s of training files, and no extra files or classes from other sources
  • Input: Resized 3 channel Mel spec images of size 256x256
  • Backbone: eca_nfnet_l0
  • Mel Params:
    n_fft = 2048
    hop_length = 512
    n_mels = 128
    f_min = 20
    f_max = 16000
  • Training config:
    • CosineAnnealingLR scheduler with 5 warmup epochs
    • Peak learning rate 1e-4
    • 100 epochs with early stopping if AUC is not improving for 7 epochs
    • Batch size 64
    • Average of BCE and FocalLoss
    • GEM pooling (learnable pooling)
  • Augmentations:
    • HorizontalFlip
    • CoarseDropout
    • Mixup of Mel spec images within training batches
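The Mel-spectrogram input described above can be sketched in plain NumPy. This is only an illustration with a hand-rolled Mel filterbank — the original notebook presumably uses a library like torchaudio or librosa — but the parameter values are the ones listed above (the 32 kHz sample rate is the BirdCLEF audio standard):

```python
import numpy as np

SR = 32000                                   # BirdCLEF audio sample rate
N_FFT, HOP, N_MELS, F_MIN, F_MAX = 2048, 512, 128, 20, 16000

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, f_min, f_max):
    # Triangular filters spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return fb

def mel_spectrogram(wave):
    # Frame the waveform, window it, and take the power spectrum
    n_frames = 1 + (len(wave) - N_FFT) // HOP
    window = np.hanning(N_FFT)
    frames = np.stack([wave[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return mel_filterbank(SR, N_FFT, N_MELS, F_MIN, F_MAX) @ power.T

wave = np.random.randn(SR * 5)                        # first 5s of a recording
spec = 10 * np.log10(mel_spectrogram(wave) + 1e-10)   # log-Mel "image"
print(spec.shape)  # (128, 309)
```

The resulting single-channel log-Mel image would then be replicated to 3 channels and resized to 256x256 before being fed to the backbone.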

The public notebook had a long inference time (>1h), so he switched the model to EfficientNet (tf_efficientnet_b0_ns) and reduced the image sizes. This change made the LB score unstable (0.62-0.66) and sensitive to different combinations of Mel parameters and input sizes. But with further adjustments he was able to create models with inference times below 12 minutes, while still keeping a score around 0.65 AUC.

The main changes to the original notebook include:
  • Backbone: tf_efficientnet_b0_ns
  • 5 dropout layers before fc layer (inspired by BirdCLEF2023 4th place and BirdCLEF2021 2nd place)
  • Higher learning rate (1e-3), fewer warmup epochs (3) and fewer total epochs (50)
  • Different Mel parameters (n_mels, hop_length)
  • Additional augmentation: local and global time/frequency stretching performed on Mel spec images via resizing parts of the image and the entire image
  • Creating checkpoint soup instead of using early stopping
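The multi-dropout head mentioned above (5 dropout layers before the fc layer) averages the classifier output over several independent dropout masks. Here is a NumPy sketch of that idea — the dropout rate, function name, and averaging of logits are my assumptions; EfficientNet-B0's pooled feature width of 1280 and the 182 classes are from the competition setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_dropout_head(features, weight, bias, n_drop=5, p=0.5, train=True):
    """Average the fc-layer output over n_drop independent dropout masks."""
    if not train:
        return features @ weight.T + bias        # no dropout at inference
    outs = []
    for _ in range(n_drop):
        # Inverted dropout: zero out features, rescale survivors by 1/(1-p)
        mask = (rng.random(features.shape) >= p) / (1.0 - p)
        outs.append((features * mask) @ weight.T + bias)
    return np.mean(outs, axis=0)                 # average the 5 logit sets

feats = rng.standard_normal((4, 1280))           # EfficientNet-B0 pooled features
W = rng.standard_normal((182, 1280)) * 0.01      # 182 bird classes
b = np.zeros(182)
logits = multi_dropout_head(feats, W, b)
print(logits.shape)  # (4, 182)
```

Averaging over several dropout masks acts like a small in-model ensemble and tends to smooth the gradients of the final layer.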

Creating checkpoint soups follows the idea of model soups. But here, weights of the same model from different checkpoints (epochs 13-50) are averaged if they show an improvement in local CV score on one of the tracked metrics (LRAP, cMAP, F1, AUC). This led to more stable and sometimes even better LB scores.
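The greedy checkpoint-soup procedure could be sketched like this (a minimal sketch; the function names and the toy metric are my own — the real `score_fn` would compute local CV metrics such as AUC with the candidate weights):

```python
import numpy as np

def checkpoint_soup(checkpoints, score_fn):
    """Greedily average checkpoints: a checkpoint joins the soup only
    if the running average improves the tracked validation score."""
    soup, n, best = checkpoints[0], 1, score_fn(checkpoints[0])
    for ckpt in checkpoints[1:]:
        # Candidate soup = running average including this checkpoint
        candidate = {k: (soup[k] * n + ckpt[k]) / (n + 1) for k in soup}
        score = score_fn(candidate)
        if score > best:                 # keep it only on improvement
            soup, n, best = candidate, n + 1, score
    return soup

# Toy demo: 5 fake checkpoints, metric = mean weight close to zero
rng = np.random.default_rng(0)
ckpts = [{"w": rng.standard_normal(8)} for _ in range(5)]
score = lambda c: -abs(c["w"].mean())
soup = checkpoint_soup(ckpts, score)
print(score(soup) >= score(ckpts[0]))  # True — the soup never gets worse
```

Unlike early stopping, this keeps information from many late epochs instead of discarding everything after the single best one.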

With all the above-mentioned modifications, the author was able to create an ensemble of 6 models achieving 0.70 AUC on the public LB. Not great, but good enough to start playing around with pseudo labels created from the unlabeled dataset.

3. Performance boost with pseudo labels

Using pseudo labels from the test domain, and how they were handled during training, was the key element to move up to the top 10 on the leaderboard. Pseudo labels were created by applying the model ensemble with the best public LB score to the unlabeled data, producing predictions for each 5s interval of each file.

After that, the next stage's training samples are mixed with 5s of unlabeled data at a 25-45% chance. Before mixing, the amplitudes of both waveforms are multiplied by random coefficients.

The target vector (label array) of the training sample (1.0 at the positions of primary and secondary species, zeros elsewhere) is combined with the pseudo label (a vector of predicted probabilities) to form the new target vector by taking the element-wise maximum of both.
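The mixing and label-combination step could be sketched as follows (a NumPy sketch; the function name and exact mixing details are my assumptions, while the chance range and the 10**uniform(ampExpMin, ampExpMax) amplitude factor come from the write-up):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_with_pseudo(wave, target, pseudo_wave, pseudo_label,
                    chance=0.35, amp_exp=(-0.5, 0.1)):
    """With some chance, mix an unlabeled 5s clip into the training sample.
    New target = element-wise max of hard labels and pseudo probabilities."""
    if rng.random() >= chance:
        return wave, target
    a1 = 10 ** rng.uniform(*amp_exp)       # random amplitude factor (train)
    a2 = 10 ** rng.uniform(*amp_exp)       # random amplitude factor (pseudo)
    mixed = a1 * wave + a2 * pseudo_wave
    return mixed, np.maximum(target, pseudo_label)

target = np.zeros(182); target[[3, 17]] = 1.0   # primary + secondary species
pseudo = rng.random(182) * 0.9                  # ensemble probabilities
w, t = mix_with_pseudo(rng.standard_normal(160000), target,
                       rng.standard_normal(160000), pseudo, chance=1.0)
print(t[3])  # 1.0 — hard labels are never lowered by the max
```

Taking the maximum means the hard labels of the training sample are preserved while soft evidence from the pseudo-labeled clip is added on top.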

This method of using the pseudo-labeled test data combines several advantages:

  1. Noise augmentation: By mixing training samples with samples from the target domain the model learns how species sound within the environmental background noise of the test site habitat. This helps to address the domain shift between Xeno-Canto recordings and test soundscapes.
  2. Additional training data: The model gets more training samples representing the noise characteristics and species distribution of the target domain.
  3. Knowledge distillation: Since pseudo labels are derived from predictions of a stronger model (or ensemble of models in this case), its knowledge is transferred to the smaller model during training.

After integrating pseudo labels, the LB score increased significantly. The new ensemble was then used again to create a new set of pseudo labels. This cycle was repeated 3 times to iteratively improve model/ensemble performance. Progress on LB score (publ.|priv.) is shown in the following table:

This is very interesting: the ensemble model trains new models, and those become the next ensemble model. The improvement loop closes here.

At the 2nd stage, the author added some normalization to allow stable model training, because the label minimum values got too large. After the 3rd cycle, the ensemble didn't improve the LB score, so it wasn't picked.

The 2nd stage models' parameters are shown in the following table:

| Params. / Model ID | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| seed | 42 | 42 | 42 | 42 | 70 | 42 |
| n_folds | 5 | 5 | 5 | 5 | 10 | 5 |
| fold | 4 | 1 | 4 | 4 | 0 | 4 |
| dataset | bc24 | bc24 | bc24 | bc24 | bc24+ | bc24 |
| n_mels | 128 | 128 | 128 | 64 | 64 | 64 |
| hop_length | 512 | 512 | 1024 | 1024 | 1024 | 1024 |
| image_height | 256 | 256 | 128 | 64 | 64 | 64 |
| image_width | 256 | 256 | 128 | 128 | 128 | 64 |
| pseudoLabelChance | 35 % | 40 % | 45 % | 30 % | 30 % | 25 % |
| ampExpMin | -0.5 | -1.0 | -0.5 | -0.5 | -0.5 | -0.5 |
| ampExpMax | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 |
| Inference time | ~50 min. | ~50 min. | ~17 min. | ~12 min. | ~12 min. | ~11 min. |
| Public LB score | 0.73270 | 0.71975 | 0.71104 | 0.69936 | 0.69124 | 0.69309 |
| Private LB score | 0.68521 | 0.68533 | 0.68116 | 0.67445 | 0.64543 | 0.65862 |

Model diversity in the ensemble was realized through different Mel parameters, data subsets, image sizes, chances to add pseudo labels, and amplitude factors that change the volume relation between training and pseudo label data. The parameters ampExpMin and ampExpMax define the range for the random amplitude factor multiplied with training and pseudo label samples to change their volume in the mix: ampFactor = 10**(random.uniform(ampExpMin, ampExpMax))

Model 5 is the only one using external data. For this, additional files for the 182 competition species were downloaded from Xeno-Canto, and the first 5 seconds of each file were added to the training set (padded with zeros if too short).

4. Post processing

Models were ensembled by simply taking the mean of predictions (probabilities from sigmoid outputs) of each single model. As a last step, for each file, the predictions of a given window were summed with those of the two neighboring windows weighted by a factor of 0.5.
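The neighbor-window smoothing step can be sketched as follows (a NumPy sketch; the function name is my own):

```python
import numpy as np

def smooth_windows(preds):
    """Post-processing: each 5s window's predictions are summed with
    those of its two neighboring windows, weighted by 0.5.
    preds has shape (n_windows, n_classes)."""
    out = preds.copy()
    out[1:] += 0.5 * preds[:-1]    # add previous window
    out[:-1] += 0.5 * preds[1:]    # add next window
    return out

preds = np.array([[0.2], [0.8], [0.1]])   # 3 windows, 1 class
smoothed = smooth_windows(preds)
# smoothed.ravel() → 0.6, 0.95, 0.5 for the three windows
```

The intuition is that birds often call across window boundaries, so a confident neighboring window is evidence for the current one as well.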

This post-processing method was used by @theoviel and his team in the 3rd place solution of the Cornell Birdcall Identification competition.

I think the 4th place team of this competition also used this post-processing. It seems very useful.

5. Optimizations for inference

  • Parallel preprocessing of test audio files via multithreading
  • Precalculation of different versions of the Mel spec and reusing them for different models
  • Adding models using smaller image sizes as input
  • Setting a 2h timer to prevent submission timeout for larger ensembles
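A minimal sketch of the multithreaded preprocessing plus the safety timer (the file names, worker count, and `preprocess` body are hypothetical placeholders):

```python
import concurrent.futures
import time

DEADLINE = time.time() + 2 * 60 * 60   # 2h safety timer before the 120 min limit

def preprocess(path):
    # Placeholder: in the real pipeline this would load the audio and
    # precompute the Mel spec versions shared by the different models.
    return path, len(path)

files = [f"audio_{i}.ogg" for i in range(8)]   # hypothetical test files

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    # Stop consuming results once the deadline is reached
    results = [r for r in pool.map(preprocess, files)
               if time.time() < DEADLINE]
print(len(results))  # 8
```

Since audio decoding and FFTs release the GIL in most libraries, a thread pool gives real speedups here without the overhead of multiprocessing.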

6. What didn't work

Too many things. Unfortunately, it was not until much later in the competition that we found a good way to use pseudo labels. Looking at the submissions after the deadline, it is clear that many things that did not work at first can be very beneficial in improving private LB scores once pseudo labels are included. Hopefully we can continue these experiments at the next competition ...

Techniques used

  • pseudo labeling, with the average of BCE and FocalLoss as the loss
  • GEM pooling
  • ensemble with different Mel params (n_mels, hop_length)
  • 5 dropout layers before the fc layer
  • Checkpoint soup (instead of early stopping)
  • post-processing with neighboring windows weighted by 0.5

Points to note

  • Checkpoint soup (chap 2)
  • Reducing image sizes for faster inference (chap 2)
  • Ensemble improving cycle by pseudo labels (chap 3)
  • stable models via different Mel params and random amplitude coefficients (chap 3)


Pseudo labeling and the improving-ensemble cycle have a high impact on creating better models.
In addition, a number of other techniques were used to shrink and stabilize the models. Please check out the original solution.


[1] BirdCLEF2024 2nd place solution