
【Kaggle】Kaggle validation strategy ~Trust CV?~

Published 2024/09/03

1. What is this article

This article is my summary of an excellent slide deck, written for myself and for others.
The slides are insightful and packed with tips for earning Kaggle medals, so I recommend reading them if you haven't yet.

Now, let's start.

2. What type of competition is Kaggle?

It is (often) a competition in which participants use the given information (training data) to optimize a fixed evaluation metric over test data.

2.1 Details

In many cases, an effective approach is to set up a validation strategy that mimics the situation of the test data and then optimize the validation score
(using the validation score as a proxy for the test score).

On the other hand, there are exceptional competitions where validation is difficult, or where improving the validation score alone is not enough.

3. Basis of Validation

As a premise, it is important to prevent leakage.
K-fold style validation (KFold, StratifiedKFold, GroupKFold, StratifiedGroupKFold, etc.) is used in most cases, and we should choose the method that best reproduces the situation of the test data.
(More complex partitioning methods may sometimes be required.)

Based on this, it is probably best to think in the order trust CV+LB > trust CV > trust LB.

4. Summary

Basically, trusting CV is the most important principle.
Additionally, techniques that are robust to noise can help, such as OUSM (ignoring the few samples in each minibatch that have the largest loss) and mixup.
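The OUSM idea mentioned above can be sketched in a few lines: within each minibatch, drop the k samples with the largest loss before averaging, so that likely-mislabeled samples do not dominate the gradient. This is a plain NumPy sketch under my own assumptions; the function name `ousm_loss` and the choice of k are illustrative:

```python
# Sketch of OUSM-style loss truncation: discard the k largest per-sample
# losses in a minibatch before averaging, to reduce the influence of
# noisy labels. Plain NumPy; `ousm_loss` is a hypothetical helper name.
import numpy as np

def ousm_loss(per_sample_loss, k=2):
    """Average the per-sample losses after discarding the k largest ones."""
    losses = np.asarray(per_sample_loss, dtype=float)
    if k <= 0 or k >= losses.size:
        return losses.mean()
    keep = np.sort(losses)[: losses.size - k]  # drop the k biggest losses
    return keep.mean()

batch = [0.2, 0.1, 5.0, 0.3, 4.0]  # two outliers, plausibly noisy labels
print(ousm_loss(batch, k=2))       # averages only [0.1, 0.2, 0.3]
```

In a real training loop the same truncation would be applied to the per-sample loss tensor (e.g. with `reduction="none"` in the framework's loss function) before taking the mean.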

The Kaggler club's documents are very useful and instructive, so please read the originals listed in the references.

Reference

[1] Kaggleへの取り組み方 ~Validation編~ (How to approach Kaggle: Validation)
[2] Kaggle金メダル獲得戦略 (Strategies for winning a Kaggle gold medal)
