【Kaggle】Kaggle validation strategy ~Trust CV?~
1. What is this article
This article is my summary of this great slide, written for myself and others.
The slide is very insightful and contains many tips for earning a Kaggle medal, so I recommend reading it if you haven't yet.
Now, let's start.
2. What kind of competition is Kaggle?
It is (often) a competition in which participants use the given information (training data) to optimize a specified evaluation metric over the test data.
2.1 Details
In many cases, an effective approach is to set up a validation strategy that mimics the situation of the test data and to optimize the validation score
(using the validation score as a reference for the test score).
On the other hand, there are exceptional competitions in which validation is difficult, or in which improving the validation score alone is not enough.
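As a minimal sketch of "using the validation score as a reference for the test score", the following uses scikit-learn's cross_val_score on a hypothetical dataset (the dataset and model are my own stand-ins, not from the slide):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical dataset standing in for competition training data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The mean CV score is then used as a reference (proxy) for the test score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If this CV score moves in the same direction as the leaderboard score, the validation strategy is doing its job.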
3. Basis of Validation
As a premise, it is important to prevent leaks.
KFold-style validation (KFold, StratifiedKFold, GroupKFold, StratifiedGroupKFold, etc.) is used in many cases, and we should choose the method most likely to reproduce the situation of the test data.
(More complex partitioning methods may be required.)
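To illustrate why the choice of splitter matters for leaks, here is a small sketch (the toy data is my own, chosen so that rows of the same group, e.g. the same user, appear multiple times while the test data would contain only unseen groups):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical data: 12 rows from 4 groups of 3 rows each
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
X = np.arange(12).reshape(-1, 1)

# Plain KFold can put rows of the same group on both sides (a leak)
kf_leaky = any(
    set(groups[tr]) & set(groups[va])
    for tr, va in KFold(n_splits=3, shuffle=True, random_state=0).split(X)
)

# GroupKFold keeps each group entirely in train or entirely in validation
gkf_leaky = any(
    set(groups[tr]) & set(groups[va])
    for tr, va in GroupKFold(n_splits=3).split(X, groups=groups)
)
print(kf_leaky, gkf_leaky)  # True False
```

With fold size 4 and groups of size 3, plain KFold must split some group across train and validation, so its CV score is optimistically biased; GroupKFold avoids this by construction.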
Based on this, it is probably best to think in the order trust cv+lb > trust cv > trust lb.
4. Summary
Basically, trusting CV is the most important.
Additionally, methodologies such as OUSM (ignoring the few samples in each minibatch that have a very large loss) and mixup are robust against noise.
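As an illustrative sketch of the OUSM idea described above (drop the k largest per-sample losses in a minibatch as likely label noise, then average the rest), assuming per-sample losses have already been computed; this is my own simplified version, not the original implementation:

```python
import numpy as np

def ousm_loss(per_sample_losses, k=2):
    """OUSM-style loss: discard the k samples with the largest loss in the
    minibatch (treated as likely noisy labels) and average the remainder."""
    losses = np.asarray(per_sample_losses, dtype=float)
    if k <= 0 or k >= len(losses):
        return losses.mean()
    kept = np.sort(losses)[: len(losses) - k]  # drop the k largest losses
    return kept.mean()

# Two outliers (5.0, 7.5) are likely noisy labels and get ignored
batch = [0.2, 0.3, 0.1, 5.0, 0.4, 7.5]
print(ousm_loss(batch, k=2))  # mean of [0.1, 0.2, 0.3, 0.4] = 0.25
```

In a real training loop, the same top-k trimming would be applied to the per-sample loss tensor before backpropagation.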
The documents of the Kaggler club are very useful and instructive, so please read the original one in the references.