<ul>
<li>
<a href="https://lmsys.org/blog/2023-06-22-leaderboard/" target="_blank" rel="nofollow noopener noreferrer">Chatbot Arena Leaderboard</a><br>
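A benchmark based on <a href="https://arxiv.org/abs/2306.05685" target="_blank" rel="nofollow noopener noreferrer">"Judging LLM-as-a-judge with MT-Bench and Chatbot Arena"</a>. The most reliable of the leaderboards listed here. It is run by LMSYS, the same organization behind the Vicuna models.</li>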
<li>
<a href="https://github.com/FranxYao/chain-of-thought-hub/blob/main/MMLU/readme.md" target="_blank" rel="nofollow noopener noreferrer">https://github.com/FranxYao/chain-of-thought-hub/blob/main/MMLU/readme.md</a><br>
Re-runs the evaluation of each model independently. The results show that LLaMA-33B and LLaMA-65B score higher than Falcon-40B.</li>
<li>
<a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank" rel="nofollow noopener noreferrer">Hugging Face open_llm_leaderboard</a><br>
Not reliable. falcon-40b-instruct currently appears to rank highest, but <a href="https://github.com/artidoro/qlora/issues/138#issuecomment-1585757813" target="_blank" rel="nofollow noopener noreferrer">LLaMA-65B is actually better</a>
</li>
</ul>


LLM Benchmark Roundup
