
Information on LLM Benchmarks and Evaluation

Published 2023/12/10

These are rough notes for now, focused mainly on Japanese-language resources.

I will add more over time.

Leaderboards

JGLUE

https://wandb.ai/wandb/LLM_evaluation_Japan/reports/LLM-JGLUE---Vmlldzo0NTUzMDE2?accessToken=u1ttt89al8oo5p5j12eq3nldxh0378os9qjjh14ha1yg88nvs5irmuao044b6eqa

https://github.com/yahoojapan/JGLUE

Heron VLM

https://wandb.ai/vision-language-leaderboard/heron-leaderboard/reports/Heron-VLM-powered-by-nejumi-WandB--Vmlldzo3ODM4ODYw?accessToken=pe0az0wpcay7g7u3eg8cgtu0xkbvhfkcr920f5gktffvz2zd3yra233j1et7c5ov

Open Japanese LLM Leaderboard

https://gigazine.net/news/20241126-open-japanese-llm-leaderboard/

Chatbot Arena

A system in which human users compare the outputs of two different LLMs online and vote on which one wins.

https://chat.lmsys.org/
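
As a rough illustration of how pairwise human votes like these can be turned into a ranking, here is a minimal Elo-style sketch in Python. It is an assumption for illustration only and is not taken from Chatbot Arena's own code: the function names, the initial rating of 1000, the K-factor of 32, and the model names are all hypothetical.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rate(votes, initial: float = 1000.0, k: float = 32.0) -> dict:
    """Replay pairwise votes in order and return the final ratings.

    Each vote is (model_a, model_b, outcome), where outcome is
    1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    The initial rating and K-factor are illustrative assumptions.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in votes:
        # Compute the expectation before touching either rating.
        exp_a = expected_score(ratings[model_a], ratings[model_b])
        delta = k * (outcome - exp_a)
        ratings[model_a] += delta  # the winner gains ...
        ratings[model_b] -= delta  # ... exactly what the loser gives up
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical votes between two made-up models.
    votes = [
        ("model-x", "model-y", 1.0),
        ("model-x", "model-y", 0.5),
        ("model-y", "model-x", 1.0),
    ]
    for name, rating in sorted(rate(votes).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Because each update moves the two ratings by equal and opposite amounts, the total rating is conserved. The result depends on the order in which votes are replayed, which is one reason a leaderboard might prefer a different estimator.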

Comparison of closed and open LLMs
https://huggingface.co/spaces/andrewrreed/closed-vs-open-arena-elo

LLMs that are strong in Japanese
https://huggingface.co/spaces/yutohub/japanese-chatbot-arena-leaderboard

Benchmarking software

https://github.com/microsoft/promptbench

https://github.com/llm-jp/llm-jp-eval

https://yuzuai.jp/blog/rakuda

https://github.com/explodinggradients/ragas

https://github.com/run-llama/llama_index/blob/main/docs/examples/evaluation/prometheus_evaluation.ipynb

https://github.com/VILA-Lab/ATLAS

https://github.com/gkamradt/LLMTest_NeedleInAHaystack

https://github.com/elith-co-jp/langdechat

https://tech.algomatic.jp/entry/2024/04/10/183001

https://github.com/openai/simple-evals

https://github.com/anthropics/courses/tree/master/prompt_evaluations

Benchmarking in practice

https://qiita.com/wayama_ryousuke/items/a58791cdc2a05847824d

https://qiita.com/wayama_ryousuke/items/105a164e5c80c150caf1

https://www.sbintuitions.co.jp/blog/entry/2024/05/16/130848

LLM as a Judge

https://eugeneyan.com/writing/llm-evaluators/

Reference links

https://note.com/wandb_jp/n/n2464e3d85c1a

https://github.com/yuzu-ai/japanese-llm-ranking

https://note.com/npaka/n/n0530f6f9123f

https://note.com/shi3zblog/n/n03bdb67370aa

https://wandb.connpass.com/event/300670/presentation/

https://note.com/shi3zblog/n/n6b2ac5874021

https://nikkie-ftnext.hatenablog.com/entry/lm-evaluation-harness-open-calm-7b-jcommonsenseqa

https://drive.google.com/file/d/1nQlHckrkCag-_hHrMc_5jGsnY9-keBJc/view

https://note.com/npaka/n/n44252e28e70a

https://www.bioerrorlog.work/entry/langcheck-llm-evaluation

https://www.bioerrorlog.work/entry/llm-model-based-eval-openai-practice

https://acro-engineer.hatenablog.com/entry/2023/11/29/000000

https://github.com/llm-jp/awesome-japanese-llm

https://docs.google.com/presentation/d/1MaIQi-AANQCh3TgACtx10eBwViamth-Y/

https://docs.google.com/presentation/d/1EMd6qcJg1yDdyopbvSIp-TkMeDfLQy1T/

https://www.docswell.com/s/DeepLearning2023/538DRY-2023-12-22-105000

https://zenn.dev/turing_motors/articles/8e913f46374ede

https://lifearchitect.ai/models/

https://zenn.dev/seya/articles/dd0010601b3136

https://github.com/SingularitySociety/WorldModels

https://tech.layerx.co.jp/entry/2024/11/18/151901

https://llm-jp.github.io/awesome-japanese-llm/

Summaries

https://note.com/npaka/n/ndec10f78fe2f

https://zenn.dev/pakas/articles/80f797b0c3ae1e

https://qiita.com/s-nagase/items/2baced05d9db8efcf073

https://medium.com/data-science-at-microsoft/evaluating-llm-systems-metrics-challenges-and-best-practices-664ac25be7e5

https://www.brainpad.co.jp/doors/contents/01_apply_generative_ai_to_business/

https://speakerdeck.com/rishigami/llmping-jia-noluo-tosixue-kai-fa-zhe-mu-xian-deqi-wotukerupointo

https://speakerdeck.com/asei/sheng-cheng-ai-noping-jia-fang-fa
