🤖

Notes on sgemm (single-precision matmul) performance

Published 2024/10/10

LLM

sgemm

matmul

tech

Background

In LLMs (attention), the performance of gemm (general matrix multiplication) is important.

sgemm = single-precision gemm

(Incidentally, the double-precision version is dgemm.)
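For reference, TFLOPS figures for gemm are conventionally derived from the operation count 2·M·N·K for an M×K by K×N product. The following is a minimal sketch of that conversion; the function names and the 10 ms timing are illustrative assumptions, not measurements from this memo.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOP count for C[m,n] = A[m,k] @ B[k,n].

    Each of the m*n outputs takes k multiplies and (by convention) k adds.
    """
    return 2 * m * n * k


def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOPS for one gemm call that took `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e12


# Hypothetical timing: a 4096 x 4096 x 4096 sgemm that ran in 10 ms.
print(f"{gemm_tflops(4096, 4096, 4096, 0.010):.2f} TFLOPS")
```

In a real benchmark the timing would come from averaging many calls after a warm-up run.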

A100 40GB

https://x.com/syoyo/status/1844069762059599939
https://siboehm.com/articles/22/CUDA-MMM
Memory bandwidth is 1.55 TB/s.

cuBLAS reaches about 15 TFLOPS, and a heavily optimized kernel about 20 TFLOPS (roughly the A100's FP32 peak).
Incidentally, the 4090 (Ada) and the H100 apparently reach around 50 TFLOPS.
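To put the A100 numbers in context, the theoretical FP32 peak can be estimated from the hardware specs. This is a rough sketch: the SM count, FP32 lanes per SM, and boost clock are assumed from NVIDIA's published A100 specifications and do not come from the measurements in this memo.

```python
def peak_fp32_tflops(sm_count: int, fp32_lanes_per_sm: int, clock_ghz: float) -> float:
    """Theoretical peak, assuming every FP32 lane retires one FMA (2 FLOPs) per cycle."""
    return sm_count * fp32_lanes_per_sm * 2 * clock_ghz / 1e3


# Assumed A100 specs: 108 SMs, 64 FP32 lanes per SM, 1.41 GHz boost clock.
A100_PEAK = peak_fp32_tflops(108, 64, 1.41)

# Measured values are the rounded figures quoted in the text.
for label, measured in [("cuBLAS", 15.0), ("heavily optimized", 20.0)]:
    print(f"{label}: {measured:.0f} TFLOPS, {measured / A100_PEAK:.0%} of the {A100_PEAK:.1f} TFLOPS peak")
```

Because the measured figures are rounded, the optimized case can come out slightly above 100%.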

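CPU

https://x.com/syoyo/status/1844060880901906693
As a rough rule of thumb, sgemm throughput seems to be about 10x the memory bandwidth (for example, 200 GB/s of bandwidth corresponds to roughly 2 TFLOPS).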

M1 Pro (200 GB/s): 2 TFLOPS.

Ryzen7, 8 cores + DDR5 at (presumably) 100 GB/s: 1 to 1.4 TFLOPS.

(A Ryzen9 with 16 cores + 4-channel DDR5 would presumably reach 2 TFLOPS.)
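The rule of thumb above can be written down as a small estimator. This is only a sketch of the empirical "bandwidth x 10" observation in this memo, not a performance model; the function name and the `flops_per_byte` parameter are hypothetical.

```python
def estimate_sgemm_tflops(mem_bandwidth_gb_s: float, flops_per_byte: float = 10.0) -> float:
    """Rule-of-thumb sgemm throughput: memory bandwidth (GB/s) times ~10 FLOPs per byte."""
    return mem_bandwidth_gb_s * flops_per_byte / 1e3


# Bandwidth figures as quoted in the text (the Ryzen7 value is itself a guess there).
for name, bandwidth_gb_s in [("M1 Pro", 200), ("Ryzen7 + DDR5", 100)]:
    print(f"{name}: ~{estimate_sgemm_tflops(bandwidth_gb_s):.1f} TFLOPS estimated")
```

The estimate lands on the M1 Pro figure and at the low end of the Ryzen7 range, which is about as much precision as a rule like this can offer.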

Fugaku (A64FX)

About 1.6 TFLOPS with 12 cores @ 2.2 GHz.
https://x.com/syoyo/status/1844065547748851987
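Memory bandwidth is 256 GB/s.

This one is probably bottlenecked by the compute units: the result is roughly the same as the peak FP32 performance of 12 cores on Fugaku.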
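The compute-bound reading can be sanity-checked with a simple roofline-style estimate. This is a sketch under stated assumptions: two 512-bit FMA pipelines per A64FX core (from Fujitsu's published microarchitecture, not from this memo), a hypothetical matrix size of 4096, and a best-case traffic model in which each matrix crosses the memory bus exactly once.

```python
def peak_fp32_gflops(cores: int, clock_ghz: float, fma_pipes: int, simd_bits: int) -> float:
    """Theoretical FP32 peak: cores x clock x FMA pipes x 32-bit lanes x 2 FLOPs per FMA."""
    lanes = simd_bits // 32
    return cores * clock_ghz * fma_pipes * lanes * 2


def square_sgemm_intensity(n: int) -> float:
    """Best-case FLOPs per byte: 2n^3 FLOPs over three n x n float32 matrices moved once."""
    return (2 * n**3) / (3 * n * n * 4)


peak = peak_fp32_gflops(cores=12, clock_ghz=2.2, fma_pipes=2, simd_bits=512)
bandwidth_gb_s = 256                      # figure quoted in the text
n = 4096                                  # hypothetical matrix size
memory_roof = bandwidth_gb_s * square_sgemm_intensity(n)

bound = "compute" if peak <= memory_roof else "memory"
print(f"peak {peak:.1f} GFLOPS, memory roof {memory_roof:.0f} GFLOPS -> {bound}-bound")
print(f"measured 1.6 TFLOPS is {1600 / peak:.0%} of peak")
```

Real kernels move more data than this best case, but the memory roof sits so far above the compute peak that the conclusion is unlikely to change.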
