RustのAVX512対応

SIMD系はこのリポジトリで行なわれている。

特に、AVX512はこのissueでトラックされているが近年大きな動きはない。

実装状況はここにあるが、 i1 型がないと実装できない命令が残っているのでマージされていないよう。

↑はAVX512fの話だったが、AVX512はいくつかのセットに分かれる。

The AVX-512 instruction set consists of several separate sets each having their own unique CPUID feature bit; however, they are typically grouped by the processor generation that implements them.

F, CD, ER, PF

Introduced with Xeon Phi x200 (Knights Landing) and Xeon Gold/Platinum (Skylake SP "Purley"), with the last two (ER and PF) being specific to Knights Landing.

AVX-512 Foundation (F) – expands most 32-bit and 64-bit based AVX instructions with the EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control, implemented by Knights Landing and Skylake Xeon

AVX-512 Conflict Detection Instructions (CD) – efficient conflict detection to allow more loops to be vectorized, implemented by Knights Landing[1] and Skylake X

AVX-512 Exponential and Reciprocal Instructions (ER) – exponential and reciprocal operations > designed to help implement transcendental operations, implemented by Knights Landing[1]

AVX-512 Prefetch Instructions (PF) – new prefetch capabilities, implemented by Knights Landing[1]

VL, DQ, BW

Introduced with Skylake X and Cannon Lake.

AVX-512 Vector Length Extensions (VL) – extends most AVX-512 operations to also operate on XMM (128-bit) and YMM (256-bit) registers[3]

AVX-512 Doubleword and Quadword Instructions (DQ) – adds new 32-bit and 64-bit AVX-512 instructions[3]

AVX-512 Byte and Word Instructions (BW) – extends AVX-512 to cover 8-bit and 16-bit integer operations[3]

IFMA, VBMI

Introduced with Cannon Lake.[4]

AVX-512 Integer Fused Multiply Add (IFMA) – fused multiply add of integers using 52-bit precision.

AVX-512 Vector Byte Manipulation Instructions (VBMI) adds vector byte permutation instructions which were not present in AVX-512BW.

4VNNIW, 4FMAPS

Introduced with Knights Mill.[5][6]

AVX-512 Vector Neural Network Instructions Word variable precision (4VNNIW) – vector instructions for deep learning, enhanced word, variable precision.

AVX-512 Fused Multiply Accumulation Packed Single precision (4FMAPS) – vector instructions for deep learning, floating point, single precision.

VPOPCNTDQ

Vector population count instruction. Introduced with Knights Mill and Ice Lake.[7]

VNNI, VBMI2, BITALG

Introduced with Ice Lake.[7]

AVX-512 Vector Neural Network Instructions (VNNI) – vector instructions for deep learning.

AVX-512 Vector Byte Manipulation Instructions 2 (VBMI2) – byte/word load, store and concatenation with shift.

AVX-512 Bit Algorithms (BITALG) – byte/word bit manipulation instructions expanding VPOPCNTDQ.

VP2INTERSECT

Introduced with Tiger Lake.

AVX-512 Vector Pair Intersection to a Pair of Mask Registers (VP2INTERSECT).

GFNI, VPCLMULQDQ, VAES

Introduced with Ice Lake.[7]
These are not AVX-512 features per se. Together with AVX-512, they enable EVEX encoded versions of GFNI, PCLMULQDQ and AES instructions.

https://en.wikipedia.org/wiki/AVX-512

κeen

この中でもAVX512DQを使いたい。なぜなら https://zenn.dev/herumi/articles/fast-exp-by-avx512 を真似してみたいから。

DQの実装状況はこのPRにある。が、やりかけのまま進んでいない。

κeen

ところでインラインアセンブラとかを使っていないことに気付く。Rustの制約によるものかと思いきや、clangも命令と等価なコードを書いて最適化に任せる方針らしい。

これで上手くいくのかなと思って適当なintrinsic書いてCompiler Explorerに投げてみたら違う命令吐かれてた…
https://godbolt.org/z/a71PWhdbj

#include <mmintrin.h>
#include <immintrin.h>

void
test()
{
    __m512 x = _mm512_set1_pd(1.0);
    __m512 y = _mm512_set1_pd(2.0);
    __m512 z = _mm512_andnot_pd(x, y);
}

test:                                   # @test
        push    rbp
        mov     rbp, rsp
        and     rsp, -64
        sub     rsp, 640
        movabs  rax, 4607182418800017408
        mov     qword ptr [rsp + 440], rax
        vbroadcastsd    zmm0, qword ptr [rsp + 440]
        vmovapd zmmword ptr [rsp + 320], zmm0
        vmovapd zmm0, zmmword ptr [rsp + 320]
        vmovapd zmmword ptr [rsp + 128], zmm0
        movabs  rax, 4611686018427387904
        mov     qword ptr [rsp + 312], rax
        vbroadcastsd    zmm0, qword ptr [rsp + 312]
        vmovapd zmmword ptr [rsp + 192], zmm0
        vmovapd zmm0, zmmword ptr [rsp + 192]
        vmovapd zmmword ptr [rsp + 64], zmm0
        vmovaps zmm1, zmmword ptr [rsp + 128]
        vmovaps zmm0, zmmword ptr [rsp + 64]
        vmovaps zmmword ptr [rsp + 512], zmm1
        vmovaps zmmword ptr [rsp + 448], zmm0
        vmovdqa64       zmm0, zmmword ptr [rsp + 512]
        vpternlogq      zmm0, zmm0, zmm0, 15
        vmovapd zmm1, zmmword ptr [rsp + 448]
        vpandq  zmm0, zmm0, zmm1
        vmovaps zmmword ptr [rsp], zmm0
        mov     rsp, rbp
        pop     rbp
        vzeroupper
        ret

インテルのドキュメントによると _mm512_andnot_pd は vandnpd に対応するはずだがその命令は吐かれていない。

κeen

そもそもLLVMがAVX512dqの命令をサポートしていない？

κeen

昔builtinがあったが、最適化に任せるように変化していったらしい

κeen

Clangレベルでビルトインがないのか、LLVMレベルでないのかはテストを見るとよさそうな気がする

が、ここにあるビルトインを生成させてみてもリンクエラーになるのでよくわからん

κeen

いや、なんか頑張って探すとエラーにならない名前がある。例えば llvm.x86.avx512.mask.broadcastf32x8だったら llvm.x86.avx512.mask.broadcastf32x8.512 が使える。引数とかはgit grepで探す感じで

κeen

そしてビルトインを使ったからといって目的の命令が吐かれる訳ではない

κeen

えー、

intrinsicを実装しても目的のコードが吐かれる訳ではない
同等のコードを書いたら最適化がなんやかんややってくれる(こともある)のでintrinsicにこだわる必要はない

ってとこ？

κeen

アセンブラ、GCCでも一応やってみたらちゃんと vandnpd が出てた。

test:
        push    rbp
        mov     rbp, rsp
        and     rsp, -64
        sub     rsp, 264
        vmovsd  xmm0, QWORD PTR .LC0[rip]
        vmovsd  QWORD PTR [rsp-72], xmm0
        vbroadcastsd    zmm0, QWORD PTR [rsp-72]
        vmovapd ZMMWORD PTR [rsp+200], zmm0
        vmovsd  xmm0, QWORD PTR .LC1[rip]
        vmovsd  QWORD PTR [rsp-64], xmm0
        vbroadcastsd    zmm0, QWORD PTR [rsp-64]
        vmovapd ZMMWORD PTR [rsp+136], zmm0
        vmovapd zmm0, ZMMWORD PTR [rsp+200]
        vmovapd ZMMWORD PTR [rsp+8], zmm0
        vmovapd zmm0, ZMMWORD PTR [rsp+136]
        vmovapd ZMMWORD PTR [rsp-56], zmm0
        vxorpd  xmm0, xmm0, xmm0
        vmovapd zmm1, ZMMWORD PTR [rsp+8]
        vmovapd zmm2, ZMMWORD PTR [rsp-56]
        mov     eax, -1
        kmovb   k1, eax
        vandnpd zmm0{k1}, zmm1, zmm2
        nop
        vmovapd ZMMWORD PTR [rsp+72], zmm0
        nop
        leave
        ret
.LC0:
        .long   0
        .long   1072693248
.LC1:
        .long   0
        .long   1073741824

https://godbolt.org/z/YM4sErTnc

κeen

設定をちょっと変えてみた

#include <mmintrin.h>
#include <immintrin.h>

__m512d
test(__m512d x, __m512d y)
{
    __m512d z = _mm512_andnot_pd(x, y);
    return z;
}

__m512
test2(__m128 x)
{
    return _mm512_broadcast_f32x2(x);
}

__m512d
test3(__m128d x)
{
    return _mm512_broadcast_f64x2(x);
}

Clang

test:                                   # @test
        vandnps zmm0, zmm0, zmm1
        ret
test2:                                  # @test2
        vbroadcastsd    zmm0, xmm0
        ret
test3:                                  # @test3
        vshuff64x2      zmm0, zmm0, zmm0, 0     # zmm0 = zmm0[0,1,0,1,0,1,0,1]
        ret

GCC

test:
        vandnpd zmm0, zmm0, zmm1
        ret
test2:
        vbroadcastf32x2 zmm0, xmm0
        ret
test3:
        vshuff64x2      zmm0, zmm0, zmm0, 0x0
        ret

https://godbolt.org/z/MdYEcG773

clangは全部違う命令吐いてるけどGCCは2つは意図どおりの命令だ。

κeen

急にRust 1.89.0でAVX 512の命令が増えた

Announcing Rust 1.89.0 | Rust Blog