C# optimization: investigating method calls

akeit0

Rather than writing an article right away, I'll add things to this scrap as I think of them.
I'll organize it later.

akeit0

What I was planning to write overlaps quite a bit with this, but I'll also go through benchmarks and the generated assembly.

akeit0

The code and benchmark results are here.
They are in the public domain.

akeit0

Method call benchmarks

links

Code

Results

Assembly

With that, here is a brief walkthrough.

For syntax highlighting, it is probably easier to read this scrap with the GitHub side open alongside it.

akeit0

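Background

Some per-method JIT settings (MethodImplOptions) are used.
NoInlining prevents the method from being inlined.
AggressiveOptimization makes the JIT fully optimize the method from its first compilation, so it skips tiered compilation and the profile data collected there.
Together, these two allow fine-grained control over optimization.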
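
As a minimal sketch of how these two flags are applied, the following attaches them to small methods. The class name `JitHintsDemo` and the methods `CallOptimizedUpFront` and `Sum` are made up for this illustration; only `Compare` mirrors the benchmark code.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical demo class; only the attributes themselves come from the scrap.
public static class JitHintsDemo
{
    // NoInlining: the JIT has to emit a real call instead of inlining the body.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static bool Compare(int x, int y) => x == y;

    // AggressiveOptimization: optimized on first compilation, no tiering.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static bool CallOptimizedUpFront() => Compare(1, 2);

    // The flags can be combined with a bitwise OR.
    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
    public static int Sum(int a, int b) => a + b;

    public static void Main()
    {
        Console.WriteLine(CallOptimizedUpFront()); // False
        Console.WriteLine(Sum(1, 2));              // 3
    }
}
```

The attributes only change how the JIT compiles the methods; the results are the same as without them.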

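We will look at Intel x64 code, written in Intel-syntax assembly.
I won't go deep into assembly itself.
Intel syntax: [destination operand] <- [source operand]
It is enough to know that registers are storage locations the CPU can use directly.

mov rax, 1

assigns 1 to rax.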

akeit0

Simple call (0.3313 ns / 1.00)

[MethodImpl(MethodImplOptions.NoInlining)]
static bool Compare(int x, int y) => x == y;

public bool CallDirect()
{
    return Compare(1, 2);
}

; MethodCallBenchmarks.CallTypeBenchmark.CallDirect()
       mov       ecx,1
       mov       edx,2
       jmp       qword ptr [7FFEFFFEE118]; MethodCallBenchmarks.CallTypeBenchmark.Compare(Int32, Int32)
; Total bytes of code 16

; MethodCallBenchmarks.CallTypeBenchmark.Compare(Int32, Int32)
       cmp       ecx,edx
       sete      al
       movzx     eax,al
       ret
; Total bytes of code 9

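It simply puts the arguments in ecx and edx, then jumps to a function that compares them and returns a bool.
That's all.
This is the baseline for everything that follows.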

akeit0

static readonly interface (0.0131 ns / 0.040)
This is out of order, but I'll cover it first.

public bool CallInterfaceStaticReadOnly() => EqualityComparer<int>.Default.Equals(1, 2);

; MethodCallBenchmarks.CallTypeBenchmark.CallInterfaceStaticReadOnly()
       xor       eax,eax
       ret
; Total bytes of code 3

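Very short assembly.
xor yields 0 at every bit position where the two operand bits are equal, so xor-ing a register with itself always gives 0.
xor reg, reg is the standard idiom for zeroing a register quickly on x64.
So this assembly does nothing but return false.
When the interface instance is static readonly, the .NET JIT knows exactly what will be called, so it can turn the call into a direct call.
On top of that, if inlining leaves a constant result, it can simply return the constant.
Fastest.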
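
To make the xor remark concrete, here is a small sketch showing that xor-ing a value with itself gives 0 regardless of the input. The class `XorDemo` and method `XorWithSelf` are names invented for this example.

```csharp
using System;

// Hypothetical demo: the C# ^ operator is the same bitwise XOR as the xor instruction.
public static class XorDemo
{
    // Every bit is XOR-ed with an identical bit, so every result bit is 0.
    public static int XorWithSelf(int value) => value ^ value;

    public static void Main()
    {
        foreach (int value in new[] { 0, 1, -1, 12345, int.MinValue })
        {
            Console.WriteLine(XorWithSelf(value)); // 0 each time
        }
    }
}
```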

akeit0

static interface(0.0992 ns/0.299)

static IEqualityComparer<int> comparer = EqualityComparer<int>.Default;
[Benchmark]
public bool CallInterface() => comparer.Equals(1, 2);

; MethodCallBenchmarks.CallTypeBenchmark.CallInterface()
       push      rbx
       sub       rsp,20
       mov       rax,2336EC003F8
       mov       rcx,[rax]
       mov       rax,offset MT_System.Collections.Generic.GenericEqualityComparer`1[[System.Int32, System.Private.CoreLib]]
       cmp       [rcx],rax
       jne       short M00_L01
       xor       ebx,ebx
M00_L00:
       movzx     eax,bl
       add       rsp,20
       pop       rbx
       ret
M00_L01:
       mov       r11,7FFEFFB90500
       mov       edx,1
       mov       r8d,2
       call      qword ptr [r11]
       mov       ebx,eax
       jmp       short M00_L00
; Total bytes of code 72

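[rax] reads the value stored at the address held in rax.
The JIT is smart: if the same concrete type always shows up behind the interface, it starts special-casing that type.
In .NET 9, an object stores a pointer to its type information (TypeHandle) at the start of the object.
Comparing that pointer is therefore a cheap way to check the type.
If the guess is right, it puts 0 in the return value, restores the stack, and is done.
If the guess is wrong, it falls back to a virtual call through the interface.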
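
The guess-then-fallback shape of that assembly can be written by hand in C#. This is only a sketch of the idea, not what the JIT literally emits: the comparer types `IntComparer` and `AlwaysEqualComparer` and the method `EqualsGuarded` are hypothetical, and the type check uses `GetType()` where the JIT compares the type pointer directly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer types used only for this illustration.
public sealed class IntComparer : IEqualityComparer<int>
{
    public bool Equals(int x, int y) => x == y;
    public int GetHashCode(int value) => value;
}

public sealed class AlwaysEqualComparer : IEqualityComparer<int>
{
    public bool Equals(int x, int y) => true;
    public int GetHashCode(int value) => 0;
}

public static class GuardedCallSketch
{
    public static bool EqualsGuarded(IEqualityComparer<int> comparer, int x, int y)
    {
        // Guess: the object is the type seen so far. One cheap type comparison.
        if (comparer.GetType() == typeof(IntComparer))
        {
            return x == y; // body of IntComparer.Equals, inlined by hand
        }

        // Wrong guess: ordinary interface dispatch.
        return comparer.Equals(x, y);
    }

    public static void Main()
    {
        Console.WriteLine(EqualsGuarded(new IntComparer(), 1, 2));         // False (fast path)
        Console.WriteLine(EqualsGuarded(new AlwaysEqualComparer(), 1, 2)); // True (fallback)
    }
}
```

With constant arguments such as 1 and 2, the fast path collapses to a constant, which matches the `xor ebx,ebx` in the listing above.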

akeit0

Pre-optimized static interface (1.1049 ns / 3.335)

static IEqualityComparer<int> comparer = EqualityComparer<int>.Default;
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public bool CallInterfaceNoTiered() => comparer.Equals(1, 2);

; MethodCallBenchmarks.CallTypeBenchmark.CallInterfaceNoTiered()
       sub       rsp,28
       test      byte ptr [7FFEFFEC8AB0],1
       je        short M00_L01
M00_L00:
       mov       rcx,1AAFEC003F8
       mov       rcx,[rcx]
       mov       r11,7FFEFFB904A0
       mov       edx,1
       mov       r8d,2
       cmp       [rcx],ecx
       add       rsp,28
       jmp       qword ptr [r11]
M00_L01:
       mov       rcx,offset MT_MethodCallBenchmarks.CallTypeBenchmark
       call      CORINFO_HELP_GET_NONGCSTATIC_BASE
       jmp       short M00_L00
; Total bytes of code 73

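An unfamiliar branch appears here; it checks whether the class's static fields have been initialized.
If they have not, it passes the type to CORINFO_HELP_GET_NONGCSTATIC_BASE to initialize them.
The rest is a virtual call.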
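
Conceptually, the check at the top of that method behaves like a hand-written "initialize once" guard. The sketch below is an analogy under that assumption, not the runtime's actual mechanism: `StaticInitSketch`, `EnsureInitialized`, and `InitCount` are invented names, and unlike the runtime this version is not thread-safe.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical analogy for the "are the statics initialized?" branch.
public static class StaticInitSketch
{
    private static bool s_initialized;
    private static IEqualityComparer<int>? s_comparer;

    // Counts how often the slow path ran, for demonstration only.
    public static int InitCount { get; private set; }

    private static void EnsureInitialized()
    {
        if (s_initialized) return; // fast path: one flag test, like `test byte ptr [...],1`

        // Slow path: stands in for the helper call that runs the initializers.
        s_comparer = EqualityComparer<int>.Default;
        InitCount++;
        s_initialized = true;
    }

    public static bool CallInterface()
    {
        EnsureInitialized();
        return s_comparer!.Equals(1, 2); // interface call on the loaded field
    }

    public static void Main()
    {
        Console.WriteLine(CallInterface()); // False
        Console.WriteLine(CallInterface()); // False
        Console.WriteLine(InitCount);       // 1
    }
}
```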

akeit0

static readonly delegate (0.0000 ns / 0.0)
static delegate (0.0000 ns / 0.0)

static Func<int, int, bool> comparerStaticFunc = Compare;
static readonly Func<int, int, bool> comparerStaticFuncStaticReadOnly = Compare;
[Benchmark]
public bool CallDelegateStaticReadOnly() => comparerFuncStaticReadOnly(1, 2);

[Benchmark]
public bool CallDelegate() => comparerFunc(1, 2);

; MethodCallBenchmarks.CallTypeBenchmark.CallDelegateStaticReadOnly()
       mov       rax,1F60B400400
       mov       r10,[rax]
       mov       rax,7FFEFFFBA130
       cmp       [r10+18],rax
       jne       short M00_L00
       xor       eax,eax
       ret
M00_L00:
       mov       edx,1
       mov       r8d,2
       mov       rcx,[r10+8]
       jmp       qword ptr [r10+18]
; Total bytes of code 51

; MethodCallBenchmarks.CallTypeBenchmark.CallDelegate()
       mov       rax,21A80000408
       mov       r10,[rax]
       mov       rax,7FFEFFFDA130
       cmp       [r10+18],rax
       jne       short M00_L00
       xor       eax,eax
       ret
M00_L00:
       mov       edx,1
       mov       r8d,2
       mov       rcx,[r10+8]
       jmp       qword ptr [r10+18]
; Total bytes of code 51

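Too fast to measure. It cannot really be faster than the static readonly interface case, so treat this as measurement noise.
The static readonly optimization does not seem to be applied here. (Maybe because there is no demand comparable to Dictionary?)
The function pointer is stored in a field inside the delegate (_methodPtr), so the code reads it and compares it against a constant.
If it is the expected function pointer, the code returns 0 right away.
Otherwise, it also reads _target from the delegate and makes the call.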

akeit0

Pre-optimized static delegate (0.9105 ns / 2.748)

static Func<int, int, bool> comparerStaticFunc = Compare;
static readonly Func<int, int, bool> comparerStaticFuncStaticReadOnly = Compare;
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public bool CallDelegateNoTiered() => comparerFunc(1, 2);

; MethodCallBenchmarks.CallTypeBenchmark.CallDelegateNoTiered()
       sub       rsp,28
       test      byte ptr [7FFEFFE98AB0],1
       je        short M00_L01
M00_L00:
       mov       rdx,18FA8400408
       mov       rax,[rdx]
       mov       edx,1
       mov       r8d,2
       mov       rcx,[rax+8]
       add       rsp,28
       jmp       qword ptr [rax+18]
M00_L01:
       mov       rcx,offset MT_MethodCallBenchmarks.CallTypeBenchmark
       call      CORINFO_HELP_GET_NONGCSTATIC_BASE
       jmp       short M00_L00
; Total bytes of code 66

Nothing much to say here. It is slightly faster than the interface version.

akeit0

static readonly delegate of static method (1.0997 ns / 3.319)
static delegate of static method (1.0931 ns / 3.299)
static delegate of static method without profiling (1.3197 ns / 3.983)

static Func<int, int, bool> comparerStaticFunc = Compare;
static readonly Func<int, int, bool> comparerStaticFuncStaticReadOnly = Compare;

[MethodImpl(MethodImplOptions.NoInlining)]
static bool Compare(int x, int y) => x == y;

[Benchmark]
public bool CallStaticMethodDelegate() => comparerStaticFunc(1, 2);

[Benchmark]
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public bool CallStaticMethodDelegateNoTiered() => comparerStaticFunc(1, 2);

; MethodCallBenchmarks.CallTypeBenchmark.CallStaticMethodDelegateStaticReadOnly()
       mov       rdx,22C2EC00418
       mov       rax,[rdx]
       mov       edx,1
       mov       r8d,2
       mov       rcx,[rax+8]
       jmp       qword ptr [rax+18]
; Total bytes of code 32

; MethodCallBenchmarks.CallTypeBenchmark.CallStaticMethodDelegate()
       mov       rdx,20710C00410
       mov       rax,[rdx]
       mov       edx,1
       mov       r8d,2
       mov       rcx,[rax+8]
       jmp       qword ptr [rax+18]
; Total bytes of code 32

; MethodCallBenchmarks.CallTypeBenchmark.CallStaticMethodDelegateNoTiered()
       sub       rsp,28
       test      byte ptr [7FFEFFEC7B08],1
       je        short M00_L01
M00_L00:
       mov       rdx,1C9E6000410
       mov       rax,[rdx]
       mov       edx,1
       mov       r8d,2
       mov       rcx,[rax+8]
       add       rsp,28
       jmp       qword ptr [rax+18]
M00_L01:
       mov       rcx,offset MT_MethodCallBenchmarks.CallTypeBenchmark
       call      CORINFO_HELP_GET_NONGCSTATIC_BASE
       jmp       short M00_L00
; Total bytes of code 66

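This used to be quite slow, but it has improved...
Judging by the assembly, it looks like it should be faster than CallDelegateNoTiered, but [rax+8] loads _target (an object), and a call to a static method has to discard it, which requires shuffling the arguments.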
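
The difference between the two kinds of delegate is visible from C# through the public `Delegate.Target` property, which is null for a delegate over a static method. The sketch below assumes a made-up `InstanceComparer` class and `DelegateTargetDemo` holder; it shows the observable difference only, not the internal `_target` field or the argument shuffling itself.

```csharp
using System;

// Hypothetical class providing an instance method to bind a delegate to.
public sealed class InstanceComparer
{
    public bool AreEqual(int x, int y) => x == y;
}

public static class DelegateTargetDemo
{
    public static bool Compare(int x, int y) => x == y;

    // Delegate over a static method: there is no receiver object to pass.
    public static readonly Func<int, int, bool> FromStaticMethod = Compare;

    // Delegate over an instance method: the receiver travels with the delegate.
    public static readonly Func<int, int, bool> FromInstanceMethod =
        new InstanceComparer().AreEqual;

    public static void Main()
    {
        Console.WriteLine(FromStaticMethod.Target is null);   // True
        Console.WriteLine(FromInstanceMethod.Target is null); // False
        Console.WriteLine(FromStaticMethod(1, 2));            // False
        Console.WriteLine(FromInstanceMethod(2, 2));          // True
    }
}
```

An instance-method delegate can pass its receiver as the first argument unchanged, while the static-method case has one argument too many at the call site, which is where the extra shuffling comes from.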

akeit0

static readonly function pointer (0.9251 ns / 2.792)
static function pointer (0.9337 ns / 2.818)
static function pointer without profiling (0.9235 ns / 2.787)

static delegate*<int, int, bool> comparerFuncPointer = &Compare;
static readonly delegate*<int, int, bool> comparerFuncPointerStaticReadOnly = &Compare;

public bool CallFunctionPointerStaticReadOnly() => comparerFuncPointerStaticReadOnly(1, 2);

public bool CallFunctionPointer() => comparerFuncPointer(1, 2);

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public bool CallFunctionPointerNoTiered() => comparerFuncPointer(1, 2);

; MethodCallBenchmarks.CallTypeBenchmark.CallFunctionPointerStaticReadOnly()
       mov       ecx,1
       mov       edx,2
       mov       rax,7FFEFFFBA160
       jmp       rax
; Total bytes of code 23

; MethodCallBenchmarks.CallTypeBenchmark.CallFunctionPointer()
       mov       rax,[7FFEFFB8B0B8]
       mov       ecx,1
       mov       edx,2
       jmp       rax
; Total bytes of code 20

; MethodCallBenchmarks.CallTypeBenchmark.CallFunctionPointerNoTiered()
       sub       rsp,28
       test      byte ptr [7FFEFFEA7B08],1
       je        short M00_L01
M00_L00:
       mov       rax,[7FFEFFB6B0B8]
       mov       ecx,1
       mov       edx,2
       add       rsp,28
       jmp       rax
M00_L01:
       mov       rcx,offset MT_MethodCallBenchmarks.CallTypeBenchmark
       call      CORINFO_HELP_GET_NONGCSTATIC_BASE
       jmp       short M00_L00
; Total bytes of code 54

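With static readonly, about all the JIT does is turn the pointer into a constant.
It does not apply the inlining optimization that interfaces and delegates get.
Even so, its speed is essentially the same as a delegate call.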

akeit0

Here are the results again.
link

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i7-13700F, 1 CPU, 24 logical and 16 physical cores
.NET SDK 9.0.200
  [Host]   : .NET 9.0.2 (9.0.225.6610), X64 RyuJIT AVX2
  ShortRun : .NET 9.0.2 (9.0.225.6610), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3

Method	Mean	Error	StdDev	Ratio	Code Size
CallDirect	0.3313 ns	0.0092 ns	0.0005 ns	1.000	25 B
CallInterface	0.0992 ns	0.0069 ns	0.0004 ns	0.299	72 B
CallInterfaceNoTiered	1.1049 ns	0.0590 ns	0.0032 ns	3.335	73 B
CallInterfaceStaticReadOnly	0.0131 ns	0.0554 ns	0.0030 ns	0.040	3 B
CallDelegateStaticReadOnly	0.0000 ns	0.0000 ns	0.0000 ns	0.000	51 B
CallDelegate	0.0000 ns	0.0000 ns	0.0000 ns	0.000	51 B
CallDelegateNoTiered	0.9105 ns	0.0115 ns	0.0006 ns	2.748	66 B
CallStaticMethodDelegateStaticReadOnly	1.0997 ns	0.0162 ns	0.0009 ns	3.319	32 B
CallStaticMethodDelegate	1.0931 ns	0.0230 ns	0.0013 ns	3.299	32 B
CallStaticMethodDelegateNoTiered	1.3197 ns	0.0660 ns	0.0036 ns	3.983	66 B
CallFunctionPointerStaticReadOnly	0.9251 ns	0.0898 ns	0.0049 ns	2.792	23 B
CallFunctionPointer	0.9337 ns	0.1004 ns	0.0055 ns	2.818	20 B
CallFunctionPointerNoTiered	0.9235 ns	0.0066 ns	0.0004 ns	2.787	54 B

akeit0

Here are the results on .NET 6.
link

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i7-13700F, 1 CPU, 24 logical and 16 physical cores
.NET SDK 9.0.200
  [Host]   : .NET 6.0.36 (6.0.3624.51421), X64 RyuJIT AVX2
  ShortRun : .NET 6.0.36 (6.0.3624.51421), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3

Method	Mean	Error	StdDev	Median	Ratio	RatioSD	Code Size
CallDirect	0.5507 ns	0.9892 ns	0.0542 ns	0.5768 ns	1.01	0.13	24 B
CallInterface	1.2813 ns	0.9851 ns	0.0540 ns	1.2591 ns	2.34	0.23	46 B
CallInterfaceNoTiered	2.2386 ns	0.2921 ns	0.0160 ns	2.2379 ns	4.09	0.37	74 B
CallInterfaceStaticReadOnly	0.0282 ns	0.7509 ns	0.0412 ns	0.0093 ns	0.05	0.07	3 B
CallDelegateStaticReadOnly	0.1311 ns	0.0687 ns	0.0038 ns	0.1302 ns	0.24	0.02	35 B
CallDelegate	0.5012 ns	0.0981 ns	0.0054 ns	0.4981 ns	0.92	0.08	35 B
CallDelegateNoTiered	2.1521 ns	0.5494 ns	0.0301 ns	2.1672 ns	3.93	0.36	63 B
CallStaticMethodDelegateStaticReadOnly	2.0709 ns	3.0173 ns	0.1654 ns	2.0107 ns	3.79	0.43	35 B
CallStaticMethodDelegate	1.9958 ns	0.7658 ns	0.0420 ns	2.0017 ns	3.65	0.34	35 B
CallStaticMethodDelegateNoTiered	2.3247 ns	0.5166 ns	0.0283 ns	2.3395 ns	4.25	0.39	63 B
CallFunctionPointerStaticReadOnly	0.1454 ns	0.1939 ns	0.0106 ns	0.1421 ns	0.27	0.03	32 B
CallFunctionPointer	1.7347 ns	1.5508 ns	0.0850 ns	1.7800 ns	3.17	0.32	29 B
CallFunctionPointerNoTiered	1.9360 ns	2.0783 ns	0.1139 ns	1.9135 ns	3.54	0.37	57 B

akeit0

As an aside, look at

; MethodCallBenchmarks.CallTypeBenchmark.CallDirect()
       mov       ecx,1
       mov       edx,2
       jmp       qword ptr [7FFEFFFEE118]; MethodCallBenchmarks.CallTypeBenchmark.Compare(Int32, Int32)
; Total bytes of code 16

Even a direct call like this goes through a pointer load such as [7FFEFFFEE118].
This is because the JIT can replace a method's machine code later, and the indirection accommodates that.
So a function pointer to Compare also goes through the same indirection:

jmp qword ptr [7FFEFFFEE118]