🐣

Juliaの生成関数(@generated)の使い方とパフォーマンス

2021/12/19に公開

2件

これはJulia Advent Calendar 2021の20日目の記事です。

実行環境は以下です。

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 7 2700X Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver1)

生成関数ってなに？

@generatedマクロを使って定義される関数のことを生成関数と呼びます！
公式ドキュメントはこちら

最初の例

定義

まずは具体例から見ていきましょう。

julia> f(x::Real) = x+2  # 実数には+2する
f (generic function with 1 method)

julia> f(x::Integer) = x+1  # ただし整数なら+1する
f (generic function with 2 methods)

julia> @generated function g(x::Real)
           if x <: Integer
               return :(x+1)  # 整数なら+1する
           else
               return :(x+2)  # それ以外の実数で+2する
           end
       end
g (generic function with 1 method)

julia> function h(x::Real)  # Juliaっぽくない書き方
           if x isa Integer
               return x+1  # 整数なら+1する
           else
               return x+2  # それ以外の実数で+2する
           end
       end
h (generic function with 1 method)

それぞれ以下のような定義になっています。

f: 多重ディスパッチによる定義
- 型に応じて戻り値を切り替える方法として最も由緒正しい。
g: @generatedマクロによる定義
- 詳細は後述しますが、if文によって型の分岐を入れています。
h: 型をif文で分岐させて定義
- gの定義と同じように、愚直にifで型の分岐をしています。

これらの関数が同じ動作をすることを確認しましょう。

julia> f(1), g(1), h(1)
(2, 2, 2)

julia> f(1.0), g(1.0), h(1.0)
(3.0, 3.0, 3.0)

意図通り、整数なら+1、それ以外の実数なら+2していますね。

詳細

では、これらの動作は厳密に同じでしょうか？
@code_loweredマクロで確認してみましょう。

julia> @code_lowered f(1.0)
CodeInfo(
1 ─ %1 = x + 2
└──      return %1
)

julia> @code_lowered g(1.0)
CodeInfo(
    @ REPL[3]:1 within `g`
   ┌ @ REPL[3] within `macro expansion`
1 ─│ %1 = x + 2
└──│      return %1
   └
)

julia> @code_lowered h(1.0)
CodeInfo(
1 ─ %1 = x isa Main.Integer
└──      goto #3 if not %1
2 ─ %3 = x + 1
└──      return %3
3 ─ %5 = x + 2
└──      return %5
)

fとgは同一の低レベル表現になりましたが、hには不要なInteger判定が残ってしまっています。愚直なifによる型の分岐を避けるべき理由がこれで、hではコンパイル時の最適化が効きにくくなってしまいます。多重ディスパッチによる定義fがパフォーマンスの面からは最も適切です。

生成関数の定義gで多重ディスパッチの定義fと同等の表現になったのはどういう理由でしょうか？

生成関数gの定義を確認しましょう。

@generated function g(x::Real)
    if x <: Integer
        return :(x+1)
    else
        return :(x+2)
    end
end

少し紛らわしいですが、変数xには3つの役割があります。

関数の引数のxはRealのインスタンスとして扱われている
- x::Real
- 多重ディスパッチのための記述。
関数定義の内部ではxはRealの部分型として扱われている
- x <: Integer
- 型の分岐のために使われる。
戻り値の式の内部ではxはSymbolとして扱われている
- 戻り値はExprになっている。(:(x+1)や:(x+2)など)
- ここでのxは:()で囲まれて一つの式Exprになっているので、Symbolになっている。

JuliaはJITコンパイルなので、関数の初回実行時にコンパイルが行われます。
@generatedマクロを使って定義された関数では、その初回実行時に関数を生成するような仕組みになっています。これが生成関数(generated function)と呼ばれている理由です。

関数gにprintlnを付け加えて、初回実行時のみにメッセージが表示されるようにしてみましょう。

julia> @generated function g_print(x::Real)
           println("型$(x)での初回実行！")
           if x <: Integer
               return :(x+1)
           else
               return :(x+2)
           end
       end
g_print (generic function with 1 method)

julia> g_print(1)
型Int64での初回実行！
2

julia> g_print(1)
2

julia> g_print(1)
2

julia> g_print(1.0)
型Float64での初回実行！
3.0

julia> g_print(1.0)
3.0

ちゃんと初回実行時のみにメッセージが表示されましたね！

もう少し複雑な例

定義

型の引数によって挙動を変えたいなら、普通に多重ディスパッチを使えば済む話です。最初の例は簡単すぎました。
もう少し複雑な例を挙げたいと思います。

julia> using LinearAlgebra, BenchmarkTools

julia> function dot123(s::NTuple{N,Int}) where N
           v = 1:N
           return dot(s,v)
       end
dot123 (generic function with 1 method)

julia> @generated function dot123_gen(s::NTuple{N,Int}) where N
           term(i) = :(s[$i]*$i)
           ex = Expr(:call, :+, [term(i) for i in 1:N]...)
           return ex
       end
dot123_gen (generic function with 1 method)

これは与えられたN項のIntのタプルをベクトルだと思って、別のベクトル1:Nと内積を取るものになります。

dot123はLinearAlgebra.jlを使ってdotで内積を計算するもので、dot123_genは生成関数によって内積を計算するものになります。
例として $(-1,4,8,2,0,-3)$ との内積

\begin{aligned} &(-1,4,8,2,0,-3) \cdot (1,2,3,4,5,6) \\ ={}&-1\times 1 +4\times 2 +8\times 3 +2\times 4 +0\times 5 +-3\times 6 \\ ={}&21\end{aligned}

を計算してみます。

julia> dot123((-1,4,8,2,0,-3))
21

julia> dot123_gen((-1,4,8,2,0,-3))
21

ちゃんと一致していましたね。

ベンチマーク

ベンチマークを取ってみましょう。

julia> @benchmark dot123((-1,4,8,2,0,-3))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.562 ns … 8.697 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.583 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.636 ns ± 0.133 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █              ▃                                          
  ▆█▆▂▁▁▁▁▁▁▁▁▁▁▁██▂▂▂▁▁▁▁▂▂▂▂▁▁▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  1.56 ns        Histogram: frequency by time       2.05 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark dot123_gen((-1,4,8,2,0,-3))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.020 ns … 0.031 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.020 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.022 ns ± 0.004 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    ▃                                             ▆      ▁
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▇ █
  0.02 ns     Histogram: log(frequency) by time    0.031 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

イェイ！生成関数の方がパフォーマンス良いですね！！
75倍くらい速くなっています。

こちらでも@code_loweredマクロで確認してみましょう。

julia> @code_lowered dot123((-1,4,8,2,0,-3))
CodeInfo(
1 ─      v = s
│        w = 1:$(Expr(:static_parameter, 1))
│   %3 = Main.dot(v, w)
└──      return %3
)

julia> @code_lowered dot123_gen((-1,4,8,2,0,-3))
CodeInfo(
    @ REPL[3]:1 within `dot123_gen`
   ┌ @ REPL[3] within `macro expansion`
1 ─│ %1  = Base.getindex(s, 1)
│  │ %2  = %1 * 1
│  │ %3  = Base.getindex(s, 2)
│  │ %4  = %3 * 2
│  │ %5  = Base.getindex(s, 3)
│  │ %6  = %5 * 3
│  │ %7  = Base.getindex(s, 4)
│  │ %8  = %7 * 4
│  │ %9  = Base.getindex(s, 5)
│  │ %10 = %9 * 5
│  │ %11 = Base.getindex(s, 6)
│  │ %12 = %11 * 6
│  │ %13 = %2 + %4 + %6 + %8 + %10 + %12
└──│       return %13
   └
)

dot123の方では内積の計算がdotに丸投げされていますが、dot123_genの方では内積の計算がすべて展開されています。これによって高速化できたように思えます。

StaticArrays.jlを使った実装

そういえば、StaticArrays.jlは固定長配列を扱うためのパッケージで、内部的にベクトルはタプルとして表現してされているのでした。

ややこしい@generatedマクロを使わずとも、StaticArrays.jlで十分高速なら嬉しいですよね。

julia> using StaticArrays

julia> dot123_sa(s::NTuple{N,Int}) where N = dot(SVector(s), StaticArrays.SUnitRange{1,N}())
dot123_sa (generic function with 1 method)

julia> dot123_sa((-1,4,8,2,0,-3))
21

ベンチマークを取ってみましょう．

julia> @benchmark dot123_sa((-1,4,8,2,0,-3))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.572 ns … 11.361 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.603 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.602 ns ±  0.125 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▃             █                 
  ▂▃▁▁▁▁▁▁▁▁▁▁▁▂▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▂▁▁▁▁▁▁▁▁▁▁▁█▁▂▁▁▁▁▁▁▁▁▁▁▁▄▂ ▂
  1.57 ns        Histogram: frequency by time        1.61 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

あれれ、速くないですね…。念のため低レベル表現を確認してみましょう。

julia> @code_lowered dot123_sa((-1,4,8,2,0,-3))
CodeInfo(
1 ─ %1 = Core.apply_type(Main.SVector, $(Expr(:static_parameter, 1)))
│   %2 = (%1)(s)
│   %3 = StaticArrays.SUnitRange
│   %4 = (%3)(1, $(Expr(:static_parameter, 1)))
│   %5 = Main.dot(%2, %4)
└──      return %5
)

dot123_saはdot123と同様に、低レベル表現では内積の計算が展開されていないですね。

では、LLVM中間表現ではどのようになっているでしょうか？

julia> @code_llvm dot123_gen((-1,4,8,2,0,-3))
;  @ REPL[3]:1 within `dot123_gen`
define i64 @julia_dot123_gen_1594([6 x i64]* nocapture nonnull readonly align 8 dereferenceable(48) %0) #0 {
top:
; ┌ @ REPL[3] within `macro expansion`
; │┌ @ tuple.jl:29 within `getindex`
    %1 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 0
; │└
; │┌ @ int.jl:88 within `*`
    %2 = load i64, i64* %1, align 8
(長いので省略)

julia> @code_llvm dot123_sa((-1,4,8,2,0,-3))
(長いので省略)

出力された中間表現では;以下がコメントなので削除できます。適当にインデントを揃えれば以下のようになります。

[生成関数dot123_genの出力]

julia> @code_llvm dot123_gen((-1,4,8,2,0,-3))
define i64 @julia_dot123_gen_1594([6 x i64]* nocapture nonnull readonly align 8 dereferenceable(48) %0) #0 {
top:
    %1 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 0
    %2 = load i64, i64* %1, align 8
    %3 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 1
    %4 = load i64, i64* %3, align 8
    %5 = shl i64 %4, 1
    %6 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 2
    %7 = load i64, i64* %6, align 8
    %8 = mul i64 %7, 3
    %9 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 3
    %10 = load i64, i64* %9, align 8
    %11 = shl i64 %10, 2
    %12 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 4
    %13 = load i64, i64* %12, align 8
    %14 = mul i64 %13, 5
    %15 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 5
    %16 = load i64, i64* %15, align 8
    %17 = mul i64 %16, 6
    %18 = add i64 %5, %2
    %19 = add i64 %18, %8
    %20 = add i64 %19, %11
    %21 = add i64 %20, %14
    %22 = add i64 %21, %17
    ret i64 %22
}

[StaticArrays.jlを使ったdot123_saの出力]

julia> @code_llvm dot123_sa((-1,4,8,2,0,-3))
define i64 @julia_dot123_sa_1596([6 x i64]* nocapture nonnull readonly align 8 dereferenceable(48) %0) #0 {
top:
    %1 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 0
    %2 = load i64, i64* %1, align 8
    %3 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 1
    %4 = load i64, i64* %3, align 8
    %5 = shl i64 %4, 1
    %6 = add i64 %5, %2
    %7 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 2
    %8 = load i64, i64* %7, align 8
    %9 = mul i64 %8, 3
    %10 = add i64 %9, %6
    %11 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 3
    %12 = load i64, i64* %11, align 8
    %13 = shl i64 %12, 2
    %14 = add i64 %13, %10
    %15 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 4
    %16 = load i64, i64* %15, align 8
    %17 = mul i64 %16, 5
    %18 = add i64 %17, %14
    %19 = getelementptr inbounds [6 x i64], [6 x i64]* %0, i64 0, i64 5
    %20 = load i64, i64* %19, align 8
    %21 = mul i64 %20, 6
    %22 = add i64 %21, %18
    ret i64 %22
}

LLVM中間表現の読み方に筆者は詳しくないのですが、

getelementptrの周辺はタプルの要素を取ってくる操作
mul, addはタプルの要素を取ってくる操作

と思えば特に難しくなく、dot123_genとdot123_saはどちらもほぼ同じLLVM中間表現になっています。両者で異なるのはタプルの要素取得の順番の違いのみで、処理内容は同一のようです。

あれれれ、、LLVM中間表現がほぼ同じなのに実行速度が75倍程度も異なるのはかなり不思議ですね。パフォーマンスの差異はどこから来ているのでしょうか？

確証は無いですが、上記の「ほぼ同じ」と考えていたところに誤りがあるような気がしています。LLVM中間表現が処理内容として同一に見えていても、CPU上のキャッシュの最適化の観点から考えれば同一視するのは間違っていたかも知れません。

これについて行った実験を次節では紹介します。

ベンチマークの差異に関する実験

通常の関数の実行時間

現状のベンチマークは以下のようになっています。

dot123 1.6ns程度
dot123_gen 0.02ns程度
dot123_sa 1.6ns程度

@generatedマクロの性質が影響して、dot123_genが速くなっている可能性があるかと予想しましたが、次の実行結果を見る限りは@generatedマクロ無しでも計測結果が速いこともあるようですね。

julia> dot123_plain(s::NTuple{6,Int}) = s[1]*1+s[2]*2+s[3]*3+s[4]*4+s[5]*5+s[6]*6
dot123_plain (generic function with 1 method)

julia> dot123_plain((-1,4,8,2,0,-3))
21

julia> @benchmark dot123_plain((-1,4,8,2,0,-3))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.020 ns … 3.877 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.020 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.023 ns ± 0.039 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    ▃                                             ▆    ▁ ▁
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  0.02 ns     Histogram: log(frequency) by time    0.031 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

繰り返した場合の実行時間

dot123_genとdot123_saについて1000回ずつ繰り返してみましょう。

julia> @benchmark sum(dot123_gen((-1,4,8,2,0,-i)) for i in 1:1000)
BenchmarkTools.Trial: 10000 samples with 646 evaluations.
 Range (min … max):  190.406 ns … 586.574 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     194.051 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   195.766 ns ±  10.211 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂█▇▇▆▇▆▂▂▂▂▂▁                                                 ▂
  █████████████▇█▆▇▆▅▅▄▅▃▅▄▄▄▄▅▅▆▅▄▆▅▄▄▁▃▄▃▃▁▁▃▅▅▃▃▃▄▆▄▅▆▅▆▆▇▇█ █
  190 ns        Histogram: log(frequency) by time        254 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(dot123_sa((-1,4,8,2,0,-i)) for i in 1:1000)
BenchmarkTools.Trial: 10000 samples with 656 evaluations.
 Range (min … max):  190.421 ns … 408.410 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     194.316 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   198.860 ns ±  18.472 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▂                                                            
  ▆██▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂ ▂
  190 ns           Histogram: frequency by time          303 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

dot123_genの方が速いですが、平均実行時間は195.766 ns/198.860 nsの僅かな差でしかなく、当初ような75倍の差はありませんでした。

一回あたりのdot123_gen/dot123_saの実行時間は大雑把に割り算して0.195 ns/0.198 nsになります。一度のみの実行では0.022 ns/1.6 ns程度でしたから、dot123_gen/dot123_saはそれぞれ低速化/高速化したことになります。

dot123_saが高速化されたのは「ベンチマーク時のオーバーヘッドが減った」or「複数回呼び出しに関する最適化が行われた」と予想できます。dot123_genが低速化されたのはまだ謎ですね。

100回ずつ繰り返すとどうなるでしょうか？

julia> @benchmark sum(dot123_gen((-1,4,8,2,0,-i)) for i in 1:100)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.222 ns … 13.936 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.232 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.261 ns ±  0.160 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█ ▄                                         ▃ ▂ ▁▁ ▁ ▁    ▁
  ██▁█▁▄▁▁▄▁▄▁▁▁▃▃▁▄▁▄▁▄▁▁▄▁▃▁▃▄▁▁▁▄▁▃▅▁▅▁▄▁▅█▁█▁█▁██▁█▁█▁█▇ █
  1.22 ns      Histogram: log(frequency) by time     1.55 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(dot123_sa((-1,4,8,2,0,-i)) for i in 1:100)
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  19.595 ns … 76.523 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.430 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.026 ns ±  1.725 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▃▂ ██▆▆▃       ▁ ▆▆                            ▂    ▁     ▂
  ▇█████████▄▄▄▄▂▄▄████▆▅▄▄▄▄▅▆▄▅▅▅▆▄▆▅▅▅▅▅▅▄▅▄▂▄▆▄█▄▃▆▇█▄▅▅▇ █
  19.6 ns      Histogram: log(frequency) by time      26.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

明らかにdot123_genの方がdot123_saよりも速いですね！(17倍程度)

dot123_genを使った100回計算の平均実行時間が1.261 nsだったので一回あたり0.012 nsになります。当初の0.022 nsに比べて高速になっているようです。つまり、繰り返し回数が多くなるとdot123_genの一回あたりの実行時間が増加するようです。

平均だけ見たいのでStatistics.meanを使いましょう。

julia> mean(@benchmark sum(dot123_gen((-1,4,8,2,0,-i)) for i in 1:100))
BenchmarkTools.TrialEstimate: 
  time:             1.249 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0

julia> mean(@benchmark sum(dot123_gen((-1,4,8,2,0,-i)) for i in 1:101))
BenchmarkTools.TrialEstimate: 
  time:             1.234 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0

julia> mean(@benchmark sum(dot123_gen((-1,4,8,2,0,-i)) for i in 1:102))
BenchmarkTools.TrialEstimate: 
  time:             1.226 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0

julia> mean(@benchmark sum(dot123_gen((-1,4,8,2,0,-i)) for i in 1:103))
BenchmarkTools.TrialEstimate: 
  time:             53.039 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0

julia> mean(@benchmark sum(dot123_gen((-1,4,8,2,0,-i)) for i in 1:104))
BenchmarkTools.TrialEstimate: 
  time:             52.949 ns
  gctime:           0.000 ns (0.00%)
  memory:           0 bytes
  allocs:           0

繰り返し回数が102回を超えたところから急激に実行時間が増えました。

まだ確証は無いですが、上記の「ほぼ同じ」と考えていたところに誤りがあるような気がしています。LLVM中間表現が処理内容として同一に見えていても、CPU上のキャッシュの最適化の観点から考えれば同一視するのは間違っていたかも知れません。

前節で上記のように書いていたのはこの実行結果を踏まえたものになります。

処理のサイズが増えて不連続^[1]に実行時間が変わる、というのはCPUキャッシュが使えるか否かに恐らく関連しているはずです。
LLVM中間表現が同値に見えても、処理の順番によってはキャッシュに載せやすいかどうかが変わることがあり、これがdot123_genとdot123_saの実行時間の差になったのだと思います。^[2]

結論として、@generatedマクロを使った生成関数dot123_genの方がStaticArrays.SVectorを使った関数dot123_saに比べて高速になっていました。ただ、@generatedマクロを使わずに済む方が嬉しく、一般ユーザーがCPUのキャッシュまではあまり考えたくないとも思います。将来的にはJulia本体がより強力な最適化を行うようになってdot123_genとdot123_saが同等のパフォーマンスを出すようになっているかも知れませんね。

1要素タプルの場合

(2021/12/20更新)
試しに1要素タプル(2,)を入れてみたところ、dot123_genとdot123_saとで完全に同一のLLVM中間表現が得られました。

julia> @code_llvm dot123_gen((2,))
;  @ REPL[6]:1 within `dot123_gen`
define i64 @julia_dot123_gen_735([1 x i64]* nocapture nonnull readonly align 8 dereferenceable(8) %0) #0 {
top:
; ┌ @ REPL[6] within `macro expansion`
; │┌ @ tuple.jl:29 within `getindex`
    %1 = getelementptr inbounds [1 x i64], [1 x i64]* %0, i64 0, i64 0
; │└
; │┌ @ int.jl:88 within `*`
    %2 = load i64, i64* %1, align 8
; │└
   ret i64 %2
; └
}

julia> @code_llvm dot123_sa((2,))
;  @ REPL[2]:1 within `dot123_sa`
define i64 @julia_dot123_sa_737([1 x i64]* nocapture nonnull readonly align 8 dereferenceable(8) %0) #0 {
top:
  %1 = getelementptr inbounds [1 x i64], [1 x i64]* %0, i64 0, i64 0
  %2 = load i64, i64* %1, align 8
  ret i64 %2
}

しかし、依然として以下のように速度の差はありました。

julia> @benchmark dot123_gen((2,))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  0.020 ns … 0.031 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     0.020 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   0.022 ns ± 0.004 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    ▃                                             ▆      ▁
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  0.02 ns     Histogram: log(frequency) by time    0.031 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark dot123_sa((2,))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.202 ns … 7.755 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.222 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.223 ns ± 0.080 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                      █                      
  ▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂ ▂
  1.2 ns         Histogram: frequency by time       1.23 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

「繰り返し回数の少ない場合にdot123_genでキャッシュが効いてる」というのは多分正しいと思いたいですが、完全によく分からなくなってきました。

詳しい方がいればコメントで教えていただきたいです…！

生成関数の使いどき

無限個の多重ディスパッチ

dot123_genにおいてはNTuple{N,Int}のそれぞれのNに多重ディスパッチが定義できました。
型パラメータのNは整数で、可算無限個の多重ディスパッチが定義できたことになります。^[3]

他の整数を使ったパラメトリック型にはArray{Float64, N}, SVector{N,Float64}などがあり、これらのパラメータNに応じて関数を定義したいときにも生成関数が便利です。

生成関数を使うかの判断は以下のように考えればOKです。

普通の型の分岐であれば、最初の例(f, g, h)でのfのように多重ディスパッチを使えばOK

メソッドの数が高々有限個であっても、多重ディスパッチを使えば良い。

有限個のパラメータに対して多重ディスパッチを定義する際には@evalマクロを使うことも可能です。

julia> for N in 1:10
           term(i) = :(s[$i]*$i)
           ex = Expr(:call, :+, [term(i) for i in 1:N]...)
           @eval function dot123_eval(s::NTuple{$N,Int})
               $ex
           end
       end

julia> dot123_eval((-1,4,8,2,0,-3))
21

julia> dot123_eval
dot123_eval (generic function with 10 methods)

無限個のメソッド^[4]が必要になれば、自前で多重ディスパッチを定義できない。
- ここで生成関数が便利！

実行初回のみに何らかの処理をしたい場合

生成関数を使えば、g_printの例で見たように、初回のみ実行する処理を与えることができます。具体的な用途は思いつかないですが、便利なことがあるかも知れません。

この記述に関して誤りがありました。詳細はantimon2さんからのコメントをご覧ください！

参考文献

Understanding generated functions
- Julia Discourceでの生成関数についての質問。
- 回答が分かりやすい。
Metaprogramming
- Juliaの公式ドキュメント。
- 生成関数だけでなく、メタプログラミング全般について記載がある。

脚注

繰り返し回数は整数なのでそもそも通常の連続関数として考えられないですが、意味は伝わると思います。 ↩︎
他のCPU環境で試した場合には本記事のベンチマークとは異なるものが得られると思います。 ↩︎
コンピュータ上で扱う整数なので厳密には嘘です。 ↩︎
厳密には生成関数で作られるメソッドは1つだけです。型に応じて選ばれるコードが無限個ある、という意味で無限個のメソッドと書いていました。 ↩︎

GitHubで編集を提案

Discussion

antimon2

まだ誰からもツッコミがないので一応。

実行初回のみに何らかの処理をしたい場合

これはNGです。つまり「実行初回にのみ何らかの処理をするという動作を期待して生成関数を利用する」のは正しくない用途です。

参考文献（公式ドキュメント）の

Don't copy these examples!

と書かれているあたりに、理由として以下が記述されています。

the foo function has side-effects (the call to Core.println), and it is undefined exactly when, how often or how many times these side-effects will occur

つまり（通常の利用ではたまたま）地の文の処理は最初の1回しか実行されないのですが、それは最適化の結果二度と再生成がされなかった場合だけで、ひょっとしたら何らかの拍子に再生成が走ることがありその時は再び実行されることもある、しかもそれは完全に非決定的である（つまり初回のみと言う保証はなく、いつ何度実行されるか分からない）からです。

あまり意味のない例ですが、例えば @code_lowered を併用すると毎回「初回にのみ実行される（ことを期待して記述した）コード」が実行されます。

julia> @generated function f(x)
           Core.println("Run for the type $x")
           rand(1:10)
       end
f (generic function with 1 method)

julia> @code_lowered f(1)
Run for the type Int64
CodeInfo(
    @ REPL[1]:1 within `f`
   ┌ @ REPL[1]:1 within `macro expansion`
1 ─│     return 3
   └
)

julia> @code_lowered f(1)
Run for the type Int64
CodeInfo(
    @ REPL[1]:1 within `f`
   ┌ @ REPL[1]:1 within `macro expansion`
1 ─│     return 8
   └
)

Hyrodium

コメント・コード例ありがとうございます！ドキュメント見落としていました。。記事本文を修正いたしました。