🚅

InlineIfLambdaを用いて効率の良いコンピュテーション式ビルダを作成する

2022/09/28に公開

はじめに

F#で,taskコンピュテーション式が実装されました.このtask{}を実装するために,効率的な状態機械を生成するためのFS-1087が実装されました.それ以外にも,InlineIfLambda属性(FS-1098)が新しく登場し,task{}のような状態機械が必要なコンピュテーション式だけでなく,状態機械が不要なコンピュテーション式も効率よく実装できるようになりました.

今回は,Optionコンピュテーション式のビルダをInlineIfLambdaを用いた方法とそうでない方法の2通りで実装し,パフォーマンスを測ってみます.

InlineIfLambdaを用いないOptionコンピュテーション式

ぜくるさんの記事「Applicative Computation Expressions」のようにコンピュテーション式で実行する処理を書いてみます.

type OldOptionBuilder () =
    member inline __.Bind<'T, 'T2>(x : 'T voption, f : 'T -> 'T2 voption) : 'T2 voption =
        ValueOption.bind f x // let!
    member inline __.BindReturn<'T, 'T2>(x : 'T voption, f : 'T -> 'T2) : 'T2 voption =
        ValueOption.map f x // let! .. return
    member inline __.Return<'T>(x : 'T) : 'T voption =
        ValueSome x // return
    member inline __.MergeSources<'T, 'T2>(x1 : 'T voption, x2 : 'T2 voption) : struct('T * 'T2) voption =
        ValueOption.map2 (fun x y -> struct(x, y)) x1 x2 // and!

これを用いて,

let! v1 = ValueSome 8
let! v2 = ValueSome 4
and! v3 = ValueSome 2
return v1 + v2 + v3

という式をコンパイルしてみると

ValueOption.bind (fun v1 -> ValueOption.map (fun v2 v3 -> v1 + v2 + v3) (ValueOption.map2 (fun v2 v3 -> struct(v2, v3)) (ValueSome 4) (ValueSome 2))) (ValueSome 8)

のようなコードが生成されます.
コードを整形しているので,アロケーションやメソッド呼び出しが無さそう(実行時にインライン展開されそう)ですが,v1についてクロージャが生成されてしまうように,アロケーションやメソッド呼び出しがけっこう起こります.

InlineIfLambdaを用いたOptionコンピュテーション式

type OptionCode<'T> = unit -> 'T voption
type NewOptionBuilder () =
    member inline __.Run<'T>([<InlineIfLambda>]code : 'T OptionCode) : 'T voption = code ()
    member inline __.Bind<'T, 'T2>(x : 'T voption, [<InlineIfLambda>] code : 'T -> 'T2 OptionCode) : 'T2 OptionCode = 
        (fun () -> match x with
            | ValueNone -> ValueNone
            | ValueSome v -> (code v) ()) // let!
    member inline __.BindReturn<'T, 'T2>(x : 'T voption, [<InlineIfLambda>] f : 'T -> 'T2) : 'T2 OptionCode =
        (fun () -> match x with
            | ValueNone -> ValueNone
            | ValueSome v -> ValueSome (f v)) // let! .. return
    member inline __.Return<'T>(x : 'T) : 'T OptionCode = 
        (fun () -> ValueSome x) // return
    member inline __.MergeSources<'T, 'T2>(x1 : 'T voption, x2 : 'T2 voption) : struct('T * 'T2) voption =
        if x1.IsSome && x2.IsSome  then ValueSome struct(x1.Value, x2.Value) else ValueNone // and!

OptionCode<'T>は関数であり,前のバージョンでvoptionとしていたところをOptionCodeに置き換えました.(MergeSources以外)voptionでも効率の良いコードを生成しますが,F#コンパイラのテストを見ると,関数を返す形になっているので,そうしてみます.(MergeSourcesはエラーになりました.)
そして,Bindの引数codeにはInlineIfLambda属性を付与し,関数適用のところを(code v) ()としています.ここで注意したいのが,(code v ())ではインライン展開されません.F#コンパイラは後者の書き方だと真面目にcode.Invoke(v, new unit())のようなプログラムを生成しようとして,最適化しないようです.
ValueOption.mapなどのFSharp.Coreにある関数を使いたくなるところですが,残念ながらこれら関数の引数にInlineIfLambda属性がついていません.そのため,使ってしまうとインライン展開されず,効率の良いコード生成ができなくなります.残念.
OptionCodeを返すようにしたので.Runメソッドを定義しています.このメソッドが無いと,コンピュテーション式は単にOptionCode<'T>を返却してしまいます.アロケーションが発生してしまいますし,何より意図した動作ではありません.

この新しい方法で書くと,前の例は,次のようなILのプログラムを出力します.(意訳)

match ValueSome 8 with
| ValueSome v1 ->
  match struct(ValueSome 4, ValueSome 2) with
  | struct(ValueSome v2, ValueSome v3) -> 
    match ValueSome struct(v2, v3) with
    | ValueSome struct(v2, v3) -> ValueSome (v1 + v2 + v3)
    | ValueNone -> ValueNone
  | _ -> ValueNone
| ValueNone -> ValueNone

ValueTuple, ValueOption関連以外の関数呼び出しが無くなり,ヒープへのアロケーションの無いプログラムが生成されます.

ベンチマーク

ソースコード: https://gist.github.com/GnicoJP/a53479b7dbadf739b8ef98df6bfb6371

環境

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 7 PRO 5850U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.401
  [Host]     : .NET 6.0.9 (6.0.922.41905), X64 RyuJIT DEBUG
  DefaultJob : .NET 6.0.9 (6.0.922.41905), X64 RyuJIT

実行するコンピュテーション式

let mutable res = 0
for i in 1..10000000 do
    let a = someOption {
		let! a = if i % 1 = 0 then ValueSome 1 else ValueNone
		let! b = if i % 2 = 0 then ValueSome 2 else ValueNone
		let! c = if i % 3 = 0 then ValueSome 3 else ValueNone
		let! d = if i % 4 = 0 then ValueSome 4 else ValueNone
		let! e = if i % 5 = 0 then ValueSome 5 else ValueNone
		res <- a + b + c + d + e + res
		return a + b + c + d + e
	    }
    a |> ignore
res

つまり,iを1から10^7まで回して,1~5の約数(60の約数)であれば,resに15が加算されます.

F#5.0で登場したand!を用いるバージョンも用意しました.

let mutable res = 0
for i in 1..10000000 do
    let a = someOption {
		let! a = if i % 1 = 0 then ValueSome 1 else ValueNone
		and! b = if i % 2 = 0 then ValueSome 2 else ValueNone
		and! c = if i % 3 = 0 then ValueSome 3 else ValueNone
		and! d = if i % 4 = 0 then ValueSome 4 else ValueNone
		and! e = if i % 5 = 0 then ValueSome 5 else ValueNone
		res <- a + b + c + d + e + res
		return a + b + c + d + e
	    }
    a |> ignore
res

作成したビルダ

まず,従来のビルダは次の通りです.

  • old : let!を用いた.
  • oldAnd : and!を用いた.MergeSourcesのみを実装した.
  • oldAnd5 : and!を用いた.MergeSources5, Bind5Returnまで実装した.

そして,効率的なビルダは次の通りです.

  • efficient : let!を用いた.OptionCode<'T>を返す実装.
  • efficient2 : let!を用いた.voption<'T>を返す実装.
  • efficientAnd : and!を用いた.MergeSources, BindReturnのみを実装した.
  • efficientAnd5 : and!を用いた.MergeSources5, Bind5Returnまで実装した.

ベンチマーク結果

BenchmarkDotNetで計測した結果です.

Method Mean Error StdDev Gen 0 Allocated
old 312.01 ms 6.084 ms 10.165 ms 112000.0000 940,000,592 B
oldAnd 440.20 ms 8.707 ms 12.763 ms 28000.0000 240,000,504 B
oldAnd5 99.64 ms 1.719 ms 1.688 ms - 111 B
efficient 15.67 ms 0.203 ms 0.170 ms - 8 B
efficient2 14.20 ms 0.165 ms 0.146 ms - 8 B
efficientAnd 409.58 ms 4.002 ms 3.744 ms - 480 B
efficientAnd5 28.05 ms 0.558 ms 0.705 ms - 15 B

まず,すこしだけInlineIfLambdaを使わない場合のベンチマーク結果を見比べてみます.
oldoldAndを比べると,MergeSources5Bind5Returnを実装していないにもかかわらず,アロケーションを3分の1近くにまで削減しています.しかし,実行時間は伸びてしまいました.これはValueTupleの生成をするようになって,クロージャの生成が減った代わりに,ValueTupleのコンストラクタの実行で時間を消費してしまったことが理由かもしれません.(プロファイラ見たけどよくわからなかった)
しかし,oldAnd5では,Bind5Returnの1回で済むので,非常に少ないアロケーションかつ短い時間で処理を終えることができました.(というかほとんどインライン展開されていた)

次に,InlineIfLambdaを使ったビルダでは,oldに比べて,アロケーション・実行時間ともに非常に少なくなりました.oldでは1GB近くアロケーションが起こっていたのに対し,efficientではほとんどありません.実行時間も20分の1と非常に少なくなりました.
そして,efficientefficient2では実行時間・アロケーションともに差がありませんでした.デコンパイルしてみるとほぼ同じILのプログラムが出力されていることが確認できました.
efficientAndではefficientよりも遅くなってしまいました.なぜかというと,ValueTupleの生成で時間を消費してしまったからです.デコンパイル結果がすごいことになっていました.

efficientAndのデコンパイル結果
(int, (int, (int, (int, int)))) item = fSharpValueOption10.Item;
(int, (int, (int, (int, int)))) tuple = item;
(int, (int, (int, (int, int)))) tuple2 = tuple;
(int, (int, (int, int))) item2 = tuple2.Item2;
(int, (int, int)) item3 = item2.Item2;
(int, int) item4 = item3.Item2;
int e = item4.Item2;
(int, (int, (int, int))) item5 = tuple2.Item2;
(int, (int, int)) item6 = item5.Item2;
(int, int) item7 = item6.Item2;
int d = item7.Item1;
(int, (int, (int, int))) item8 = tuple2.Item2;
(int, (int, int)) item9 = item8.Item2;
int c = item9.Item1;
(int, (int, (int, int))) item10 = tuple2.Item2;
int b = item10.Item1;
int a2 = tuple2.Item1;
res = a2 + b + c + d + e + res;
fSharpValueOption11 = FSharpValueOption<int>.NewValueSome(a2 + b + c + d + e);

しかし,efficientAnd5では,非常にスリムな形で出力されており,実行時間が10分の1以上となりました.ただ,and!を使わない実装よりはやっぱり遅くなってしまっています.ValueTupleの生成に時間がかかってしまい,遅くなったのだと考えられます.(プロファイラ見てもわからない)
インライン展開されなかった時とは違い,なんでもかんでもand!で実装すればよいのではなく,bindが時間がかかる場合にのみ実装するという使い分けが重要なのかもしれません.(TODO: あるいはValueTupleではなくてTupleを使った方が良い?)

結び

InlineIfLambdaを用いて非常に効率良くコンピュテーション式を使えるようになりました.ただ,効率の良いプログラムを生成するにはビルダを慎重に作る必要がありそうです.

English Version

Title: Create an efficient computation expression builder using InlineIfLambda
Sorry if I misuse English.

Introduction

task computation expression was implemented in F# 6.0. To implement task{}, FS-1087 was impelemented for efficient state machines. Other than that, InlineIfLambda Attribute(FS-1098) was also introduced so we can implement efficient computation expressions even if state machines are not needed.

In this article, I implement the builders for the Option computation expression with and without InlineIfLambda and compare their performance.

Option computation expression without InlineIfLambda

Let's begin to write the Option computation expression without InlineIfLambda like zecl's article "Applicative Computation Expressions".

type OldOptionBuilder () =
    member inline __.Bind<'T, 'T2>(x : 'T voption, f : 'T -> 'T2 voption) : 'T2 voption =
        ValueOption.bind f x // let!
    member inline __.BindReturn<'T, 'T2>(x : 'T voption, f : 'T -> 'T2) : 'T2 voption =
        ValueOption.map f x // let! .. return
    member inline __.Return<'T>(x : 'T) : 'T voption =
        ValueSome x // return
    member inline __.MergeSources<'T, 'T2>(x1 : 'T voption, x2 : 'T2 voption) : struct('T * 'T2) voption =
        ValueOption.map2 (fun x y -> struct(x, y)) x1 x2 // and!

Using this, I compiled an expression below.

let! v1 = ValueSome 8
let! v2 = ValueSome 4
and! v3 = ValueSome 2
return v1 + v2 + v3

Then the F# compiler generates an IL program like this.

ValueOption.bind (fun v1 -> ValueOption.map (fun v2 v3 -> v1 + v2 + v3) (ValueOption.map2 (fun v2 v3 -> struct(v2, v3)) (ValueSome 4) (ValueSome 2))) (ValueSome 8)

Now that I have refactored this code, so it looks that there are no allocations or method callings(It seems to be applied inline expansion), but actually there are many allocations or method callings such as a closure for v1.

Option computation expression with InlineIfLambda

type OptionCode<'T> = unit -> 'T voption
type NewOptionBuilder () =
    member inline __.Run<'T>([<InlineIfLambda>]code : 'T OptionCode) : 'T voption = code ()
    member inline __.Bind<'T, 'T2>(x : 'T voption, [<InlineIfLambda>] code : 'T -> 'T2 OptionCode) : 'T2 OptionCode = 
        (fun () -> match x with
            | ValueNone -> ValueNone
            | ValueSome v -> (code v) ()) // let!
    member inline __.BindReturn<'T, 'T2>(x : 'T voption, [<InlineIfLambda>] f : 'T -> 'T2) : 'T2 OptionCode =
        (fun () -> match x with
            | ValueNone -> ValueNone
            | ValueSome v -> ValueSome (f v)) // let! .. return
    member inline __.Return<'T>(x : 'T) : 'T OptionCode = 
        (fun () -> ValueSome x) // return
    member inline __.MergeSources<'T, 'T2>(x1 : 'T voption, x2 : 'T2 voption) : struct('T * 'T2) voption =
        if x1.IsSome && x2.IsSome  then ValueSome struct(x1.Value, x2.Value) else ValueNone // and!

OptionCode<'T> is a function type, I replaced voption with OptionCode from the previous version. (Except for MergeSorces) Even if voption can produce efficient programs, but, in test of F# compiler returns functions, so I follow. (MergeSources throws a compiler error.)
Then I added InlineIfLambda attribute to the argument code in Bind function, and I wrote (code v) () in the application. Note that (code v ()) cannot be inline-expanded. The F# compiler does not optimize it because the F# compiler produces a honest program like code.Invoke(v, new unit()).
We want to use the functions in FSharp.Core such as ValueOption.map, but unfortunately these functions do not have InlineIfLambda attribute. So that, if you use, the inline expansion does not perform and you cannot produce effecient programs...
I make it return OptionCode, so I defined the Run method. Without this method, the computation expression solely returns OptionCode<'T>, it makes allocations and it is not an expected behaviour.

If you use this new way, the previous example will be below IL program. (free translation)

match ValueSome 8 with
| ValueSome v1 ->
  match struct(ValueSome 4, ValueSome 2) with
  | struct(ValueSome v2, ValueSome v3) -> 
    match ValueSome struct(v2, v3) with
    | ValueSome struct(v2, v3) -> ValueSome (v1 + v2 + v3)
    | ValueNone -> ValueNone
  | _ -> ValueNone
| ValueNone -> ValueNone

There are not function callings except ValueTuple or ValueOption related, and (or few) heap allocations.

Benchmark

Source Code : https://gist.github.com/GnicoJP/a53479b7dbadf739b8ef98df6bfb6371

Environment

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 7 PRO 5850U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.401
  [Host]     : .NET 6.0.9 (6.0.922.41905), X64 RyuJIT DEBUG
  DefaultJob : .NET 6.0.9 (6.0.922.41905), X64 RyuJIT

A computation expression to be executed

let mutable res = 0
for i in 1..10000000 do
    let a = someOption {
		let! a = if i % 1 = 0 then ValueSome 1 else ValueNone
		let! b = if i % 2 = 0 then ValueSome 2 else ValueNone
		let! c = if i % 3 = 0 then ValueSome 3 else ValueNone
		let! d = if i % 4 = 0 then ValueSome 4 else ValueNone
		let! e = if i % 5 = 0 then ValueSome 5 else ValueNone
		res <- a + b + c + d + e + res
		return a + b + c + d + e
	    }
    a |> ignore
res

Repeat i from 1 to 10^7, then if i can be dividable by 1 ~ 5(or 60), then res will be added by 15.

I prepared an and! version, and! appeared in F# 5.0.

let mutable res = 0
for i in 1..10000000 do
    let a = someOption {
		let! a = if i % 1 = 0 then ValueSome 1 else ValueNone
		and! b = if i % 2 = 0 then ValueSome 2 else ValueNone
		and! c = if i % 3 = 0 then ValueSome 3 else ValueNone
		and! d = if i % 4 = 0 then ValueSome 4 else ValueNone
		and! e = if i % 5 = 0 then ValueSome 5 else ValueNone
		res <- a + b + c + d + e + res
		return a + b + c + d + e
	    }
    a |> ignore
res

Builders

First, previous builders are:

  • old : I used let!.
  • oldAnd : I used and! and implemented only MergeSources
  • oldAnd5 : I used and! and implemented MergeSources5, Bind5Return.

Second, efficient builders are:

  • efficient : I used let!. It returns OptionCode<'T>.
  • efficient2 : I used let!. It returns voption<'T>.
  • efficientAnd : I used and! and implemented only MergeSources
  • efficientAnd5 : I used and! and implemented MergeSources5, Bind5Return.

The result

I measured using BenchmarkDotNet.

Method Mean Error StdDev Gen 0 Allocated
old 312.01 ms 6.084 ms 10.165 ms 112000.0000 940,000,592 B
oldAnd 440.20 ms 8.707 ms 12.763 ms 28000.0000 240,000,504 B
oldAnd5 99.64 ms 1.719 ms 1.688 ms - 111 B
efficient 15.67 ms 0.203 ms 0.170 ms - 8 B
efficient2 14.20 ms 0.165 ms 0.146 ms - 8 B
efficientAnd 409.58 ms 4.002 ms 3.744 ms - 480 B
efficientAnd5 28.05 ms 0.558 ms 0.705 ms - 15 B

Let's compare the benchmark results with and without InlineIfLambda.
Comparing old and oldAnd, oldAnd's allocation is reduced, it is one third of old's even if it is not implemented MergeSources5Bind5Return, but its wallclock time gets longer. I think, it creates ValueTuple instead of closures, so it may consume time by ValueTuple's constructors. (I did not catch the meaning from the profiler...)
However oldAnd5 calls only Bind5Return, so only very low allocations and short execution time are needed. (Almost inline-expanded.)

Next, the builders with InlineIfLambda (efficient, efficient2) have lower allocation and shorter execution time than old. In old, it requires almost 1GB allocation, but efficient does not. Its wallclock time is also a twentieth as long as old's. It found that efficient and efficient2 have no difference on execution time and allocation, because they almost the same IL program.
efficientAnd is slower than efficient because ValueTuple consumes a lot of time. Its decompiled code is very complex:

Decompiled code of efficientAnd
(int, (int, (int, (int, int)))) item = fSharpValueOption10.Item;
(int, (int, (int, (int, int)))) tuple = item;
(int, (int, (int, (int, int)))) tuple2 = tuple;
(int, (int, (int, int))) item2 = tuple2.Item2;
(int, (int, int)) item3 = item2.Item2;
(int, int) item4 = item3.Item2;
int e = item4.Item2;
(int, (int, (int, int))) item5 = tuple2.Item2;
(int, (int, int)) item6 = item5.Item2;
(int, int) item7 = item6.Item2;
int d = item7.Item1;
(int, (int, (int, int))) item8 = tuple2.Item2;
(int, (int, int)) item9 = item8.Item2;
int c = item9.Item1;
(int, (int, (int, int))) item10 = tuple2.Item2;
int b = item10.Item1;
int a2 = tuple2.Item1;
res = a2 + b + c + d + e + res;
fSharpValueOption11 = FSharpValueOption<int>.NewValueSome(a2 + b + c + d + e);

However efficientAnd5 generates a very smart code and a tenth as long as old. But, it is slower than efficient, it may takes time on constructions of ValueTuple.(I did not catch the meaning from the profiler...)
It may important to think about whether implement and! only if bind takes for a long time, unlike no-inline-expanded computation expression. (TODO: Or, prefer Tuple to ValueTuple?)

Conclusion

We can use computation expressions efficiently with InlineIfLambda. However we must design carefully to generate efficient programs.

Discussion