I'm taking on the challenge of implementing a WASM runtime while referring to the specification, but working out how to read the spec is hard.

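Rough chapter structure

Chapter	Contents
Introduction	A loose overview and the concepts
Structure	The internal structure of WASM as a language, so to speak (kinds of types, kinds of instructions, and so on)
Validation	How to validate the Structure above (probably)
Execution	How to execute the Structure above (probably)
Binary Format	How to convert the binary layout into the Structure
Text Format	The .wat text format (probably)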

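For building a runtime, the flow is:

Read the Binary Format and expand it in memory as an internal representation in the form of the Structure
Validate the resulting Structure according to Validation
Run it according to Execution

So it seems best to understand the chapters in this order:

Binary Format
Structure
Validation
Execution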

乳牛

Reading the spec is fine, but the punchline is that dumping a module with wasm-tools gave me a much better picture.

$ wasm-tools dump wasi-copy.wasm | less
      0x0 | 00 61 73 6d | version 1 (Module)
          | 01 00 00 00
      0x8 | 01 b9 01    | type section
      0xb | 19          | 25 count
      0xc | 60 00 00    | [type 0] SubType { is_final: false, supertype_idx: None, structural_type: Func(FuncType { params: [], returns: [] }) }
      0xf | 60 01 7f 00 | [type 1] SubType { is_final: false, supertype_idx: None, structural_type: Func(FuncType { params: [I32], returns: [] }) }
     0x13 | 60 01 7f 01 | [type 2] SubType { is_final: false, supertype_idx: None, structural_type: Func(FuncType { params: [I32], returns: [I64] }) }
          | 7e

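Binary representation of a Module

It is described in the spec, more or less, but the article below helped my understanding a great deal.

The overall structure is as follows.

4-byte magic
4-byte version number
After that, a repetition of sections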

Magic

The first 4 bytes of any WASM binary are always 0x00 0x61 0x73 0x6D.
As an ASCII string, this is "\0asm".

Version

A 4-byte fixed-width number in little-endian. Apparently it is always version 1 for now.
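
As a rough sketch of the preamble check described above (the name `parse_preamble` and the error strings are my own, not from the spec), something like this seems to be enough:

```rust
/// The 4-byte magic "\0asm" that starts every WebAssembly binary.
const WASM_MAGIC: [u8; 4] = [0x00, 0x61, 0x73, 0x6D];

/// Checks the 8-byte preamble and returns the version number.
/// `parse_preamble` is a hypothetical helper name used only for this sketch.
fn parse_preamble(bytes: &[u8]) -> Result<u32, String> {
    if bytes.len() < 8 {
        return Err("input is shorter than the 8-byte preamble".to_string());
    }
    if !bytes.starts_with(&WASM_MAGIC) {
        return Err("magic mismatch".to_string());
    }
    // The version is a fixed-width little-endian u32, not LEB128.
    let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    Ok(version)
}
```

Fed the first 8 bytes from the dump above (00 61 73 6d 01 00 00 00), this returns version 1. Whether to reject versions other than 1 is left to the caller in this sketch.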

Section

A SectionID (the kind of section) in 1 byte
The content size, encoded as a LEB128 u32
The byte sequence of the content (how to interpret it depends on the kind of section)
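
The id / size / content layout above can be walked with a small loop. This is a minimal sketch, assuming the 8-byte preamble has already been stripped; `RawSection`, `read_u32_leb`, and `split_sections` are names I made up, and the LEB128 reader is simplified (it does not reject unused high bits in a fifth byte).

```rust
/// One raw section: its id and the still-undecoded content bytes.
#[derive(Debug, PartialEq)]
struct RawSection<'a> {
    id: u8,
    content: &'a [u8],
}

/// Reads an unsigned LEB128 u32 starting at `*pos` and advances `*pos`.
/// Returns None on truncated input or if more than 5 bytes are used.
fn read_u32_leb(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let mut result: u32 = 0;
    for shift in (0..32).step_by(7) {
        let byte = *bytes.get(*pos)?;
        *pos += 1;
        result |= u32::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
    }
    None
}

/// Splits everything after the preamble into raw sections.
fn split_sections(body: &[u8]) -> Option<Vec<RawSection<'_>>> {
    let mut sections = Vec::new();
    let mut pos = 0;
    while pos < body.len() {
        let id = body[pos];
        pos += 1;
        let size = read_u32_leb(body, &mut pos)? as usize;
        let end = pos.checked_add(size)?;
        // `get` returns None if the declared size runs past the input.
        let content = body.get(pos..end)?;
        sections.push(RawSection { id, content });
        pos = end;
    }
    Some(sections)
}
```

In the dump above, the type section starts with 01 b9 01: id 1, then the size b9 01, which this reader decodes as 185.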

Custom Section

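As far as WASM is concerned, it can be skipped and ignored (it is intended for use by third-party tools)
The definition is loose: just a Name and a byte sequence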

Type Section, Function Section, Code Section

These three are closely related in how WASM defines functions.

Type

It decodes into a vector of function types that represent the types component of a module.

First, the Type Section. It lists function types (which can be thought of as function signatures).

f(x:T1 ,y:T2, z:T3) -> (a:T4 ,b:T5 ,c:T6)
f(w: T8) -> (d: T9)
f(s: T10, t: T11) -> (e: T12)

It plays a role similar to a C header.
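
To make this concrete, here is a minimal sketch that decodes the content of a type section into a list of signatures. The names `Reader`, `FuncType`, and `decode_type_section` are my own; only the four number types are handled (vector and reference types are left out to keep it short), and 0x60 is the marker byte that precedes each function type in the dump above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ValType { I32, I64, F32, F64 }

#[derive(Debug, PartialEq)]
struct FuncType {
    params: Vec<ValType>,
    results: Vec<ValType>,
}

/// A tiny cursor over the section content (hypothetical helper).
struct Reader<'a> {
    bytes: &'a [u8],
    pos: usize,
}

impl Reader<'_> {
    fn byte(&mut self) -> Option<u8> {
        let b = *self.bytes.get(self.pos)?;
        self.pos += 1;
        Some(b)
    }

    /// Simplified unsigned LEB128 reader for u32 (at most 5 bytes).
    fn u32_leb(&mut self) -> Option<u32> {
        let mut result = 0u32;
        for shift in (0..32).step_by(7) {
            let b = self.byte()?;
            result |= u32::from(b & 0x7F) << shift;
            if b & 0x80 == 0 {
                return Some(result);
            }
        }
        None
    }

    /// A count followed by that many value type bytes.
    fn val_types(&mut self) -> Option<Vec<ValType>> {
        let count = self.u32_leb()?;
        (0..count)
            .map(|_| match self.byte()? {
                0x7F => Some(ValType::I32),
                0x7E => Some(ValType::I64),
                0x7D => Some(ValType::F32),
                0x7C => Some(ValType::F64),
                _ => None, // vector and reference types are omitted in this sketch
            })
            .collect()
    }
}

/// Decodes a type section body: a count, then `0x60 params results` per entry.
fn decode_type_section(content: &[u8]) -> Option<Vec<FuncType>> {
    let mut r = Reader { bytes: content, pos: 0 };
    let count = r.u32_leb()?;
    let mut types = Vec::new();
    for _ in 0..count {
        if r.byte()? != 0x60 {
            return None;
        }
        let params = r.val_types()?;
        let results = r.val_types()?;
        types.push(FuncType { params, results });
    }
    Some(types)
}
```

Running it on the three entries from the dump (60 00 00, 60 01 7f 00, 60 01 7f 01 7e, with a count of 3 in front) gives () -> (), (i32) -> (), and (i32) -> (i64), matching what wasm-tools printed.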

Code

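In terms of order, the Function section comes first, but I could not see the point of the Function section until I understood the Code section, so I will unpack this one first. If the Type section above is the header, this is the definition of the function bodies.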

It decodes into a vector of code entries that are pairs of value type vectors and expressions. They represent the locals and body field of the functions in the funcs component of a module.

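The notation is cluttered, but in short, Code refers to the description of a function body. A function body is a struct like the one below, and the Code section lists the types of the local variables and the Instructions of the body.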

struct Code {
    // Decoded from the Function section described later
    typeidx: u32,
    // Decoded from the Code section
    locals: Vec<ValType>,
    body: Vec<Instruction>,
}

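One slightly confusing point: each locals entry is itself written as $t^n$ (the type t repeated n times), so a nested-array-like structure is implied, but in the end it is meant to be flattened.

The meta function concat((t*)*) concatenates all sequences t_i* in (t*)*.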
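
My reading of the flattening, as a small sketch: each entry in the binary is a (count, type) pair, and the function's locals are all of those runs expanded and concatenated. `flatten_locals` is a name I made up, and the input is assumed to be already decoded into pairs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ValType { I32, I64, F32, F64 }

/// Expands run-length (count, type) groups into one flat list of locals,
/// which is what concat((t*)*) amounts to.
fn flatten_locals(groups: &[(u32, ValType)]) -> Vec<ValType> {
    groups
        .iter()
        .flat_map(|&(n, t)| std::iter::repeat(t).take(n as usize))
        .collect()
}
```

So the groups (2, i32), (1, f64) become the three locals i32, i32, f64. A real decoder would probably want to cap the total count before expanding, since n comes straight from untrusted input.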

Function

It decodes into a vector of type indices that represent the type fields of the functions in the funcs component of a module. The locals and body fields of the respective functions are encoded separately in the code section.

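A typeidx is simply a u32, so this is an array of integers, and each number is an index into types.

The Function section is the information that links Code and Types, with the relationship shown in the figure below.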

Function is
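
The linkage can be sketched as follows: the i-th entry of the Function section picks a signature from types, and the i-th entry of the Code section supplies the body. Strings stand in for the real signature and body types here, and `Func` and `link_functions` are hypothetical names.

```rust
/// A function assembled from the three sections (strings are stand-ins).
#[derive(Debug, PartialEq)]
struct Func<'a> {
    signature: &'a str,
    body: &'a str,
}

/// Pairs the i-th type index with the i-th code body.
/// Returns None if the two sections differ in length or an index is out of range.
fn link_functions<'a>(
    types: &[&'a str],
    type_indices: &[u32],
    bodies: &[&'a str],
) -> Option<Vec<Func<'a>>> {
    if type_indices.len() != bodies.len() {
        return None;
    }
    type_indices
        .iter()
        .zip(bodies)
        .map(|(&idx, &body)| {
            let signature = *types.get(idx as usize)?;
            Some(Func { signature, body })
        })
        .collect()
}
```

Several functions can share one signature, which is presumably why the indirection exists: the Type section stores each distinct signature once.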

Import Section

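Types in WASM

They are defined in the spec, and they are very simple.

Byte
Integer - whole numbers
Float - floating-point numbers
Vector - not an array, but a 128-bit chunk passed to SIMD instructions
Name - a name (a string)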

Byte

Needless to say, this can be represented as a number from 0 to 255.
In Rust, that is u8.

Integer

I get that there are unsigned and signed variants.

The definition says iN ::= uN, yet I do not quite follow the sentence below.

The class iN defines uninterpreted integers, whose signedness interpretation can vary depending on context.

From the Note, N can apparently be 8, 16, 32, or 64, but in practice it seems safe to assume 32 or 64. For 8 and 16, the only options appear to be i8 or i16.

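Binary representation

All integers are encoded using the LEB128 variable-length integer encoding

This was the first time I had come across LEB128, a variable-length byte encoding, and it has something in common with the tricks of our predecessors in the Famicom era who wanted to "shave off even a single byte". In Rust, with a LEB128 crate it is nothing to be afraid of.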
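
For reference, here is what the decoding looks like when written by hand. This is a simplified sketch, not the API of any particular crate: it decodes into 64 bits, allows up to 10 bytes, and does not enforce the spec's stricter per-width limits. Both function names are my own.

```rust
/// Decodes an unsigned LEB128 value; returns (value, bytes consumed).
fn decode_uleb128(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut result = 0u64;
    for (i, &byte) in bytes.iter().enumerate().take(10) {
        // Each byte carries 7 payload bits, least significant group first.
        result |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((result, i + 1));
        }
    }
    None // truncated input or too many continuation bytes
}

/// Decodes a signed LEB128 value; returns (value, bytes consumed).
fn decode_sleb128(bytes: &[u8]) -> Option<(i64, usize)> {
    let mut result = 0i64;
    for (i, &byte) in bytes.iter().enumerate().take(10) {
        let shift = 7 * i;
        result |= i64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            // Bit 0x40 of the last byte is the sign bit: extend it upward.
            if byte & 0x40 != 0 && shift + 7 < 64 {
                result |= -1i64 << (shift + 7);
            }
            return Some((result, i + 1));
        }
    }
    None
}
```

The high bit of each byte says "more bytes follow", so small numbers take a single byte, which is where the space saving comes from.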

Float

Vector

Numeric vectors are 128-bit values that are processed by vector instructions (also known as SIMD instructions, single instruction multiple data). They are represented in the abstract syntax using i128. The interpretation of lane types (integer or floating-point numbers) and lane sizes are determined by the specific instruction operating on them.

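I had assumed this was a Vector in the array sense, but it is something different.
It is a Vector in the SIMD sense: in the example in the figure below, the input (5, 9, 2, 8) is processed all at once rather than sequentially as 5 -> 9 -> 2 -> 8.

That example uses 32 bits, but a WASM Vector is 128 bits, so it seems to be used as 32 bits x 4, 64 bits x 2, and so on.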
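
To convince myself that the same 128 bits can be viewed as different lanes, here is a small sketch using a plain u128 as the stand-in for a v128 value. The function names are my own, and I am assuming lane 0 sits in the least significant bits. Real SIMD hardware does the lane-wise work in one instruction; this just emulates the result.

```rust
/// Views the 128 bits as four 32-bit lanes (lane 0 = lowest bits).
fn i32x4_lanes(v: u128) -> [i32; 4] {
    std::array::from_fn(|i| (v >> (32 * i)) as u32 as i32)
}

/// Views the same 128 bits as two 64-bit lanes.
fn i64x2_lanes(v: u128) -> [i64; 2] {
    [v as u64 as i64, (v >> 64) as u64 as i64]
}

/// Packs four 32-bit lanes back into 128 bits.
fn from_i32x4(lanes: [i32; 4]) -> u128 {
    lanes
        .iter()
        .enumerate()
        .fold(0, |acc, (i, &lane)| acc | (u128::from(lane as u32) << (32 * i)))
}

/// Lane-wise wrapping addition: each lane is added independently,
/// so a carry never spills into the neighbouring lane.
fn i32x4_add(a: u128, b: u128) -> u128 {
    let (la, lb) = (i32x4_lanes(a), i32x4_lanes(b));
    from_i32x4(std::array::from_fn(|i| la[i].wrapping_add(lb[i])))
}
```

With the (5, 9, 2, 8) example packed into one value, adding (1, 1, 1, 1) gives (6, 10, 3, 9) in a single lane-wise operation, while reading the same bits as two 64-bit lanes gives completely different numbers.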

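Binary representation

I do not know why this is tucked away on this page.

First, the number of elements is given as a u32 (LEB128)
Then that many elements follow

Note that this layout is the binary format's vec(B), a sequence in the array sense, not the 128-bit SIMD value described above.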

Name

Names are sequences of characters, which are scalar values as defined by Unicode (Section 2.4).

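Well, it seems fine to read this as a UTF-8 string.

Binary representation

The definition is written in a convoluted way, but it just spells out that UTF-8 is a variable-length encoding of 1 to 4 bytes depending on the code point.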
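
In Rust the standard library already does this validation, so a name's bytes can simply be handed to `std::str::from_utf8`. A minimal sketch (both function names are my own, and the length prefix is assumed to have been read already):

```rust
/// Interprets the raw bytes of a name as UTF-8; None if they are malformed.
fn decode_name_bytes(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// Number of bytes each character of `s` occupies in UTF-8 (1 to 4).
fn utf8_lengths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

For example, the four characters of "aé乳🦀" occupy 1, 2, 3, and 4 bytes respectively, which covers every case the spec's definition enumerates.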

Types

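I do not yet really understand the difference between the types defined under Values and the types defined here.
I will probably figure it out as I write...

They are summarized as a compact table in the Appendix.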

Number Types

Integer or floating point × 32-bit or 64-bit: I get that these combinations give the four types above.

Integers are not inherently signed or unsigned, their interpretation is determined by individual operations.

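As far as I can see, no Instruction takes an explicitly typed operand such as u32 or s64, so when the leftmost bit is 1, whether to read it as

a two's complement value (that is, negative), or
a positive integer

really is a matter of interpretation (it depends on the context, meaning the individual operation).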
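
A small sketch of what "determined by individual operations" means in practice: the value is just 32 bits, and paired instructions such as i32.div_s / i32.div_u or i32.lt_s / i32.lt_u choose the reading. The Rust functions below are my own stand-ins that mimic those instructions, with None playing the role of a trap.

```rust
/// Signed division: both operands are read as two's complement.
/// None stands in for a trap (division by zero, or i32::MIN / -1 overflow).
fn div_s(a: u32, b: u32) -> Option<u32> {
    (a as i32).checked_div(b as i32).map(|v| v as u32)
}

/// Unsigned division: the same bits are read as non-negative integers.
fn div_u(a: u32, b: u32) -> Option<u32> {
    a.checked_div(b)
}

/// Signed less-than comparison.
fn lt_s(a: u32, b: u32) -> bool {
    (a as i32) < (b as i32)
}

/// Unsigned less-than comparison.
fn lt_u(a: u32, b: u32) -> bool {
    a < b
}
```

Take the bit pattern 0xFFFF_FFFE: read as signed it is -2, so it is less than 1 and dividing by 2 gives -1 (0xFFFF_FFFF), whereas read as unsigned it is 4294967294, which is greater than 1 and halves to 0x7FFF_FFFF. Operations like addition need no such pair because two's complement gives the same bits either way.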

Vector Types

The type v128 corresponds to a 128 bit vector of packed integer or floating-point data. The packed data can be interpreted as signed or unsigned integers, single or double precision floating-point values, or a single 128 bit type.

As noted earlier under Values, this describes Vector in the SIMD sense.

Reference Types

Either funcref or externref.

The type funcref denotes the infinite union of all references to functions, regardless of their function types.

The type externref denotes the infinite union of all references to objects owned by the embedder and that can be passed into WebAssembly under this type.

Value Types

One of the types described above:

Number
Vector
Reference
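
As a data structure, this "one of three families" shape maps naturally onto nested enums. A minimal sketch follows; the enum and function names are my own, and the byte values are the ones the binary format assigns to each value type.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NumType { I32, I64, F32, F64 }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RefType { FuncRef, ExternRef }

/// A value type is a number type, the vector type, or a reference type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ValType {
    Num(NumType),
    V128,
    Ref(RefType),
}

/// Maps the single-byte binary encoding to a value type.
fn val_type_from_byte(b: u8) -> Option<ValType> {
    match b {
        0x7F => Some(ValType::Num(NumType::I32)),
        0x7E => Some(ValType::Num(NumType::I64)),
        0x7D => Some(ValType::Num(NumType::F32)),
        0x7C => Some(ValType::Num(NumType::F64)),
        0x7B => Some(ValType::V128),
        0x70 => Some(ValType::Ref(RefType::FuncRef)),
        0x6F => Some(ValType::Ref(RefType::ExternRef)),
        _ => None,
    }
}
```

The 7f and 7e bytes in the type section dump earlier are exactly these encodings for i32 and i64.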

Result Types

Result types classify the result of executing instructions or functions, which is a sequence of values, written with brackets.

In short, it is multiple ValTypes lined up like an array.

Function Types

Function types classify the signature of functions, mapping a vector of parameters to a vector of results. They are also used to classify the inputs and outputs of instructions.

Meaning something like f(x: T1, y: T2, z: T3) -> (a: T4, b: T5, c: T6).