Introduction to x86-64 Machine Code: AVX/AVX-512

This article is a sequel to Introduction to x86-64 Machine Code. I will introduce SSE/AVX/AVX-512 and the associated VEX and EVEX prefixes.

SSE

To handle single-precision and double-precision floating-point numbers on x86-64, the instructions and registers known as SSE (Streaming SIMD Extensions) are used. SSE started out as an optional instruction set extension, but SSE2 is always available on x86-64, so you can treat it as a standard feature.

In SSE, sixteen 128-bit wide registers (eight in 32-bit mode) are available. Regarding data types, the original SSE supports single-precision floating-point numbers (32-bit width), while SSE2 and later support double-precision floating-point numbers (64-bit width) and integers.

The 128-bit wide SSE registers are called XMM registers and are numbered xmm0, xmm1, ..., xmm15.

SSE is used not only for SIMD operations but also for scalar operations (as a replacement for the x87 FPU).

SIMD Prefix

Let's check the opcodes for ADD{S,P}{S,D}, which are representative SSE instructions for floating-point addition. By the way, the last character gives the precision: 'S' is single precision and 'D' is double precision. The second-to-last character gives the shape: 'S' is scalar (one element) and 'P' is packed (vector).

  • ADDSS: F3 0F 58
  • ADDSD: F2 0F 58
  • ADDPS: NP 0F 58
  • ADDPD: 66 0F 58

The first thing to notice is that the last two bytes are always 0F 58. Among these, 0F means "the first byte of a 2-byte opcode" (explained later).

The first byte varies, such as 66, F2, or F3. These are prefixes. 66 was mentioned in the previous article as an operand-size prefix. When used for SIMD instructions, 66/F2/F3 are specifically called SIMD prefixes.

ADDPS is written with "NP" at the beginning. This stands for "No Prefix". In other words, when no prefix is added to the opcode 0F 58, it is interpreted as ADDPS, and adding a prefix switches the precision or scalar/vector mode.
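The relationship between the four mnemonics and their SIMD prefixes can be written down as a small lookup. This is only a sketch in Python; the names SIMD_PREFIX and sse_add_opcode are my own and do not come from any library.

```python
# SIMD prefix for each ADD variant; None stands for "NP" (no prefix).
SIMD_PREFIX = {"ADDPS": None, "ADDPD": 0x66, "ADDSS": 0xF3, "ADDSD": 0xF2}


def sse_add_opcode(mnemonic):
    """Return the prefix and opcode bytes (without ModR/M) for ADD{S,P}{S,D}."""
    prefix = SIMD_PREFIX[mnemonic.upper()]
    head = b"" if prefix is None else bytes([prefix])
    # Every variant shares the 2-byte opcode 0F 58.
    return head + bytes([0x0F, 0x58])


for name in ("ADDSS", "ADDSD", "ADDPS", "ADDPD"):
    print(name, sse_add_opcode(name).hex(" ").upper())
```

The loop prints the same four byte sequences as the list above.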

2-byte/3-byte Opcodes and Opcode Space

x86 machine code instructions are a whole number of bytes long, and the first byte or first few bytes are prefixes or opcodes. For decoding to be unambiguous, no two prefixes or opcodes may share the same first byte, which leaves only 256 possible values for that byte.

Which instruction each "first byte" represents is documented in Appendix A, One-byte Opcode Map, of Intel SDM Vol. 2.

Since there are more than 256 types of x86 instructions, some instruction opcodes are represented using two or more bytes. The first byte of such opcodes is 0F. Opcode 0F is called a 2-byte escape. For instructions with 3-byte opcodes, 0F is followed by 38 or 3A, and then by the specific opcode for that instruction.

As explained in the previous article, there are cases where the opcode occupies fewer than 8 bits or where part of the ModR/M byte is also used. Setting those aside, an opcode generally takes one of the following four forms, so opcodes can be enumerated across four tables (or maps):

  • <1 byte>
  • 0F <1 byte>
  • 0F 38 <1 byte>
  • 0F 3A <1 byte>

Intel APX documentation uses the following terminology:

  • legacy map 0: Instructions representable in 1 byte without using an escape
  • legacy map 1: Instructions representable in 1 byte following 0F
  • legacy map 2: Instructions representable in 1 byte following 0F 38
  • legacy map 3: Instructions representable in 1 byte following 0F 3A
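The four legacy maps can be told apart mechanically from the escape bytes. The following is a minimal sketch in Python; the function name legacy_opcode_map is my own, and it assumes that all prefixes have already been stripped and that the instruction is not VEX- or EVEX-encoded.

```python
def legacy_opcode_map(opcode):
    """Return (map number, final opcode byte) for a legacy opcode.

    `opcode` is a bytes object that starts at the first opcode byte,
    i.e. any prefixes have already been removed (an assumption of this sketch).
    """
    if opcode[0] != 0x0F:
        return 0, opcode[0]      # <1 byte>
    if opcode[1] == 0x38:
        return 2, opcode[2]      # 0F 38 <1 byte>
    if opcode[1] == 0x3A:
        return 3, opcode[2]      # 0F 3A <1 byte>
    return 1, opcode[1]          # 0F <1 byte>


print(legacy_opcode_map(bytes([0x0F, 0x58])))        # ADDPS lives in map 1
print(legacy_opcode_map(bytes([0x0F, 0x38, 0x00])))  # PSHUFB lives in map 2
```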

Using the EVEX prefix allows access to map 4 and above. These maps are used by newer instruction set extensions such as AVX512-FP16 and APX:

  • map 4: Used in APX
  • map 5: Used in AVX512-FP16
  • map 6: Used in AVX512-FP16
  • map 7: Used in APX

AVX

In AVX (Advanced Vector Extensions), the width of the vector registers introduced in SSE was extended to 256 bits. When referring to vector registers as 256-bit wide, they are called YMM. XMM became the lower 128 bits of YMM.

In terms of instructions, a 3-operand format became available. Previously, one of the input registers had to be overwritten with the result, as in C's x += y, but now you can specify an output register different from the inputs, as in x = y + z.

What is important in terms of machine code is that AVX introduced a new prefix called the VEX prefix (vector extension prefix).

VEX Prefix

The REX prefix explained in the previous article provided the ability to access 16 registers for instructions or specify the width of operands.

The VEX prefix also has several functions. Specifically, they are as follows:

  • Selecting the vector width (256-bit or 128-bit)
  • Specifying an input register (separate from the output)
  • REX prefix functionality
  • SIMD prefix functionality
  • Compression of the opcode escape parts (0F / 0F 38 / 0F 3A)

Machine code for instructions using the VEX prefix consists of the following parts:

  1. Legacy prefixes (if any)
    • However, LOCK, 66, F2, F3, and REX cannot be used.
  2. VEX prefix (2 or 3 bytes)
  3. Opcode (1 byte)
  4. ModR/M (1 byte)
  5. SIB byte (if any; 0-1 byte)
  6. Displacement (if any; 0, 1, 2, or 4 bytes)
  7. Immediate (if any; 0-1 byte)

There are 2-byte and 3-byte formats for the VEX prefix. The first byte of the 2-byte format is C5, and the first byte of the 3-byte format is C4.

2-byte format:

Byte 1 (0xC5):
1100 0101

Byte 2:
Rvvv vLpp
^\___/^\/
|  |  | +- SIMD prefix substitute
|  |  +--- Vector length
|  +------ Register number (bit-inverted)
+--------- REX.R (bit-inverted)

3-byte format:

Byte 1 (0xC4):
1100 0100

Byte 2:
RXBm mmmm
^^^\____/
|||   +--- Opcode map selection
||+------- REX.B (bit-inverted)
|+-------- REX.X (bit-inverted)
+--------- REX.R (bit-inverted)

Byte 3:
Wvvv vLpp
^\___/^\/
|  |  | +- SIMD prefix substitute
|  |  +--- Vector length
|  +------ Register number (bit-inverted)
+--------- Similar to REX.W

Since the 3-byte format fields include all the fields of the 2-byte format, any instruction that can be encoded in the 2-byte format can also be encoded in the 3-byte format.

The vvvv part specifies the register number (0 to 15) in bit-inverted form. For instructions that do not use it, 0b1111 is set. When referred to in instruction operand descriptions, it is called VEX.vvvv.

The equivalents of the R, X, and B fields that were in the REX prefix are specified in bit-inverted form within the VEX prefix.

L is the vector length, where 0 is 128-bit and 1 is 256-bit. In opcode descriptions, it is written as VEX.128 or VEX.256. If LIG is written instead of 128/256, it means the value of L is ignored. For LZ, L must be 0.

pp is a substitute for the SIMD prefix, where 0b00 corresponds to No Prefix, 0b01 to 66, 0b10 to F3, and 0b11 to F2.

mmmmm switches the opcode map. That is, it can represent 0F, 0F 38, or 0F 3A. 0b00001 represents 0F, 0b00010 represents 0F 38, and 0b00011 represents 0F 3A. Other values are reserved. In the case of the 2-byte format, 0F is always implied.

Since pp and mmmmm perform the roles of other prefixes and escapes, the increase in instruction length due to the adoption of the VEX prefix is mitigated.
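To make the field layout concrete, here is a sketch in Python that splits a VEX prefix into the fields described above. The function name decode_vex and the dictionary keys are my own choices; the inverted fields (R, X, B, vvvv) are returned already un-inverted.

```python
def decode_vex(prefix):
    """Split a 2-byte (C5) or 3-byte (C4) VEX prefix into its fields."""
    if prefix[0] == 0xC5:
        last = prefix[1]
        r = ((last >> 7) & 1) ^ 1
        x = b = w = 0            # not encodable in the 2-byte format
        mmmmm = 0b00001          # 0F is implied
    elif prefix[0] == 0xC4:
        second, last = prefix[1], prefix[2]
        r = ((second >> 7) & 1) ^ 1
        x = ((second >> 6) & 1) ^ 1
        b = ((second >> 5) & 1) ^ 1
        mmmmm = second & 0b11111
        w = (last >> 7) & 1
    else:
        raise ValueError("not a VEX prefix")
    return {
        "R": r, "X": x, "B": b, "W": w, "mmmmm": mmmmm,
        "vvvv": ~(last >> 3) & 0b1111,   # stored bit-inverted
        "L": (last >> 2) & 1,
        "pp": last & 0b11,
    }


print(decode_vex(bytes([0xC5, 0x64])))
```

For C5 64 this reports R=1, vvvv=3, L=1 and pp=0, that is, a 256-bit, no-prefix instruction in the 0F map whose second operand is register 3.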

A 128-bit instruction encoded with the VEX prefix zeroes the upper bits of its output register (the upper 128 bits when the register is 256 bits wide). The legacy SSE encoding of the same operation leaves those upper bits unchanged.

In individual instruction descriptions in the Intel SDM, VEX prefixes are written in the form VEX.{128,256,LIG,LZ}.{NP,66,F2,F3}.{0F,0F3A,0F38}.{W0,W1,WIG}. For example, for the 128-bit VADDPS instruction, it would be VEX.128.0F.WIG. This means L=0, pp=0b00, mmmmm=0b00001 (the 2-byte format is also possible), and W is ignored (WIG). For details, refer to Intel SDM Volume 2, Section 3.1.1.2.
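The notation can be translated into field values mechanically. The sketch below is my own illustration in Python (parse_vex_notation is a hypothetical helper, not an existing tool). It assumes that a missing SIMD prefix component means NP, and it uses None for "ignored" (LIG and WIG).

```python
# Field values as described in this article. None means "ignored".
LENGTH = {"128": 0, "256": 1, "LZ": 0, "LIG": None}
SIMD_PREFIX = {"NP": 0b00, "66": 0b01, "F3": 0b10, "F2": 0b11}
OPCODE_MAP = {"0F": 0b00001, "0F38": 0b00010, "0F3A": 0b00011}
W_BIT = {"W0": 0, "W1": 1, "WIG": None}


def parse_vex_notation(text):
    """Turn a string such as 'VEX.128.0F.WIG' into L / pp / mmmmm / W values."""
    parts = text.split(".")
    if parts[0] != "VEX":
        raise ValueError("expected a string starting with 'VEX.'")
    fields = {"L": None, "pp": 0b00, "mmmmm": None, "W": None}
    for part in parts[1:]:
        if part in LENGTH:
            fields["L"] = LENGTH[part]
        elif part in SIMD_PREFIX:
            fields["pp"] = SIMD_PREFIX[part]
        elif part in OPCODE_MAP:
            fields["mmmmm"] = OPCODE_MAP[part]
        elif part in W_BIT:
            fields["W"] = W_BIT[part]
        else:
            raise ValueError(f"unknown component: {part}")
    return fields


print(parse_vex_notation("VEX.128.0F.WIG"))
```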

AVX-512

In AVX-512, the vector register width has been extended to 512 bits. When referring to vector registers as 512-bit wide, they are called ZMM. YMM became the lower 256 bits of ZMM. The number of registers has also doubled from 16 to 32 (in 64-bit mode).

What is important in terms of machine code is that a new prefix called the EVEX (enhanced VEX) prefix was introduced. This allows access to the various added features.

The features are summarized as follows:

  • VEX prefix functionality
  • Extension of vector register width to 512 bits
  • Extension of the number of SIMD registers to 32
  • Masking
  • Features depending on the instruction type: embedded broadcast, static rounding mode specification, suppression of floating-point exception status flag manipulation

EVEX Prefix

The EVEX prefix is 4 bytes long. The first byte is fixed, and the remaining three bytes are packed with information.

Byte 1 (0x62):
0110 0010

Byte 2 (P0=P[7:0]):
RXBR' 0mmm
^^^^  ^\_/
||||  | +-- Opcode map selection
||||  +---- Reserved (In APX, B4 (ModR/M.r/m, 4th bit of SIB.base))
|||+------- 4th bit of ModR/M.reg (bit-inverted)
||+-------- Equivalent to REX.B (3rd bit of ModR/M.r/m, SIB.base) (bit-inverted)
|+--------- Equivalent to REX.X (3rd bit of SIB.index) (bit-inverted)
+---------- Equivalent to REX.R (3rd bit of ModR/M.reg) (bit-inverted)

Byte 3 (P1=P[15:8]):
Wvvv v1pp
^\___/^\/
|  |  | +- SIMD prefix substitute (same as VEX.pp)
|  |  +--- Fixed (In APX, bit-inverted 4th bit of SIB.index; in AVX10, used for 256-bit static rounding)
|  +------ Register number (same as VEX.vvvv) (bit-inverted)
+--------- Operand size promotion/Opcode extension

Byte 4 (P2=P[23:16]):
zL'Lb V'aaa
^\_/^ ^ \_/
| | | |  +-- opmask register specifier
| | | +----- Combined with vvvv (bit-inverted)
| | +------- broadcast/RC/SAE context
| +--------- Vector length/RC
+----------- Zeroing/Merging

The opcode maps are as mentioned before:

  • 1 (0b001): Corresponds to 0F
  • 2 (0b010): Corresponds to 0F38
  • 3 (0b011): Corresponds to 0F3A
  • 4 (0b100): Used in APX
  • 5 (0b101): Used in AVX512-FP16
  • 6 (0b110): Used in AVX512-FP16
  • 7 (0b111): Used in APX

Registers are specified as follows:

  • REG: A total of 5 bits from EVEX.R':EVEX.R:modrm.reg
  • VVVV: A total of 5 bits from EVEX.V':EVEX.vvvv
    • Why are the positions of V' and vvvv so far apart?
  • When SIB is not present:
    • RM: A total of 5 bits from EVEX.X:EVEX.B:modrm.r/m
  • When SIB is present:
    • BASE: A total of 4 bits from EVEX.B:modrm.r/m (one more bit is added in APX to make 5 bits)
    • INDEX: A total of 4 bits from EVEX.X:sib.index (one more bit is added in APX to make 5 bits)
    • VIDX: A total of 5 bits from EVEX.V':EVEX.X:sib.index
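For the register-only case (no SIB), the split of three 5-bit register numbers into their EVEX and ModR/M pieces can be sketched as follows. The function name and dictionary keys are my own, and the values are the logical bits before the bit inversion that the prefix applies.

```python
def split_evex_registers(reg, vvvv, rm):
    """Split register numbers (0..31) into the fields that carry them.

    Covers only the register-to-register form, where ModR/M.r/m names a
    vector register and EVEX.X supplies its top bit.
    """
    for n in (reg, vvvv, rm):
        if not 0 <= n < 32:
            raise ValueError("register number must be in 0..31")
    return {
        "R'": reg >> 4, "R": (reg >> 3) & 1, "modrm.reg": reg & 0b111,
        "V'": vvvv >> 4, "vvvv": vvvv & 0b1111,
        "X": rm >> 4, "B": (rm >> 3) & 1, "modrm.r/m": rm & 0b111,
    }


# Operands of `vaddps zmm15, zmm24, zmm3`
print(split_evex_registers(15, 24, 3))
```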

Regarding the role of EVEX.b: In floating-point instructions involving rounding where the operands are registers only, if EVEX.b=1, then static rounding specification and SAE (Suppress All Exceptions) become active. Rounding is as follows:

  • EVEX.RC=0b00: to nearest
  • EVEX.RC=0b01: downward
  • EVEX.RC=0b10: upward
  • EVEX.RC=0b11: toward zero

Since EVEX.L'L and EVEX.RC share the same physical bits, vector length cannot be specified (only scalar and 512-bit are possible). In the case of instructions that perform operations by reading from memory, EVEX.b controls broadcasting. In other instructions, EVEX.b must be 0.
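As a rough illustration of how static rounding reuses the L'L bits, the following Python sketch builds the last EVEX byte (P2) for a register-only instruction with a rounding override. The function name is hypothetical; the keys rn/rd/ru/rz mirror the {rn-sae}, {rd-sae}, {ru-sae} and {rz-sae} operand syntax, and the sketch always uses merging-masking (z=0).

```python
# EVEX.RC values from the list above.
ROUNDING = {"rn": 0b00, "rd": 0b01, "ru": 0b10, "rz": 0b11}


def evex_p2_with_rounding(mode, src1=0, opmask=0):
    """Build EVEX byte P2 (z L'L b V' aaa) with static rounding enabled.

    `src1` is the full 5-bit number of the register carried in V':vvvv;
    only its top bit ends up in this byte, stored bit-inverted.
    """
    v_prime_inverted = ((src1 >> 4) & 1) ^ 1
    return (
        (0 << 7)                   # z = 0: merging-masking
        | (ROUNDING[mode] << 5)    # RC occupies the L'L position
        | (1 << 4)                 # b = 1 activates RC and SAE
        | (v_prime_inverted << 3)
        | (opmask & 0b111)         # aaa
    )


print(hex(evex_p2_with_rounding("rz", src1=2)))
```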

Regarding EVEX.z: EVEX.z=0 corresponds to merging-masking, and EVEX.z=1 corresponds to zeroing-masking.

In individual instruction descriptions in the Intel SDM, EVEX prefixes are written in the form EVEX.{128,256,512,LLIG,LLZ}.{NP,66,F2,F3}.{0F,0F3A,0F38,MAP4,MAP5,MAP6,MAP7}.{W0,W1,WIG}. The details are as follows:

  • {128,256,512,LLIG,LLZ}: Represents the vector length. Encoded with EVEX.L'L (when EVEX.RC is not used). LLIG means the vector length is ignored (e.g., scalar instructions). LLZ means EVEX.L'L must be 0.
    • EVEX.L'L=0b00: 128-bit
    • EVEX.L'L=0b01: 256-bit
    • EVEX.L'L=0b10: 512-bit
    • EVEX.L'L=0b11: Reserved
  • {NP,66,F2,F3}: Equivalent to SIMD prefixes. Encoded with EVEX.pp.
    • EVEX.pp=0b00: No Prefix
    • EVEX.pp=0b01: 66
    • EVEX.pp=0b10: F3
    • EVEX.pp=0b11: F2
  • {0F,0F3A,0F38,MAP4,MAP5,MAP6,MAP7}: Represents the opcode map. Encoded with EVEX.mmm.
    • EVEX.mmm=0b001: 0F
    • EVEX.mmm=0b010: 0F38
    • EVEX.mmm=0b011: 0F3A
    • EVEX.mmm=0b100: MAP4
    • EVEX.mmm=0b101: MAP5
    • EVEX.mmm=0b110: MAP6
    • EVEX.mmm=0b111: MAP7
  • {W0,W1,WIG}: Represents the value of EVEX.W. W0 is EVEX.W=0, W1 is EVEX.W=1, and WIG means EVEX.W is ignored.

For example, for a 512-bit VADDPS instruction, it would be EVEX.512.0F.W0. This means L'L=0b10, pp=0b00, mmm=0b001, and W=0. For details, refer to Intel SDM Volume 2, Section 3.1.1.2.
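The byte layout of the EVEX prefix can be checked with a small decoder. This is a sketch in Python under the layout shown in this article; decode_evex and its dictionary keys are names I made up, and the inverted fields are returned un-inverted.

```python
def decode_evex(prefix):
    """Split a 4-byte EVEX prefix (62 P0 P1 P2) into its fields."""
    if len(prefix) != 4 or prefix[0] != 0x62:
        raise ValueError("not an EVEX prefix")
    p0, p1, p2 = prefix[1], prefix[2], prefix[3]

    def inverted_bit(byte, position):
        return ((byte >> position) & 1) ^ 1

    return {
        "R": inverted_bit(p0, 7), "X": inverted_bit(p0, 6),
        "B": inverted_bit(p0, 5), "R'": inverted_bit(p0, 4),
        "mmm": p0 & 0b111,
        "W": p1 >> 7,
        "vvvv": ~(p1 >> 3) & 0b1111,
        "pp": p1 & 0b11,
        "z": p2 >> 7,
        "L'L": (p2 >> 5) & 0b11,
        "b": (p2 >> 4) & 1,
        "V'": inverted_bit(p2, 3),
        "aaa": p2 & 0b111,
    }


fields = decode_evex(bytes([0x62, 0x71, 0x3C, 0x40]))
print(fields["L'L"], fields["pp"], fields["mmm"], fields["W"])
```

The prefix 62 71 3C 40 decodes to L'L=0b10, pp=0b00, mmm=0b001 and W=0, which matches EVEX.512.0F.W0.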

Instruction Examples

ADDPS Instruction

Let's look at the ADDPS/VADDPS instructions, which perform the addition of vectors consisting of single-precision floating-point numbers.

ADDPS -- Add Packed Single-Precision Floating-Point Values

  • NP 0F 58 /r: ADDPS xmm1, xmm2/m128
    • Op/En: A, 64/32 bit Mode Support: V/V, CPUID Feature Flag: SSE
    • Add packed single-precision floating-point values from xmm2/m128 to xmm1 and store result in xmm1.
  • VEX.128.0F.WIG 58 /r: VADDPS xmm1, xmm2, xmm3/m128
    • Op/En: B, 64/32 bit Mode Support: V/V, CPUID Feature Flag: AVX
    • Add packed single-precision floating-point values from xmm3/m128 to xmm2 and store result in xmm1.
  • VEX.256.0F.WIG 58 /r: VADDPS ymm1, ymm2, ymm3/m256
    • Op/En: B, 64/32 bit Mode Support: V/V, CPUID Feature Flag: AVX
    • Add packed single-precision floating-point values from ymm3/m256 to ymm2 and store result in ymm1.
  • EVEX.128.0F.W0 58 /r: VADDPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
    • Op/En: C, 64/32 bit Mode Support: V/V, CPUID Feature Flag: AVX512VL, AVX512F
    • Add packed single-precision floating-point values from xmm3/m128/m32bcst to xmm2 and store result in xmm1 with writemask k1.
  • EVEX.256.0F.W0 58 /r: VADDPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
    • Op/En: C, 64/32 bit Mode Support: V/V, CPUID Feature Flag: AVX512VL, AVX512F
    • Add packed single-precision floating-point values from ymm3/m256/m32bcst to ymm2 and store result in ymm1 with writemask k1.
  • EVEX.512.0F.W0 58 /r: VADDPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst {er}
    • Op/En: C, 64/32 bit Mode Support: V/V, CPUID Feature Flag: AVX512F
    • Add packed single-precision floating-point values from zmm3/m512/m32bcst to zmm2 and store result in zmm1 with writemask k1.

Instruction Operand Encoding

Op/En | Tuple Type | Operand 1 | Operand 2 | Operand 3 | Operand 4
A | NA | ModRM:reg (r, w) | ModRM:r/m (r) | NA | NA
B | NA | ModRM:reg (w) | VEX.vvvv (r) | ModRM:r/m (r) | NA
C | Full | ModRM:reg (w) | EVEX.vvvv (r) | ModRM:r/m (r) | NA

First, let's look at the machine code for the basic SSE form. NP means No Prefix, 0F 58 is a 2-byte opcode, and /r means that the lower 3 bits of the register number are stored in the reg/opcode field of ModR/M.

For example, addps xmm3, xmm5 can be encoded with the 3 bytes 0F 58 DD:

opcode:
0x0F
0x58

ModR/M:
0b11 011 101 = 0xDD
     \_/ \_/
      |   +-- r/m (operand 2)
      +------ reg (operand 1)
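Going the other way, a ModR/M byte can be split back into its three fields with shifts and masks. A minimal Python sketch (decode_modrm is my own name):

```python
def decode_modrm(byte):
    """Split a ModR/M byte into (mod, reg, rm)."""
    return byte >> 6, (byte >> 3) & 0b111, byte & 0b111


mod, reg, rm = decode_modrm(0xDD)
print(mod, reg, rm)  # 3 3 5: register-direct, reg = xmm3, r/m = xmm5
```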

As another example, addps xmm3, xmm10 can be encoded with the 4 bytes 41 0F 58 DA:

REX.B:
0b0100 0001 = 0x41
       ^^^^
       |||+- B (Extension for operand 2)
       ||+-- X (Extension for index)
       |+--- R (Extension for operand 1)
       +---- W

opcode:
0x0F
0x58

ModR/M:
0b11 011 010 = 0xDA
     \_/ \_/
      |   +-- r/m (operand 2)
      +------ reg (operand 1)
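Both SSE examples can be reproduced with a few lines of Python. This is only a sketch for the register-to-register form; encode_addps is a hypothetical helper, and it emits a REX prefix only when one of the registers is xmm8 or higher.

```python
def encode_addps(dst, src):
    """Encode `addps xmm<dst>, xmm<src>` for xmm0..xmm15 (register form only)."""
    if not (0 <= dst < 16 and 0 <= src < 16):
        raise ValueError("legacy SSE encoding only reaches xmm0..xmm15")
    rex_r = dst >> 3   # extends ModR/M.reg (operand 1)
    rex_b = src >> 3   # extends ModR/M.r/m (operand 2)
    out = bytearray()
    if rex_r or rex_b:
        out.append(0x40 | (rex_r << 2) | rex_b)        # 0100 W R X B, with W = X = 0
    out += bytes([0x0F, 0x58])                         # opcode
    out.append(0xC0 | ((dst & 7) << 3) | (src & 7))    # ModR/M with mod = 0b11
    return bytes(out)


print(encode_addps(3, 5).hex(" ").upper())
print(encode_addps(3, 10).hex(" ").upper())
```

The two calls print 0F 58 DD and 41 0F 58 DA, the same bytes as in the walkthrough above.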

Next, let's look at the VEX.128 version. The mnemonic in assembly language becomes vaddps. The difference from the SSE version is the number of operands and whether the upper bits are cleared.

For example, vaddps xmm3, xmm5, xmm10 is encoded as C4 C1 50 58 DA. Since it's necessary to specify VEX.B, we used the 3-byte format of the VEX prefix instead of the 2-byte format.

VEX (3-byte):
0xC4

0b110 00001 = 0xC1
  ^^^ \___/
  |||   +---- mmmmm: 0b00001=0F
  ||+-------- B (Extension for operand 3) (bit-inverted)
  |+--------- X (bit-inverted)
  +---------- R (Extension for operand 1) (bit-inverted)

0b0 1010 0 00 = 0x50
  ^ \__/ ^ \/
  |   |  | +-- pp: 00=No Prefix
  |   |  +---- L: 0=128-bit
  |   +------- vvvv (operand 2) (bit-inverted)
  +----------- W

opcode:
0x58

ModR/M:
0b11 011 010 = 0xDA
     \_/ \_/
      |   +-- r/m (operand 3)
      +------ reg (operand 1)

The VEX.256 version uses ymm as operands in assembly language. For example, vaddps ymm15, ymm3, ymm0 can be encoded as C5 64 58 F8:

VEX (2-byte):
0xC5

0b0 1100 1 00 = 0x64
  ^ \__/ ^ \/
  |   |  | +-- pp: 00=No Prefix
  |   |  +---- L: 1=256-bit
  |   +------- vvvv (operand 2) (bit-inverted)
  +----------- R (Extension for operand 1) (bit-inverted)

opcode:
0x58

ModR/M:
0b11 111 000 = 0xF8
     \_/ \_/
      |   +-- r/m (operand 3)
      +------ reg (operand 1)
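The two VEX examples differ only in which prefix format is needed. The sketch below, in Python, picks the 2-byte format whenever the third operand is below register 8 and falls back to the 3-byte format otherwise. encode_vaddps_vex is my own helper name, and only the register-to-register form is handled.

```python
def encode_vaddps_vex(dst, src1, src2, bits=128):
    """Encode `vaddps dst, src1, src2` with a VEX prefix (registers 0..15)."""
    if bits not in (128, 256):
        raise ValueError("VEX supports 128-bit and 256-bit vectors only")
    if not all(0 <= r < 16 for r in (dst, src1, src2)):
        raise ValueError("VEX only reaches registers 0..15")
    r_inv = ((dst >> 3) & 1) ^ 1     # extends ModR/M.reg
    b_inv = ((src2 >> 3) & 1) ^ 1    # extends ModR/M.r/m
    x_inv = 1                        # no index register in this form
    length = 1 if bits == 256 else 0
    # vvvv (inverted), L, pp = 0b00 (No Prefix)
    tail = ((~src1 & 0b1111) << 3) | (length << 2) | 0b00
    if b_inv == 1:
        # B is not needed, W = 0 and the map is 0F: the 2-byte format suffices.
        prefix = bytes([0xC5, (r_inv << 7) | tail])
    else:
        second = (r_inv << 7) | (x_inv << 6) | (b_inv << 5) | 0b00001
        prefix = bytes([0xC4, second, tail])   # W = 0 in the top bit of tail
    modrm = 0xC0 | ((dst & 7) << 3) | (src2 & 7)
    return prefix + bytes([0x58, modrm])


print(encode_vaddps_vex(3, 5, 10).hex(" ").upper())
print(encode_vaddps_vex(15, 3, 0, bits=256).hex(" ").upper())
```

These calls print C4 C1 50 58 DA and C5 64 58 F8 respectively.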

The EVEX.512 version uses zmm as operands in assembly language. For example, vaddps zmm15, zmm24, zmm3 can be encoded as 62 71 3C 40 58 FB:

EVEX:
0x62

0b0111 0 001 = 0x71
  ^^^^ ^ \_/
  |||| |  +-- Opcode map selection: 0b001=0F
  |||| +----- Reserved
  |||+------- R': 4th bit of operand 1 (bit-inverted)
  ||+-------- B: 3rd bit of operand 3 (bit-inverted)
  |+--------- X: 4th bit of operand 3 (bit-inverted)
  +---------- R: 3rd bit of operand 1 (bit-inverted)

0b0 0111 1 00 = 0x3C
  ^ \__/ ^ \/
  |   |  |  +- pp: 00=No Prefix
  |   |  +---- Fixed
  |   +------- vvvv (operand 2) (bit-inverted)
  +----------- W: W0

0b0 10 0 0 000 = 0x40
  ^ \/ ^ ^ \_/
  |  | | |  +-- opmask register specifier
  |  | | +----- V': 4th bit of operand 2 (bit-inverted)
  |  | +------- b: broadcast/RC/SAE context
  |  +--------- L'L/RC: Vector length=512-bit
  +------------ Zeroing/Merging

opcode:
0x58

ModR/M:
0b11 111 011 = 0xFB
     \_/ \_/
      |   +-- r/m (operand 3)
      +------ reg (operand 1)
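The same steps can be written as a small encoder for the register-to-register EVEX form. This Python sketch uses a hypothetical helper name, leaves masking, broadcast and rounding disabled, and assumes the 128-bit and 256-bit lengths are available (they require AVX512VL).

```python
VECTOR_LENGTH = {128: 0b00, 256: 0b01, 512: 0b10}   # EVEX.L'L


def encode_vaddps_evex(dst, src1, src2, bits=512):
    """Encode `vaddps dst, src1, src2` with an EVEX prefix (registers 0..31)."""
    if not all(0 <= r < 32 for r in (dst, src1, src2)):
        raise ValueError("register number must be in 0..31")

    def inv(bit):
        return bit ^ 1

    p0 = (
        (inv((dst >> 3) & 1) << 7)      # R
        | (inv((src2 >> 4) & 1) << 6)   # X: bit 4 of the r/m register
        | (inv((src2 >> 3) & 1) << 5)   # B
        | (inv((dst >> 4) & 1) << 4)    # R'
        | 0b001                         # map 1 (0F)
    )
    p1 = (0 << 7) | ((~src1 & 0b1111) << 3) | (1 << 2) | 0b00   # W0, vvvv, fixed 1, pp
    p2 = (
        (VECTOR_LENGTH[bits] << 5)      # L'L (z = 0, b = 0)
        | (inv((src1 >> 4) & 1) << 3)   # V'
        | 0b000                         # aaa = 0: no opmask
    )
    modrm = 0xC0 | ((dst & 7) << 3) | (src2 & 7)
    return bytes([0x62, p0, p1, p2, 0x58, modrm])


print(encode_vaddps_evex(15, 24, 3).hex(" ").upper())
```

The call prints 62 71 3C 40 58 FB, matching the walkthrough above.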

Let's also look at an example using broadcast. vaddps ymm13, ymm30, [r12] {1to8} can be encoded as 62 51 0C 30 58 2C 24:

EVEX:
0x62

0b0101 0 001 = 0x51
  ^^^^ ^ \_/
  |||| |  +-- Opcode map selection: 0b001=0F
  |||| +----- Reserved
  |||+------- R': 4th bit of operand 1 (bit-inverted)
  ||+-------- B: 3rd bit of base (bit-inverted)
  |+--------- X: 3rd bit of index (bit-inverted; no index here)
  +---------- R: 3rd bit of operand 1 (bit-inverted)

0b0 0001 1 00 = 0x0C
  ^ \__/ ^ \/
  |   |  |  +- pp: 00=No Prefix
  |   |  +---- Fixed
  |   +------- vvvv (operand 2) (bit-inverted)
  +----------- W: W0

0b0 01 1 0 000 = 0x30
  ^ \/ ^ ^ \_/
  |  | | |  +-- opmask register specifier
  |  | | +----- V': 4th bit of operand 2 (bit-inverted)
  |  | +------- b: broadcast
  |  +--------- L'L/RC: Vector length=256-bit
  +------------ Zeroing/Merging

opcode:
0x58

ModR/M:
0b00 101 100 = 0x2C
     \_/ \_/
      |   +-- r/m (operand 3)
      +------ reg (operand 1)

SIB:
0b00 100 100 = 0x24
  \/ \_/ \_/
   |  |   +-- base
   |  +------ index
   +--------- scale: arbitrary
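The broadcast form can be sketched in the same way. The Python below handles only a bare base register with no index and no displacement. The helper name is hypothetical, and bases whose low three bits are 0b101 (rbp, r13) are rejected because they need a displacement, which this sketch does not emit.

```python
VECTOR_LENGTH = {128: 0b00, 256: 0b01, 512: 0b10}   # EVEX.L'L


def encode_vaddps_evex_bcst(dst, src1, base, bits=256):
    """Encode `vaddps dst, src1, [base] {1toN}` (base is a GPR number 0..15)."""
    if not (0 <= dst < 32 and 0 <= src1 < 32 and 0 <= base < 16):
        raise ValueError("operand out of range for this sketch")
    if base & 7 == 0b101:
        raise ValueError("rbp/r13 as base need a displacement (not handled)")

    def inv(bit):
        return bit ^ 1

    p0 = (
        (inv((dst >> 3) & 1) << 7)      # R
        | (1 << 6)                      # X inverted = 1: no index register
        | (inv((base >> 3) & 1) << 5)   # B: bit 3 of the base register
        | (inv((dst >> 4) & 1) << 4)    # R'
        | 0b001                         # map 1 (0F)
    )
    p1 = ((~src1 & 0b1111) << 3) | (1 << 2) | 0b00   # W0, vvvv, fixed 1, pp
    p2 = (VECTOR_LENGTH[bits] << 5) | (1 << 4) | (inv((src1 >> 4) & 1) << 3)  # b = 1
    reg_field = (dst & 7) << 3
    if base & 7 == 0b100:
        # rsp/r12: r/m = 0b100 means "SIB follows"; SIB = scale 00, no index, base 100
        tail = bytes([reg_field | 0b100, 0x24])
    else:
        tail = bytes([reg_field | (base & 7)])       # mod = 0b00, r/m = base
    return bytes([0x62, p0, p1, p2, 0x58]) + tail


print(encode_vaddps_evex_bcst(13, 30, 12).hex(" ").upper())
```

With dst=13, src1=30 and base=12 (r12) it prints 62 51 0C 30 58 2C 24, as in the example above.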

Even if you only use xmm, accessing xmm16 or later requires encoding using EVEX. Therefore, addps xmm16, xmm24 is an invalid instruction, and vaddps must be used.
