
Assembly Language for Software Developers Without Hardware Knowledge (Draft)


This article is a work in progress, so the content (including the title) will be rewritten from time to time. While it is in draft form, there are many rough parts regarding the accuracy of the content and the consistency of the document as a whole. Please be understanding.

Introduction

This article introduces common pitfalls that software developers tend to encounter when entering the low-level domain close to hardware, especially when they first encounter assembly language. The main target audience is those who started software development with application development using scripting languages such as JavaScript or Python and are unfamiliar with the layers below that.

I have always had the impression that while assembly language is not technically extremely difficult, a very large number of people stumble during their learning. I speculate that one of the main reasons for this is that the "common sense" of assembly language programming—which is designed to be easily interpreted by machines—is far removed from that of the high-level programming languages (hereafter referred to as "high-level languages") that everyone is usually accustomed to, which are designed to be user-friendly for humans.

The purpose of this article is not to explain the technical details of assembly language, but rather to explain things that are complete common sense to programmers who can code in assembly, even though they are entirely unfamiliar to modern application programmers. By doing so, I aim to reduce the factors that cause people to stumble when learning assembly language. For more in-depth information, please refer to the bibliography.

You Must Consider Two Types of Areas for Storing Data

In high-level languages, you use variables to store data and perform calculations. In the context of object-oriented programming languages, you can substitute class instances, objects, and so on. Variables can seemingly be created without limit[1]. Program execution proceeds by reading the value of a variable, computing with it, writing the result back to a variable, and so on. For simplicity, we will not consider storage devices such as SSDs here.

In the world of assembly language, common sense changes. First, in the assembly world, you need to think about two things for storing and calculating data: one is memory, and the other is registers.

| Name | Role | Capacity | Physical location |
| --- | --- | --- | --- |
| Register | Storing and calculating data | Several to dozens of integers; the x86_64 CPUs used in PCs have 16 | Inside the CPU |
| Memory | Storing data only[2] | Several GB to dozens of GB in modern PCs (hundreds of millions to billions of integers) | In an independent device (the memory) outside the CPU[3] |

In a very short program, all data might fit into registers, but in a practical program, data is normally stored in memory. When performing calculations, the data on memory is temporarily read into registers inside the CPU, calculated, and the results are written back to memory. This is the general flow.

You might think that if you just increased the number of registers and put all the data in them, it would solve the problem without this complexity. However, it's not that simple. Doing so is not realistic because it causes the following problems:

  • The CPU becomes physically enormous
  • Processing slows down
  • Power consumption and heat generation increase
  • Manufacturing costs increase

Instead, memory—hardware that can increase capacity much more cheaply than registers—is used because there is no other choice.

Methodology

Next, we will learn the very basics of assembly language while looking at simple, concrete assembly code. We will mainly deal with assembly language written for x86_64 Linux using the GNU Assembler (GAS). We will use Ubuntu 18.04 as the development environment.

The explanation will follow roughly this template:

  1. Look at C code[4], a high-level language, for a program that performs a certain task.
  2. Look at the code to achieve the same thing in x86_64 assembly language.

The reason for choosing x86_64 is that it has the most reference materials and is easy to set up a development environment for due to its high penetration rate. However, since x86_64 is one of the most complex architectures among well-known ones, it is not particularly easy to write assembly for. Therefore, I will add supplementary information, such as giving examples for other architectures, to make it easier to understand as needed.

x86_64 Register Names Are Hard to Remember

The number and names of registers vary depending on the architecture. For instance, the Arm64 architecture widely used in smartphones and tablets has 31 general-purpose registers used for integer arithmetic, named X0 to X30. The RISC-V architecture likewise has 31 general-purpose registers, named X1 to X31 (an X0 register also exists, but it is hardwired to zero).

Now, what about x86_64? x86_64 has 16 general-purpose registers. Their names are rax, rbx, rcx, rdx, rdi, rsi, rbp, rsp, r8, r9, r10, r11, r12, r13, r14, and r15. The names are so inconsistent it almost feels like a prank. However, there are complex historical reasons behind this naming. I will talk about them later, but for now, it is more constructive to just think "that's how it is" and memorize them quickly.

One important thing to note about the names is that the letter "r" at the beginning of each register name indicates that it is a 64-bit register. You might wonder whether registers with other bit widths exist; they do, but I will explain those in a separate section.

I mentioned that general-purpose registers are used for integer arithmetic, but they actually have other uses as well. As for which register is used for what, I will touch upon that in another section. Furthermore, besides general-purpose registers, there are various other types, such as registers for floating-point arithmetic and control registers used mainly by the OS, which I will also discuss in a separate section.

The Code Does Not Look Intuitive, and That Is Tough

Below is a simple addition instruction written in C.

a = a + b;

The Arm64 assembly language that does the same thing looks like this:

add x1, x1, x2

This is tough because, unlike C and C-like languages, it does not look like a mathematical formula. The instruction means: add the value of the second operand (the x1 register) and the value of the third operand (the x2 register), and put the result into the first operand (the x1 register). Once it is explained, you can follow it, but for those not used to assembly language there are "first-time traps," such as the absence of the familiar arithmetic symbols "+", "-", "*", and "/".
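To see the pattern once more, here is a sketch (register assignments chosen arbitrarily for illustration) of how a C statement with two additions splits into two Arm64 instructions:

```asm
// C: a = a + b + c   (a in x1, b in x2, c in x3 -- my own assignment)
add x1, x1, x2    // x1 = x1 + x2, i.e. a + b
add x1, x1, x3    // x1 = x1 + x3, i.e. (a + b) + c
```

Each instruction does exactly one addition; anything bigger than a single operation has to be spelled out step by step.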

Using the above code as an example, I will explain some terms to be used in the rest of this article.

  • Mnemonic: A string representation of machine language (composed only of numbers) that is easier for humans to read. In the example above, the "add x1, x1, x2" part.
  • Opcode: Indicates what kind of instruction it is. In the code above, the "add" part.
  • Operand: The target of the opcode's processing. In the code above, the three elements separated by ",".

If you compare it to a function in a high-level language, it's easy to think of the opcode as the function or method name, and the operands as the arguments. Let's just memorize these terms quickly. You don't need to memorize them all at once; just look back at this section if you get confused.

x86_64 Code Is Even Tougher

The previous point about code not being intuitive gets even worse with x86_64, as shown below.

addq %rbx, %rax

There are two points to criticize here: first, "What is 'q'?" and second, "Aren't there too few arguments?" First, regarding "q," it indicates that a 64-bit integer operation is being performed. As for the second point, you can only understand it once you know the definition of the x86_64 add instruction. That definition is: "Add the value of the first operand (%rbx) and the value of the second operand (%rax), and store the result in the second operand (%rax)." From the perspective of a high-level language user, it couldn't be harder to read, but since the CPU architecture is designed that way, there is no choice. Let's just give up and accept it.
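One practical consequence of this two-operand form: since add overwrites its second operand, computing something like `a = b + c` without destroying `b` requires a copy first. A sketch in AT&T syntax (the register assignment is my own choice for illustration):

```asm
# C: a = b + c   (b in %rbx, c in %rcx, result wanted in %rax)
movq %rbx, %rax    # copy b into rax first, so b itself survives
addq %rcx, %rax    # rax = rax + rcx, i.e. a = b + c
```

What a three-operand architecture does in one instruction, x86_64 typically does in two.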

Since "giving up and accepting it" isn't very helpful, let me explain one of the reasons for this. One benefit of having fewer operands is that the CPU circuitry can be simplified. If we trace it back, x86_64 started from a CPU called the 8086, which was a 16-bit architecture[5]. Presumably, at that time, there was a need to simplify the circuitry as much as possible, so the addition operation took this form (registers were 16-bit then), and since then, it has remained the same because new CPU architectures have been created while maintaining binary-level compatibility.

There may be multiple mnemonics even for the same architecture

You might think that there would be only one set of mnemonics for the same architecture, but that's not the case. A typical example is x86_64. As those who have already dabbled in x86_64 assembly language might know, there are two types of mnemonic syntax for this architecture. Their names are AT&T syntax and Intel syntax, respectively.

The "addq %rbx, %rax" introduced in the previous section is in AT&T syntax, and GAS adopts this syntax[6]. On the other hand, if you try to express the same thing in Intel syntax, it looks like "add rax, rbx". As the name suggests, Intel syntax is used in Intel's manuals. It is also used in the Netwide Assembler (NASM).

The most significant difference between the two syntaxes is that the order of the operands is reversed. Besides this, there are various small differences, such as the absence of the "q" at the end of the mnemonic and the absence of "%" before the register names in Intel syntax. For more details on the history of why two syntaxes exist and what specifically differs between them, the following article may be helpful. Please take a look if you are interested.
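Placed side by side, the same addition looks like this in the two syntaxes (note that even the comment character differs: "#" in GAS, ";" in NASM):

```asm
# AT&T syntax (GAS default): source comes first, destination last
addq %rbx, %rax        # rax = rax + rbx

; Intel syntax (Intel manuals, NASM): destination comes first
add rax, rbx           ; rax = rax + rbx
```

When reading code found online, the "%" before register names is a quick tell that you are looking at AT&T syntax.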

For this article, it is sufficient if you can achieve the following:

  • When you see the two notations, you can think "Oh right, there were two" without getting confused.
  • You can translate between the two notations as appropriate. Especially at first, just translate them based on feeling without being too rigid.

Mnemonics and Machine Code Are Not One-to-One

Even a single mnemonic does not necessarily correspond to a single machine-code encoding. In x86_64, for instance, there are more than 20 forms of the ADD instruction alone, and the assembler selects an encoding based on the operands.

Things in Assembly Source That Are Not Mnemonics

Extended mnemonics, macros

There Are Way Too Many Side Effects

The "first-time trap" here is that a calculation silently overwrites the flag register, or that a test or jmp suddenly appears right after a calculation and implicitly reads the flag register.
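A minimal sketch of the trap (AT&T syntax; the label name `equal_case` is made up for illustration):

```asm
subq %rbx, %rax     # rax = rax - rbx; as a side effect, the flag
                    # register is updated (ZF is set if the result is 0)
je   equal_case     # "jump if equal": implicitly reads the ZF left by subq
```

Nothing in the je line mentions the subtraction, yet the jump's behavior depends entirely on it. Any instruction inserted between the two that also touches the flags will silently change the outcome.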

You Will Die if You Suddenly Look at the Processor Manual

  • It is usually massive.
  • It is not something to read through, but something to use like a dictionary.
  • It's better to start by looking at reference sites or seeing the assembly source of simple programs compiled (using cc -S foo.c, etc.) and writing by imitation.

There Are Rules for Function Calls

  • If you don't know the conventions, you won't understand why registers that seem to contain nothing are suddenly being used.
  • Even if you try to Google it, you won't know the right search terms (like "calling convention").
  • Caller-save registers vs. callee-save registers.
  • The stack.
  • Frame pointers.
  • Even with the same architecture, the conventions vary depending on the OS, language, etc.
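As one concrete instance: under the System V AMD64 ABI used on x86_64 Linux, the first two integer arguments arrive in %rdi and %rsi, and the result is returned in %rax. A minimal sketch of a function obeying that convention:

```asm
# System V AMD64 ABI (x86_64 Linux): integer args in %rdi, %rsi, ...;
# the return value goes in %rax
add2:
    movq %rdi, %rax    # first argument into rax
    addq %rsi, %rax    # add the second argument; result stays in rax
    ret                # the caller reads the return value from rax
```

Without knowing the convention, nothing in the code explains why %rdi and %rsi "already" hold meaningful values at the top of the function.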

Rules for System Call Invocations

TBD

Optimization

  • The high-level language source and the compiled assembly language source differ completely depending on the compiler, its version, and options.
  • As an example, let's look at various results with cc -S foo.c.
  • The cause lies in differences in how things are written and the degree of optimization.
  • It is better to look at it with optimization turned off at first.
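A possible experiment to see the effect of optimization for yourself (assumes a Unix-like system with cc; file names are arbitrary):

```shell
cat > foo.c <<'EOF'
long sum(long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += i;
    return s;
}
EOF

cc -O0 -S foo.c -o foo_O0.s   # optimization off: output stays close to the source
cc -O2 -S foo.c -o foo_O2.s   # heavy optimization: output can look completely different

wc -l foo_O0.s foo_O2.s       # compare the sizes of the two listings
```

Diffing the two .s files makes it obvious why the -O0 output is the one to study first.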

Conclusion

TBD

References

TBD

Footnotes
  1. If you create too many, you will run out of memory and the program will be forcibly terminated. ↩︎

  2. Depending on the CPU architecture, there are instructions that use data on memory directly for calculations, but internally, the data is eventually placed in a register. ↩︎

  3. To be precise, physical hardware memory and the memory seen from a program are not the same thing. If you want to know more, search for "virtual memory." ↩︎

  4. If you ask whether C is truly a high-level language since it can handle memory directly, the answer is complicated, but let's assume so for now. ↩︎

  5. Strictly speaking, the 8-bit CPU 8080 existed even earlier, but I will skip that explanation as it is complicated. ↩︎

  6. To be precise, Intel syntax can also be used, but the default is strictly AT&T syntax. ↩︎

Discussion

Takuto

I came across this article just as I have recently become interested in x86-64 assembly and started studying it, so please allow me to comment.
Compared with learning a high-level language, I feel there is no good set of exercises to be found (I am imagining something like a "100-drill" workbook). Since knowledge does not stick for me unless I move my hands, I would personally be very happy if such a topic were added as the next step after reading the compiled output of high-level-language programs.
(I am not sure whether the Discussion section is the right place for this kind of comment; my apologies if it is not.)

sat

Thank you. I think writing opinions and requests here is a very good thing, both for me personally and for other readers.

As for exercises, the following book might be a good fit, for example:
x86-64 Assembly Language Programming with Ubuntu
http://www.egr.unlv.edu/~ed/assembly64.pdf

Takuto

Thank you for the reply.

I took a quick look inside the book you introduced, and the end-of-chapter exercises (plus the instruction-set summary in Appendix B) look very useful for studying!

Ryozo Okuda

In the spirit of "more haste, less speed," wouldn't it be better to start from something like the 6800?

sat

I will mainly use the widespread x86_64 and arm64, but I will do that if the need arises. Thank you for the comment.