WasmLinux: Bringing the Linux Kernel to WebAssembly

It would definitely be interesting to run the Linux kernel in a web browser, but the road to get there is quite difficult. As a first step, I compiled LKL (https://github.com/lkl/linux), a user-land version of the Linux kernel, into WebAssembly, converted it to C using wasm2c, and then compiled the result with Visual Studio 2022 to run on Windows.

It's not running in a web browser yet, but I was able to port it to Wasm more easily than I thought (personal opinion). If this gets a good reaction, I'll continue with: MUSL libc porting edition → Device driver utilization edition → Running on web browser edition, etc.

EDIT: Retook the Visual Studio screenshot. Since memory-control can only save anonymous maps, over the next few years, I think emulation (or a dedicated execution environment) is the only solution.

Pioneers

Naturally, there are pioneers for this topic. I've cited them before in my article Trying out USB/IP with LKL.

https://retrage01.hateblo.jp/entry/2018/07/21/153000

The one above is a straightforward port using Emscripten's pthread emulation.

What has been achieved

For now, I've confirmed that the Linux kernel, turned into a WebAssembly module, seems to be operating correctly.

  • Newly implemented the application side using C++20 vocabulary instead of the standard pthread/Win32 implementations that come with LKL. This makes it more portable: the exact same implementation is used for both the Windows and POSIX versions.
  • Confirmed a sequence of operations: a user-land process (disguised as a host thread) write(2)s to an fd created with pipe(2), and the host then reads the data back. Examples of creating multiple user processes within an LKL kernel should be quite rare.
  • Using wasm2c output in a multi-threaded environment. This requires a small hack, as mentioned later.

... Well, you can't really do anything with just the kernel, but they say a journey of a thousand miles begins with a single step.

Build

I haven't set up a proper build system yet because I didn't have much time. I'll consider it a future task.

  1. Build preparation: Create .config with make ARCH=lkl menuconfig — the linked file includes USB/IP, but as mentioned later, including this somehow causes it to stop working on Visual Studio.
  2. runbuild.sh: Compiles Linux using its Makefile to generate lib.a (support library within the kernel) and vmlinux.a (the actual kernel body).
  3. buildline.sh:
    • As a countermeasure for the issue where the linker script mentioned later cannot be used with the WebAssembly toolchain, it extracts the list of functions to be executed during kernel initialization from vmlinux.a using llvm-nm and generates initsyms.gen.c.
    • Passes it to Clang along with the created glue code, links it with wasm-ld (the LLD linker for Wasm), and obtains lin.wasm, the WebAssembly-module-fied Linux kernel.
    • Converts lin.wasm into C99 source code with wasm2c.
  4. CMakeLists.txt: Compiles the generated lin.c with Visual Studio along with runtime code and test logic to obtain the executable.

Steps 1 through 3 produce two files: lin.c and lin.h. These are the Linux kernel converted into C source code. If you copy them to a Windows machine, you can perform step 4 on the Windows side. You cannot check out the Linux source code on Windows itself, because forbidden filenames like aux are used and a case-sensitive file system is required. (Well, you could check it out with Cygwin in an environment where WSL is already installed...)

What is LKL?

Please refer to the IIJLab explanation below ↓

https://speakerdeck.com/thehajime/iijlab-seminar-linux-kernel-library-reusable-monolithic-kernel-in-japanese

LKL allows the Linux kernel to be linked into a normal application. By implementing various operations in lkl_host_operations (such as mutex locking and thread creation) on the application side, it makes it easy to implement Linux kernel features like file systems and TCP/IP stacks into applications.

Why wasm2c?

Using wasm2c, included in WABT (https://github.com/WebAssembly/wabt/), you can convert WebAssembly binaries into C source code. In this project, I used it to convert the Linux kernel into a single, massive (102MiB) C source file and link it into the application.

The reason I chose wasm2c over other Wasm runtimes is that LKL requires functionality such as setjmp/longjmp, threads, and TLS, which existing WebAssembly runtimes like V8 or Wasmer cannot support well. Since wasm2c outputs straightforward C code, these features can be implemented on the application side if you understand the output. Also, after compiling with a C compiler, you can debug with native debuggers like gdb or LLDB. This means you can perform normal debugging tasks like setting breakpoints on functions or setting data breakpoints (watchpoints) to stop on invalid memory access (extremely important).

For example, you can observe the backtrace of each kernel thread simultaneously in Visual Studio's Parallel Stacks view.

(↑ If you look closely, w2c_kernel_* are symbols from inside the Linux kernel.)

However, since the wasm2c output contains almost no debug information from the original program (only symbols remain), you are forced into extremely painful debugging, equivalent to debugging at the assembly level. Though, we are trained to handle that kind of debugging...

(↑ Variable names are not displayed even when stopped in the debugger.)

The reason I experimented with WebAssembly immediately, without trying LKL natively first, is because I'm a former kernel engineer and I predicted that I probably wouldn't have trouble with the kernel-side code. If you aren't extremely confident in your kernel skills, you should probably try it with native code first instead of jumping straight into wasm2c. However, as mentioned later, WebAssembly is missing things that normal CPUs usually have, so the tricky part is that you might work hard only to find out it won't work on Wasm anyway...

Also, I didn't use Emscripten this time; I used Clang and LLD directly and ran it only with the wasm2c runtime. I used the same configuration for a previous DOOM porting experiment, but this time I didn't even provide a libc. Emscripten allows you to choose between a browser-oriented runtime written in JavaScript or WASI, but since the Linux kernel doesn't depend on external libraries (it has its own sprintf, etc.), there's no opportunity for Emscripten's runtime library to be useful. Therefore, I decided Emscripten was unnecessary. Though, as mentioned later, __builtin_return_address() was a bit of a loss...

Porting Process

I feel like it was a good opportunity to review C language implementations, as things that are naturally present in ELF were missing in WebAssembly. In a normal life, you probably wouldn't encounter an environment where computed goto cannot be used. (Though you can't use it in Visual Studio anyway... grumble.)

Implementation of lkl_host_operations (Porting Layer)

To run LKL, various operations such as thread creation and mutex locking must be provided by the application side. While the LKL source tree includes reference implementations for pthread and Win32, I implemented it from scratch this time, considering the possibility of passing the entire system through Emscripten or something similar.

The features are divided into small categories: synchronization objects, threads and TLS, malloc, timers, setjmp/longjmp, and miscellaneous others.
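To make the idea of a porting layer concrete, here is a minimal sketch of a host operations table backed by the C++ standard library. The struct host_ops_sketch, its fields, and mutex_sketch are hypothetical and heavily simplified; the real lkl_host_operations has many more entries and different signatures.

```cpp
#include <mutex>

// Hypothetical stand-in for the opaque mutex handle the kernel side holds.
struct mutex_sketch {
  std::recursive_mutex impl;
};

// Hypothetical, reduced table of host operations (function pointers only,
// so the kernel side does not need to know about C++ types).
struct host_ops_sketch {
  mutex_sketch* (*mutex_alloc)();
  void (*mutex_lock)(mutex_sketch*);
  void (*mutex_unlock)(mutex_sketch*);
  void (*mutex_free)(mutex_sketch*);
};

// Captureless lambdas convert implicitly to plain function pointers.
const host_ops_sketch host_ops = {
    []() -> mutex_sketch* { return new mutex_sketch; },
    [](mutex_sketch* m) { m->impl.lock(); },
    [](mutex_sketch* m) { m->impl.unlock(); },
    [](mutex_sketch* m) { delete m; },
};

// Stand-in for kernel-side code that only sees the table.
int locked_increment_sketch(const host_ops_sketch& ops, mutex_sketch* m,
                            int& counter) {
  ops.mutex_lock(m);
  int value = ++counter;
  ops.mutex_unlock(m);
  return value;
}
```

Because only the standard library is used, the same source should build unchanged for both Windows and POSIX targets, which is the portability argument made above.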

Functionalities equivalent to thread termination (pthread_exit(3)) or setjmp/longjmp do not exist in the C++ standard library, so I substituted them with C++ exceptions. Since wasm2c code doesn't have properties that would cause issues if destructors aren't called midway, throwing exceptions directly is fine. This might not be the case for other WebAssembly runtimes. In either case, if you want to achieve this in general WebAssembly, you would likely need to convert it into interruptible code using Binaryen's Asyncify.

For example, the implementation of thread_exit is achieved using C++ try-catch and throw. A dedicated exception class thread_exit is prepared, and the thread code is executed inside a try block:

https://github.com/okuoku/lkl-wasm/blob/407b7298d75cfbb2d81b35558015e73715581baf/_hostwasm/runner/runner.cpp#L314-L322

When you actually want to terminate the thread, you just need to throw thread_exit.

https://github.com/okuoku/lkl-wasm/blob/407b7298d75cfbb2d81b35558015e73715581baf/_hostwasm/runner/runner.cpp#L366-L374
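The pattern can be sketched roughly as follows. This is not the code from runner.cpp; the names thread_exit_sketch, run_thread_body, and exit_current_thread are made up for illustration.

```cpp
#include <functional>

// Hypothetical exception type: thrown to end the current thread early.
struct thread_exit_sketch {};

// Runs the thread body inside a try block.
// Returns true if the body asked to exit, false if it returned normally.
bool run_thread_body(const std::function<void()>& body) {
  try {
    body();
    return false;
  } catch (const thread_exit_sketch&) {
    return true;  // unwound from somewhere deep inside the body
  }
}

// The pthread_exit(3) equivalent: unwind up to run_thread_body.
[[noreturn]] void exit_current_thread() { throw thread_exit_sketch{}; }
```

In a real runner, run_thread_body would be the function handed to std::thread, so that throwing from any depth ends the thread cleanly by returning from its entry function.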

By the way, setjmp and longjmp are implemented in the same way, but I'm not entirely sure if LKL's usage of these adheres to the C standard. For C++ exceptions to substitute for them, the setjmp/longjmp usage must follow the C standard; specifically, they must be used strictly for stack unwinding. There is a common misconception that these can be used for general green thread implementations, and, like the variadic function casting mentioned later, such usage usually works fine on most CPUs.
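A minimal sketch of that substitution, assuming a callback-style interface in which the "setjmp" side runs a function and the "longjmp" side throws. All names here (jmp_buf_sketch, jmp_buf_set_sketch, jmp_buf_longjmp_sketch) are hypothetical, not the actual host operation signatures.

```cpp
#include <functional>

// Hypothetical jump buffer: only needs to carry the value passed to longjmp.
struct jmp_buf_sketch {
  int value = 0;
};

// Hypothetical exception carrying the identity of the target buffer.
struct longjmp_signal_sketch {
  jmp_buf_sketch* target;
};

// setjmp side: runs f. Returns 0 if f returned normally, or the value
// given to jmp_buf_longjmp_sketch if f jumped back to this buffer.
int jmp_buf_set_sketch(jmp_buf_sketch& jb, const std::function<void()>& f) {
  try {
    f();
    return 0;
  } catch (const longjmp_signal_sketch& signal) {
    if (signal.target != &jb) throw;  // aimed at an outer buffer; keep unwinding
    return jb.value;
  }
}

// longjmp side: only valid while the matching jmp_buf_set_sketch call is
// still on the stack, i.e. strictly for unwinding.
[[noreturn]] void jmp_buf_longjmp_sketch(jmp_buf_sketch& jb, int value) {
  jb.value = value;
  throw longjmp_signal_sketch{&jb};
}
```

This only works for jumps up the current stack. Jumping into a frame that has already returned, as a green thread implementation would need, cannot be expressed this way.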

Memory Management

LKL requests one large malloc at startup and uses it as the kernel's address space. This is 64MiB by default, and since I haven't implemented kernel command-line parsing in the WebAssembly port yet, it cannot be changed.

Furthermore, since it's necessary to separately allocate stack regions and buffers for user-land processes, I set aside about 1GiB and divide it internally using mempoolite https://github.com/jefyt/mempoolite.

mempoolite is a malloc implementation extracted from SQLite's memsys5 allocator. The memory to be divided is obtained at startup using the wasm2c runtime.

https://github.com/okuoku/lkl-wasm/blob/407b7298d75cfbb2d81b35558015e73715581baf/_hostwasm/runner/runner.cpp#L874-L875

wasm_rt_grow_memory expands the linear memory. This is an operation that originally corresponds to WebAssembly's memory.grow instruction.
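The flow of "grow the linear memory once, then carve it up" can be sketched as below. This is a simplified model under stated assumptions: linear_memory_sketch only imitates the semantics of memory.grow (it does not call the wasm2c runtime), and pool_sketch is a trivial bump allocator standing in for mempoolite, which is a far more capable allocator that can also free blocks.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr std::uint32_t kWasmPageSize = 65536;  // WebAssembly page size

// Hypothetical model of a growable linear memory.
struct linear_memory_sketch {
  std::vector<std::uint8_t> bytes;

  std::uint32_t pages() const {
    return static_cast<std::uint32_t>(bytes.size() / kWasmPageSize);
  }

  // Like memory.grow: extends the memory and returns the previous page count.
  std::uint32_t grow(std::uint32_t delta_pages) {
    std::uint32_t old_pages = pages();
    bytes.resize(static_cast<std::size_t>(old_pages + delta_pages) * kWasmPageSize);
    return old_pages;
  }
};

// Hypothetical bump allocator handing out offsets into linear memory.
struct pool_sketch {
  std::uint32_t next;  // first free offset
  std::uint32_t end;   // one past the last usable offset

  // align must be a power of two.
  std::optional<std::uint32_t> alloc(std::uint32_t size, std::uint32_t align = 16) {
    std::uint32_t start = (next + align - 1) & ~(align - 1);
    if (start > end || size > end - start) return std::nullopt;
    next = start + size;
    return start;
  }
};

// Grows the memory by pool_pages and turns the new region into a pool.
pool_sketch make_pool_sketch(linear_memory_sketch& memory, std::uint32_t pool_pages) {
  std::uint32_t base = memory.grow(pool_pages) * kWasmPageSize;
  return pool_sketch{base, base + pool_pages * kWasmPageSize};
}
```

Returning offsets rather than host pointers matters here: code inside the Wasm module addresses everything relative to the start of linear memory.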

Multi-threading Support

An important highlight of this experiment might be that real multi-threading was achieved with wasm2c output. Although wasm2c itself partially supports the threads proposal, there is no thread support on the runtime side at all, making this use case very rare. Well, maybe not as rare as running the entire Unity engine through wasm2c like I did before...

When you convert a C/C++ program to Wasm using Clang, the stack is represented by a global variable called __stack_pointer. If you manage this so it becomes a thread-local variable, the wasm2c output can be used simultaneously by multiple threads without issues.

All functions output by wasm2c take a pointer to the instance as the first argument.

u32 w2c_kernel_syscall(w2c_kernel*, u32, u32, u32);

Global variables in the Wasm sense are contained within the instance structure (in this case, w2c_kernel declared in the header generated by wasm2c). By declaring the instance pointer as thread_local and assigning a different instance and stack pointer to each thread, I ensure that __stack_pointer is thread-local.

https://github.com/okuoku/lkl-wasm/blob/407b7298d75cfbb2d81b35558015e73715581baf/_hostwasm/runner/runner.cpp#L63-L64

...Naturally, since the thread's stack region also needs to exist within Wasm's linear memory, I allocate it anew.

The stack layout when generating Wasm with Clang is as described in the Basic C ABI proposal, where the stack pointer for a new thread points to the end of the expanded memory page minus alpha.
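The idea can be sketched with a toy model. The struct instance_sketch below is a hypothetical stand-in for the wasm2c-generated instance structure, reduced to a shared memory pointer and a stack pointer, and roundtrip_via_stack imitates a compiled function that pushes and pops a frame. The 4096-byte stack size is an arbitrary choice for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

constexpr std::uint32_t kStackBytes = 4096;  // per-thread stack size (assumed)

// Hypothetical stand-in for the generated instance structure.
struct instance_sketch {
  std::uint8_t* memory;         // linear memory, shared by all threads
  std::uint32_t stack_pointer;  // stand-in for the __stack_pointer global
};

// Each host thread points at its own copy of the instance.
thread_local instance_sketch* current_instance = nullptr;

// Imitates generated code: push a frame, spill a value, reload it, pop.
std::uint32_t roundtrip_via_stack(instance_sketch* inst, std::uint32_t value) {
  inst->stack_pointer -= 16;  // the stack grows downward
  std::memcpy(inst->memory + inst->stack_pointer, &value, sizeof value);
  std::uint32_t out = 0;
  std::memcpy(&out, inst->memory + inst->stack_pointer, sizeof out);
  inst->stack_pointer += 16;
  return out;
}

std::vector<std::uint32_t> run_threads_sketch(std::uint32_t count) {
  std::vector<std::uint8_t> memory(static_cast<std::size_t>(count) * kStackBytes);
  std::vector<std::uint32_t> results(count);
  std::vector<std::thread> threads;
  for (std::uint32_t i = 0; i < count; ++i) {
    threads.emplace_back([&memory, &results, i] {
      // Same memory, but a private stack pointer at the top of region i.
      instance_sketch inst{memory.data(), (i + 1) * kStackBytes};
      current_instance = &inst;
      std::uint32_t last = 0;
      for (std::uint32_t k = 0; k < 1000; ++k)
        last = roundtrip_via_stack(current_instance, i * 1000 + k);
      results[i] = last;
    });
  }
  for (std::thread& t : threads) t.join();
  return results;
}
```

If all threads shared one instance_sketch, the stack pointer updates would race and frames would overwrite each other. Giving each thread its own copy is what keeps the frames apart.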

The current code does not reclaim the stack region even after a thread terminates, resulting in a memory leak. A mechanism to recycle the stack regions of terminated threads will be necessary.

(I handled multi-threading relatively seriously this time, but LKL is inherently a non-SMP kernel, and like historical UNIX, it's constrained to at most one thread executing inside the kernel at a time. However, since mutual exclusion is performed using host OS synchronization primitives, moments where threads run in parallel are unavoidable, necessitating relatively proper thread support.)

Multi-process Support

LKL is basically designed with the assumption of "using kernel functions as a library." While it is possible to run a single Linux application on top of it (via the Hijack library), it doesn't seem to be intended for running multiple user-land processes like a typical Linux kernel.

Therefore, I modified the kernel side to add an interface for process creation.

When using a syscall within LKL from an application, a dummy thread called a host thread is used. This is implemented as a special thread that is a kernel thread but has no actual code inside the kernel. In other words, the thread has an empty function host_task_stub as its entry point, and this entry point is specially handled within arch/lkl (for example, so that the copy_thread process ends midway).

https://github.com/lkl/linux/blob/3023e6f25fbf6d5f95b4e7ebd011fa688434ce5f/arch/lkl/kernel/syscalls.c#L64

https://github.com/lkl/linux/blob/3023e6f25fbf6d5f95b4e7ebd011fa688434ce5f/arch/lkl/kernel/threads.c#L122-L125

Now, when creating this host thread, CLONE_THREAD is specified as a flag:

https://github.com/lkl/linux/blob/3023e6f25fbf6d5f95b4e7ebd011fa688434ce5f/arch/lkl/kernel/syscalls.c#L52-L53

As described in the clone(2) man page, this flag has the effect of adding a thread to the calling process. In the usual LKL situation, it's created as a child thread of init(8) (pid == 1), which is the host thread initially set up as host0.

...In other words, if we provide a separate way to create a host thread without specifying CLONE_THREAD, we can create a host thread with a new pid (and an independent file descriptor table). This time, I prepared host thread generation routines for new processes and new threads as the wasmlinux_newtask function on the Linux kernel side:

https://github.com/okuoku/lkl-wasm/commit/3847b0a54e69191fba424a2e26ab2436f0882d4c

And I implemented it to be used from the user-land side.

Linker Script Unusable Issue

This was truly a struggle. In normal Linux, including LKL, the kernel collects pointers to functions that perform initialization at boot time into a specific section and executes them sequentially by treating them as an array.

https://github.com/lkl/linux/blob/3023e6f25fbf6d5f95b4e7ebd011fa688434ce5f/include/asm-generic/vmlinux.lds.h#L951-L956

In normal CPU architectures, this process is achieved with a linker script as shown above, but WebAssembly linkers do not support linker scripts, which makes it an interesting point where the approach differs for each porter.

https://twitter.com/FriedChicken/status/1715657974771789911

As seen above, others are struggling with this too. Incidentally, the pioneer below handwrote it entirely.

LKL itself is updated quite frequently, and honestly, applying a large patch to the Linux side is daunting. Therefore, I implemented a procedure that prepares a contiguous array of dummy variables and copies the function pointers into it at startup.

  1. Since the Linux build system leaves the pre-linked Linux as vmlinux.a, I extracted the symbol list by running llvm-nm on it.
  2. Prepared a table sorted in the order of invocation and an array of function pointers to generate C source code — generated source code.
  3. Copied from the table to the array during kernel startup.
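The three steps above can be sketched as follows. This is an illustrative model, not the contents of initsyms.gen.c: the function names, the table, and the array are all hypothetical, and the real generated file would declare kernel symbols as extern instead of defining them.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

using initcall_t = int (*)();

std::string boot_log_sketch;  // records the call order for the example

// Stand-ins for kernel initialization functions.
int early_init_sketch() { boot_log_sketch += "early;"; return 0; }
int core_init_sketch() { boot_log_sketch += "core;"; return 0; }
int device_init_sketch() { boot_log_sketch += "device;"; return 0; }

// What a generated file might contain: a table sorted in invocation order.
const initcall_t initcall_table_sketch[] = {
    early_init_sketch,
    core_init_sketch,
    device_init_sketch,
};
constexpr std::size_t kInitcallCount =
    sizeof(initcall_table_sketch) / sizeof(initcall_table_sketch[0]);

// Contiguous dummy array standing in for the section that a linker script
// would normally delimit with start and end symbols.
initcall_t initcall_section_sketch[kInitcallCount];

// Done once during startup, before the kernel walks the array.
void copy_initcalls_sketch() {
  std::copy(std::begin(initcall_table_sketch), std::end(initcall_table_sketch),
            std::begin(initcall_section_sketch));
}

// Walks the array the way the boot code would; returns the failure count.
int run_initcalls_sketch() {
  int failures = 0;
  for (initcall_t fn : initcall_section_sketch) {
    if (fn != nullptr && fn() != 0) ++failures;
  }
  return failures;
}
```

The kernel code that iterates over the array stays unchanged in spirit; only the way the array gets populated differs from the linker script approach.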

Furthermore, in normal Linux, initialization functions are marked as static and cannot be referenced externally, so I patched the kernel side to remove the static qualifier specifically for __wasm__.

I've also applied linker script countermeasures for the jiffies alias name definition, the initial thread's stack, and the scheduler class order.

Well, honestly, if it's limited to Wasm, creating the linker itself isn't that difficult, so creating a custom linker might be the most robust solution...

Computed Goto Unusable Issue

In GNU-based C, there is a feature to assign labels used for goto targets to variables. On normal CPUs, what is actually assigned at that time is the actual address of the CPU instruction. ...However, since individual instructions in WebAssembly do not have memory addresses, this label assignment feature (computed goto) cannot be used.

The Linux kernel uses this computed goto feature in many places to display the address at the time of execution for debugging, so I patched it appropriately so that it is not used.

https://github.com/okuoku/lkl-wasm/commit/0cb256b6923684cb5b94d8ce9f040c6baa95870f

Also, currently Clang cannot use __builtin_return_address without Emscripten, so I removed it as well.
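As an illustration of the kind of guard involved, here is a minimal sketch that takes a label address where the GNU extension is available and falls back to 0 otherwise. It is not the actual patch; current_ip_sketch and kHasLabelAddresses are hypothetical names, and returning 0 is just one possible fallback for a value that is only used in debug output.

```cpp
#include <cstdint>

#if defined(__GNUC__) && !defined(__wasm__)
constexpr bool kHasLabelAddresses = true;
#else
constexpr bool kHasLabelAddresses = false;
#endif

// Returns an address inside this function where label addresses exist,
// or 0 on targets (such as WebAssembly) where they do not.
std::uintptr_t current_ip_sketch() {
#if defined(__GNUC__) && !defined(__wasm__)
here:
  // GNU "labels as values" extension: &&label yields a void* code address.
  return reinterpret_cast<std::uintptr_t>(&&here);
#else
  return 0;  // no instruction addresses available; report a placeholder
#endif
}
```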

Dependency on C Undefined Behavior

LKL provides the lkl_syscall function to allow calling Linux kernel syscalls. (However, the intended usage for actual LKL-based applications is to call syscalls like read(2) through a library prepared for applications by LKL—though I didn't use any of the user-land libraries provided by LKL this time.)

However, the implementation of lkl_syscall has an issue: it depends on C undefined behavior. The substance of lkl_syscall is simply calling a function pointer stored in the syscall_table array, but:

https://github.com/lkl/linux/blob/3023e6f25fbf6d5f95b4e7ebd011fa688434ce5f/arch/lkl/kernel/syscalls.c#L43-L44

The cast to syscall_handler_t triggers undefined behavior.

https://github.com/lkl/linux/blob/3023e6f25fbf6d5f95b4e7ebd011fa688434ce5f/arch/lkl/kernel/syscalls.c#L22-L28

When converting and calling a C function pointer, the types must be "compatible" (6.3.2.3):

If a converted pointer is used to call a function whose type is not compatible with the referenced type, the behavior is undefined.

One of the conditions for function pointers to be compatible (6.7.6.3) is that the use of an ellipsis (...) must match.

For two function types to be compatible, both shall specify compatible return types. Moreover, the parameter type lists, if both are present, shall agree in the number of parameters and in use of the ellipsis terminator ; corresponding parameters shall have compatible types.

In this case, the cast is to syscall_handler_t, which is a variadic function type, even though all syscall handlers have a fixed number of arguments. This results in undefined behavior in C. In practice, this does not work correctly in WebAssembly (due to the ABI conventions used by Clang within WebAssembly).

I addressed this by creating separate APIs that explicitly pass the number of arguments.

https://github.com/okuoku/lkl-wasm/commit/44aa5f2142c206f938d5652b434a3f8990825da2#diff-8afcdbcd97e3482daecbf8b58daf2661a51b9c66d2272330d348ff9f64ecfa76R61-R98

Since a Linux syscall has at most 6 arguments, providing 7 patterns of function pointers (0 to 6 arguments) is sufficient. Thus, I prepared types from syscall_handler0_t to syscall_handler6_t and cast to the correct type before calling.
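A minimal sketch of this dispatch, assuming a table that stores each handler together with its argument count. The type names mirror the syscall_handler0_t to syscall_handler6_t idea but are otherwise hypothetical, as are the sample handlers; the linked commit is the authoritative version.

```cpp
#include <stdexcept>

using word = long;
using generic_fn = void (*)();  // storage-only type; never called through

using handler0_t = word (*)();
using handler1_t = word (*)(word);
using handler2_t = word (*)(word, word);
using handler3_t = word (*)(word, word, word);
using handler4_t = word (*)(word, word, word, word);
using handler5_t = word (*)(word, word, word, word, word);
using handler6_t = word (*)(word, word, word, word, word, word);

// Hypothetical table entry: the handler plus how many arguments it takes.
struct syscall_entry_sketch {
  generic_fn fn;
  int nargs;
};

// Casts back to the handler's real type before calling, so the call
// always goes through a compatible function pointer type.
word call_syscall_sketch(const syscall_entry_sketch& e, const word (&a)[6]) {
  switch (e.nargs) {
    case 0: return reinterpret_cast<handler0_t>(e.fn)();
    case 1: return reinterpret_cast<handler1_t>(e.fn)(a[0]);
    case 2: return reinterpret_cast<handler2_t>(e.fn)(a[0], a[1]);
    case 3: return reinterpret_cast<handler3_t>(e.fn)(a[0], a[1], a[2]);
    case 4: return reinterpret_cast<handler4_t>(e.fn)(a[0], a[1], a[2], a[3]);
    case 5: return reinterpret_cast<handler5_t>(e.fn)(a[0], a[1], a[2], a[3], a[4]);
    case 6: return reinterpret_cast<handler6_t>(e.fn)(a[0], a[1], a[2], a[3], a[4], a[5]);
    default: throw std::invalid_argument("unsupported argument count");
  }
}

// Sample handlers with fixed argument counts.
word sys_zero_sketch() { return 0; }
word sys_add_sketch(word x, word y) { return x + y; }
word sys_sum6_sketch(word a, word b, word c, word d, word e, word f) {
  return a + b + c + d + e + f;
}

const syscall_entry_sketch syscall_table_sketch[] = {
    {reinterpret_cast<generic_fn>(&sys_zero_sketch), 0},
    {reinterpret_cast<generic_fn>(&sys_add_sketch), 2},
    {reinterpret_cast<generic_fn>(&sys_sum6_sketch), 6},
};
```

Converting a function pointer to another function pointer type and back is well defined; what is undefined is calling through the mismatched type, which is exactly what the switch avoids.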

...One might argue that if I've investigated this far, I should upstream the fix. However, this cast (specifically, casting between a fixed-argument integer function and a variadic integer function) works correctly on most CPU ABIs. Since I'm essentially the only one using Linux syscalls in WebAssembly, I'm holding off on submitting a patch just for that.

Remaining Tasks

When incorporating USB/IP with Visual Studio + Win10, it fails to boot. I suspect this is because USB/IP depends on the timer and expects it to be working correctly before /init starts. In fact, it doesn't work correctly even when running on Linux, so I've been using a workaround; I'll need to investigate this seriously to get USB working. Well, it might be easier to just postpone the initialization of USB/IP as it's a bit tedious.

The linker used, LLD (wasm-ld), somehow doesn't support --start-group or --end-group, so the Linux Makefile currently doesn't complete successfully. I've left it as is since it doesn't cause practical issues for now.

make -f ./scripts/Makefile.vmlinux_o
wasm-ld: error: unknown argument: --start-group
wasm-ld: error: unknown argument: --end-group
make[2]: *** [scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1

Conclusion

It was easy (KONAMI style).

What is it useful for?

To be honest, I think defining a Linux user-land ABI on WebAssembly is 100 times more important and useful than something like this. In practice, there's no need to make the kernel itself WebAssembly; there should be more demand for an environment that executes binaries distributed in WebAssembly on a Linux kernel running on ARM or RISC-V.

Since WASI and WASIX (https://wasix.org/) lack crucial functionalities like mmap, they are quite limited as a foundation for distributing common binaries. Just as there's a technique to run ARM binaries in Docker on amd64 using qemu-user, I think it would be better to just define the WasmLinux ABI (without the kernel) and execute WebAssembly apps directly via an ABI translation layer.

Even if someone wanted to run it in a web browser, I think running the kernel on an x86 emulator would be superior in terms of performance and usability. Either way, to run typical apps, mmap must be emulated.

The argument "Then why don't you do it?" is a very fair one, but this is extremely difficult to balance between compromise and pursuit. WebAssembly is a very unique architecture with functional gaps that the existing Linux ABI doesn't expect (such as function pointers not being memory addresses). Deciding whether to emulate those or treat them as constraints requires a lot of "charisma" to steer the direction...

As a benefit of running as many parts as possible on WebAssembly, it might make supporting custom CPUs easier. Developing an LLVM backend for a custom CPU or preparing a complete C ABI is quite an ordeal, but if you just need to port a WebAssembly runtime to your custom CPU, it might not be that hard.

Next Steps

Since I have already ported MUSL libc, I'm wondering whether I can run user-land binaries by making them into DLLs... Dreaming big, I want to aim for full BusyBox execution.

To achieve this, things like:

  • Implementation of heap allocation methods such as sbrk
  • Syscall emulation related to processes/threads like clone and vfork
  • Signal-related features

and so on are necessary. Since these can't be implemented with LKL alone, I'd like to manage them by implementing my own simple microkernel.

Regarding implementation, I expect it would take a form where some syscalls are filtered and redirected to a custom implementation, similar to PRoot (https://github.com/proot-me/proot).

As for the file system, well, something simple will do for now... since pipe communication with the host system is already working, I could prepare something proper using FUSE.

In fact, serious mmap emulation isn't strictly necessary in the short term — BusyBox doesn't depend on mmap. For a real implementation, techniques like using the host OS's mmap (hardmmu) or converting all loads and stores via a virtual TLB, similar to copy/v86 (softmmu), would be required. Softmmu is essential for web browser support, but it's correspondingly difficult.

...If you push this further, it might eventually lead toward implementing a Linux-compatible kernel. Especially in a web browser, the motivation to use Linux device drivers is extremely low anyway. So, I feel like a good approach would be to first implement it by leveraging Linux (LKL) as much as possible, and then eventually replace it with a custom Linux-compatible kernel.
