Translated by AI
Implementing Basic GPU Support in My Custom Programming Language
This article is for Day 20 of the Mie University Computing Research Group Advent Calendar 2020.
Background / Overview
Previously, I worked on GPGPU with Rust and on self-hosting a custom language. Having gathered the necessary knowledge, I decided I wanted to drive a GPU from my own language.
I managed to implement x86_64 assembly generation, PTX code generation, and execution on the GPU. I'd like to write about the knowledge that was needed, along with my impressions.
The repository is at https://gitlab.com/spica314/neco2. The compiler is implemented in Rust. I considered using C this time, but the lack of things like #[derive(Debug)] was just too painful.
As for results, I was able to render Image 1 from Chapter 2 of Ray Tracing in One Weekend on the GPU. Yay! (Link: source code written in the custom language)
...To be honest, I wanted to get it working up to around Chapter 6, but since I haven't yet implemented conditional branches, loops, or allocating structs to (PTX-level) registers in the GPU code generation, this is as far as I've gotten for now.
Overview of the Custom Language
I was going to write a "Hello, world", but string literals aren't implemented yet.
Syntactically, it's a language that looks like Rust with the parentheses for function calls stripped away and some mystery added. However, since I haven't implemented anything like Haskell's $ operator yet, it's still full of parentheses anyway.
This time, I implemented basic type inference, so let can now be used without type annotations.
A collection of disorganized test programs: https://gitlab.com/spica314/neco2/-/tree/0dbbc599ec0afa91db134242db6feb3b0bbc02ea/examples
exit 42
fn main () = {
    __mov __rax 60;
    __mov __rdi 42;
    __syscall ();
}
Something like inline assembly is not written as strings. I plan to eventually introduce asm blocks similar to unsafe blocks.
exit_group 42 (GPU ver.)
#[ptx]
fn f (x: &[i32; 32]) = {
    let thread_id = __ctaid_x * __ntid_x + __tid_x;
    (*x)[thread_id] = 6 * thread_id;
}

fn main () = {
    __init_cuda ();
    let xs = [0i32; 32];
    __call_cuda f &xs 1 1 1 32 1 1;
    let x = xs[7];
    __mov __rax 231;
    __mov __rdi x;
    __syscall ();
}
This is the GPU version. The kernel f stores 6 * i in the i-th element of the array, so the 7th element comes out to 42. The function f itself is executed on the GPU.
Compilation Flow
1. Generate PTX code for the GPU (PTX code is described later)
2. Generate x86_64 assembly
  - The generated PTX code is embedded in the assembly as a string
  - Execution on the GPU uses the CUDA Driver API (details later)
3. Assemble and link the generated assembly using gcc
  - e.g. gcc -o a.out t.s /opt/cuda/lib64/stubs/libcuda.so
I'll write about 2 first.
2. Generating x86_64 Assembly
This part is the same as usual(?).
Main references used:
- Introduction to C Compiler Construction for Those Who Want to Know the Lower Layers (低レイヤを知りたい人のためのCコンパイラ作成入門), https://www.sigbus.info/compilerbook
- Especially when I got confused about the stack and stack pointer
- LINUX SYSTEM CALL TABLE FOR X86 64, https://blog.rchapman.org/posts/Linux_System_Call_Table_for_x86_64/
- System call numbers and register usage
Also, this time I wrote the parser part by referring to Rust's syn crate.
While referring to it, I thought the following style, found at https://docs.rs/syn/1.0.56/syn/parse/index.html, was cool:
impl Parse for ItemStruct {
    fn parse(input: ParseStream) -> Result<Self> {
        let content;
        Ok(ItemStruct {
            struct_token: input.parse()?,
            ident: input.parse()?,
            brace_token: braced!(content in input),
            fields: content.parse_terminated(Field::parse_named)?,
        })
    }
}
It's nice not having to write the type in parse(), and I imagine that the processing for when it fails midway is handled on the ParseStream::parse() side (which probably calls Parse::parse() internally). Previously, I wanted to use ? but couldn't implement the logic to revert state, which made the code messy. This time, by adopting this style, the program's appearance has significantly improved. Probably.
I'll write about the CUDA Driver API after section 1.
1. Generating PTX Code
PTX is an Instruction Set Architecture (ISA) for NVIDIA GPUs. Personally, it feels more fitting to call it an intermediate language.
For example, if you compile a CUDA program like this:
extern "C" {
    __global__ void copy_array(int *xs, int *ys) {
        int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
        xs[thread_id] = ys[thread_id];
    }
}
into PTX code using nvcc --ptx hoge.cu, it looks like the following:
.visible .entry copy_array(
    .param .u64 copy_array_param_0,
    .param .u64 copy_array_param_1
)
{
    .reg .b32 %r<6>;
    .reg .b64 %rd<8>;

    ld.param.u64 %rd1, [copy_array_param_0];
    ld.param.u64 %rd2, [copy_array_param_1];
    cvta.to.global.u64 %rd3, %rd1;
    cvta.to.global.u64 %rd4, %rd2;
    mov.u32 %r1, %ctaid.x;
    mov.u32 %r2, %ntid.x;
    mov.u32 %r3, %tid.x;
    mad.lo.s32 %r4, %r1, %r2, %r3;
    mul.wide.s32 %rd5, %r4, 4;
    add.s64 %rd6, %rd4, %rd5;
    ld.global.u32 %r5, [%rd6];
    add.s64 %rd7, %rd3, %rd5;
    st.global.u32 [%rd7], %r5;
    ret;
}
It looks like something that could plausibly be generated.
By the way, the .reg declarations don't need to be grouped at the beginning; they work even if written separately (e.g., %r<6> declares %r0 through %r5 collectively, but you can also declare registers individually). Also, while the output looks like SSA (Static Single Assignment) form, that doesn't seem to be required.
One thing I found to be a bit of a trap is that a function like:
__global__ void f(int *xs, int *ys) {
    // ...
}
in PTX code looks more like:
__global__ void f(int **xs, int **ys) {
    // ...
}
(presumably because kernel parameters live in PTX's .param memory space, so the pointer value itself must first be loaded from there; I'm not entirely sure what's happening internally). However, you don't need to worry about it too much(?); I think it will work if you generate ld.param.u64 and cvta.to.global.u64 in a boilerplate-like fashion. Probably.
When generating, I referred to these two resources:
- Small CUDA programs written by myself and compiled with nvcc
- PARALLEL THREAD EXECUTION ISA v7.0, https://docs.nvidia.com/pdf/ptx_isa_7.0.pdf
During debugging, passing the generated PTX code to the ptxas command provides helpful error messages.
2. (Continued) Running PTX Code Using the CUDA Driver API
To run the generated PTX code on a GPU, use the CUDA Driver API. The documentation for the CUDA Driver API is at https://docs.nvidia.com/cuda/cuda-driver-api/index.html.
Listing the function names I used:
- cuInit
- cuDeviceGet
- cuCtxCreate_v2
- cuModuleLoadData
  - Pass the generated PTX code here
- cuModuleGetFunction
- cuMemAlloc_v2
- cuMemcpyHtoD_v2
- cuLaunchKernel
- cuMemcpyDtoH_v2
Calling them in this order allows you to execute the PTX code (and transfer data, etc.).
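As a sketch of that sequence in Rust-FFI form (abridged hand-written signatures, no error checking, and a placeholder PTX string; this needs an NVIDIA GPU and libcuda to actually run, so treat it purely as illustration of the call order, not as the article's actual hand-written assembly):

```rust
// Sketch of the CUDA Driver API call sequence via Rust FFI.
// Link against the stub, e.g.: rustc t.rs -L /opt/cuda/lib64/stubs -lcuda
// Error checking is omitted; a real program should check every CUresult.

use std::ffi::c_void;
use std::os::raw::{c_char, c_int, c_uint};

type CUresult = c_int;
type CUdevice = c_int;
type CUdeviceptr = u64;

#[link(name = "cuda")]
extern "C" {
    fn cuInit(flags: c_uint) -> CUresult;
    fn cuDeviceGet(device: *mut CUdevice, ordinal: c_int) -> CUresult;
    fn cuCtxCreate_v2(pctx: *mut *mut c_void, flags: c_uint, dev: CUdevice) -> CUresult;
    fn cuModuleLoadData(module: *mut *mut c_void, image: *const c_void) -> CUresult;
    fn cuModuleGetFunction(
        hfunc: *mut *mut c_void,
        hmod: *mut c_void,
        name: *const c_char,
    ) -> CUresult;
    fn cuMemAlloc_v2(dptr: *mut CUdeviceptr, bytesize: usize) -> CUresult;
    fn cuMemcpyHtoD_v2(dst: CUdeviceptr, src: *const c_void, n: usize) -> CUresult;
    fn cuLaunchKernel(
        f: *mut c_void,
        gx: c_uint, gy: c_uint, gz: c_uint,
        bx: c_uint, by: c_uint, bz: c_uint,
        shared_mem_bytes: c_uint,
        stream: *mut c_void,
        kernel_params: *mut *mut c_void,
        extra: *mut *mut c_void,
    ) -> CUresult;
    fn cuMemcpyDtoH_v2(dst: *mut c_void, src: CUdeviceptr, n: usize) -> CUresult;
}

// Placeholder: the compiler embeds the generated, NUL-terminated PTX here.
const PTX: &[u8] = b"...generated PTX code...\0";

fn main() {
    unsafe {
        let mut dev: CUdevice = 0;
        let mut ctx = std::ptr::null_mut();
        let mut module = std::ptr::null_mut();
        let mut func = std::ptr::null_mut();

        cuInit(0);
        cuDeviceGet(&mut dev, 0);
        cuCtxCreate_v2(&mut ctx, 0, dev);
        cuModuleLoadData(&mut module, PTX.as_ptr() as *const c_void);
        cuModuleGetFunction(&mut func, module, b"f\0".as_ptr() as *const c_char);

        let mut xs = [0i32; 32];
        let bytes = std::mem::size_of_val(&xs);
        let mut dptr: CUdeviceptr = 0;
        cuMemAlloc_v2(&mut dptr, bytes);
        cuMemcpyHtoD_v2(dptr, xs.as_ptr() as *const c_void, bytes);

        // kernelParams is an array of pointers to each argument's value.
        let mut args: [*mut c_void; 1] = [&mut dptr as *mut _ as *mut c_void];
        cuLaunchKernel(
            func, 1, 1, 1, 32, 1, 1, 0,
            std::ptr::null_mut(), args.as_mut_ptr(), std::ptr::null_mut(),
        );

        cuMemcpyDtoH_v2(xs.as_mut_ptr() as *mut c_void, dptr, bytes);
        // With the kernel `f` from the earlier example, xs[7] would now be 42.
    }
}
```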
Since my custom compiler doesn't support the System V ABI yet, I'm currently outputting hand-written assembly for the initialization and calling parts.
To call functions from a shared library, it seems good to append @PLT to the function name.
By the way, when using the CUDA Driver API, you must use the exit_group system call (number 231) instead of the exit system call (number 60), or the execution might hang (this might be because I'm skipping calls to cleanup functions like Destroy).
3. Assembly and Linking
Finally, the generated assembly is assembled and linked using gcc.
Excluding the GPU-related parts, the generated assembly does not depend on libc or anything else. I originally wanted to process it using as and ld, but since libcuda.so seems to require the C runtime, I settled on using gcc.
(This concludes the section on the compilation flow.)
On Rewriting the Program, etc.
For this version, I discarded the program I had previously self-hosted and rewrote it from scratch.
The reason was that I attempted to introduce a somewhat decent IR (Intermediate Representation) into my previous compiler, but the amount of rewriting required was so massive that I decided to start over from zero. If you plan to use an IR, it might be better to write it with that in mind from the start.
Also, while my previous goal was self-hosting, I did not make self-hosting a goal this time.
The reason is that when adding a feature, you end up maintaining two implementations: the bootstrap compiler and the self-hosted one. Since the self-hosted version lacks many language features, I would first write the code in the bootstrap version, then rewrite the self-hosted version to use the new feature, then add the feature there as well... it was double the effort.
I intend to pursue self-hosting once the language features are sufficiently mature (who knows when that will be...).
Future?
- First, I want to run "Ray Tracing in One Weekend" all the way to the end on the GPU using my custom language.
- I haven't touched anything like a standard library yet, so perhaps that?
- Architecturally/Technically? High-Level Synthesis, DPU (is it even possible?), Wasm, OpenQASM, etc...
Final Thoughts
- I want to implement all sorts of computations in my own language (ambition).
- Compilers are fun!