☣️

RustでCUDA(Cublas)を使った行列計算を実装する

2023/03/31に公開

Docker

はじめに

Julia言語大好きマンなのだが、最近Rust言語で遊ぶことに夢中になっている。

でも...CUDAを使った演算がしてみたい！！JuliaのCUDA.jlみたいに簡単に使いたい！
と思ったのだが、クレートが見つからない。
そこで，実装してみようと思った。

想定読者

RustでCUDA libを使いたい人
Rustのビルドをコンテナで行いたい人

コード

基本的にここに載せているコードは，以下で管理しているリポジトリから引用している。

Matrix structの実装

・Matrix型を定義する。ここではサイズ可変のMatrixを定義している。

dataにArrayの値を保持している。Indexを指定して値を取り出す。
dimsには，Arrayの各軸の要素数を保持している。

具体的には，Array<f64, 2>型だとf64のArrayで，2次元の配列(内部的には一次元のVecだが)=行列に対応する。

#[derive(Debug, Clone)]
pub struct Array<T, const D: usize> {
    pub dims: [usize; D],
    pub data: Vec<T>,
}

pub type Vector<T> = Array<T, 1>;
pub type Matrix<T> = Array<T, 2>;

cublasのRustバインド

cublasのRustバインドが存在したので，これを使う。

存在しない場合はbindgenを使って生成することもできる。すごいね。

ちなみに以下のcusolverバインドはbindgenを使って生成してみました。(なさそうだったので)

cusolverで固有値計算とかしてみようとおもったのですが，cusolverの関数を呼び出すと
CUSOLVER_STATUS_INTERNAL_ERRORになってしまってうまくいかなかった。なんでだろう。

devcontainerを使った仮想環境での開発

VSCodeに組み込まれているdevcontainerをコマンドとして使う

.devcontainerの具体例は以下。

インストール方法は割愛

以下のREADMEやその他サイトを参照してください。
https://github.com/devcontainers/cli

基本的には以下のコマンドだけで良かったと思う。

sudo npm install -g @devcontainers/cli

dockerfile

devcontainer用のDockerfileはRust-GPUリポジトリにあるものをそのまま流用している。

devcontainerの設定ファイルを用意する

デフォルトの設定ファイルを改変しているので、無駄なところが多々あると思う。
最後の "--gpus", "all"を忘れると，GPUが使えない。
containerEnvでコンテナに環境変数を設定している。実行時にこの環境変数は有効となっているので，デバッグフラグなどに使うのがよいだろう。

{
	"name": "devcontainer CLI Demo CUDA Rust",
    "image": "cuda-rust:latest",
	"build": {
		"dockerfile": "Dockerfile"
	},

	"customizations": {
		// 👇 Config only used for VS Code Server
		"vscode": {
			"extensions": [
				"streetsidesoftware.code-spell-checker",
				"mutantdino.resourcemonitor"
			],
			"settings": {
				"resmon.show.battery": false,
				"resmon.show.cpufreq": false
			}
		},
		// 👇 Config only used for openvscode-server
		"openvscodeserver": {
			"extensions": [
				"streetsidesoftware.code-spell-checker"
			],
			"settings": { }
		}
	},
	
	// 👇 Dev Container Features - https://containers.dev/implementors/features/
	"features": {
		"ghcr.io/devcontainers/features/go:1": {
			"version": "1.18.4"
		},
		"ghcr.io/devcontainers/features/node:1": {
			"version": "16.15.1",
			"nodeGypDependencies": false
		},
		"ghcr.io/devcontainers/features/desktop-lite:1": { },
		"ghcr.io/devcontainers/features/docker-in-docker:1": { },
		// Optional - For tools that require SSH
		"ghcr.io/devcontainers/features/sshd:1": { }
	},

	// We are using appPort since forwardPorts not yet supported directly 
	// by the CLI. See https://github.com/devcontainers/cli/issues/22
	// A pre-processor can easily parse devcontainer.json and inject
	// these values as appropriate. We're omitting that for simplicity.
	"appPort": [
		// Expose SSH port for tools that need it (e.g. JetBrains)
		"127.0.0.1:2222:2222",
		// Port VS Code Server / openvscode-server is on
		8000,
		// Port for VNC web server contributed by the desktop-lite feature
		6080
	],
    "containerEnv": {
        "RUST_BACKTRACE=1": "1"
    },

	//"remoteUser": "vscode"
    "runArgs": [ 
        "--gpus", "all"
    ]
}

devcontainerを起動する

これを，.devcontainerがあるフォルダで実行する。
するとdockerコンテナが立ち上がることがdocker ps

devcontainer up --workspace-folder .

コンテナでcargo runを実行する

コンテナでcargo runするには，以下のようにdevcontainer execを使う。

devcontainer exec --workspace-folder . cargo run --features=cuda

--features=cudaとしているのは，これを指定しているときだけcublas-sysクレートを使用したコードを有効にしているためである。

このようにcudaなどの外部ライブラリに依存するようなcrateを作成するときには、devcontainerを使うことで開発へ集中することができる。

cublasの利用方法

引数

公式ページを参考にする。blasに使い慣れていないと大量の引数で混乱しがち。

MatrixのデータをGPUメモリに乗せたCuMatrixの実装

CUDAライブラリの計算関数を実行する際、Rustで実装したMatrix型のフィールドのdataをGPUメモリに転送する必要がある。
まずは行列のデータをGPUに転送したり、GPUからCPUメモリに転送する部分を実装しよう。
実装するといってもCUDAライブラリでAPIは提供されている。

cpu -> gpuの実装

とりあえず、f32型だけに実装する。

pub trait CPU {
    type Output;
    fn cpu(&self) -> Self::Output;
}

impl CPU for CuMatrix {
    type Output = Matrix<f32>;
    fn cpu(&self) -> Self::Output {
        let mut data = vec![0.0f32; self.rows*self.cols];
        let n = self.rows*self.cols*size_of::<c_float>();
        memcpy_to_host(data.as_mut_ptr(), self.data_ptr, n).unwrap();
        Matrix { rows: self.rows, cols: self.cols, data: data}
    }
}

gpu -> cpuの実装

pub trait GPU {
    type Output;
    fn gpu(&self) -> Self::Output;
}

impl GPU for Matrix<f32> {
    type Output = CuMatrix;
    fn gpu(&self) -> Self::Output {
        let n = self.rows*self.cols*size_of::<c_float>();
        let mut a_ptr: *mut f32 = malloc(n).unwrap();
        memcpy_to_device(a_ptr, self.data.as_ptr(), n).unwrap();
        CuMatrix { rows: self.rows, cols: self.cols, data_ptr: a_ptr }
    }
}

行列積Mulの実装

CuMatrixにMulトレイトを実装する。

cublasのcublasSgemm_v2を使って行列積を行う。
blasより手順が増えるが，基本的な流れはほぼ一緒。

行列をGPUにのせる
cublasHandleを初期化する
cublasの処理を呼ぶ
行列をGPUからCPUにコピーする

impl Mul<CuMatrix> for CuMatrix {
    type Output = Result<Self, CuMatrixError>;

    fn mul(self, other: Self) -> Self::Output {
        if self.cols != other.rows {
            return Err(CuMatrixError::UndefinedError("matrix size does not match.".to_string()));
        }

        let handle = CublasHandle::new().unwrap();
        let mut mat = Matrix::<f32>::zeros([self.rows, other.cols]);
        let mut result = mat.gpu();

        let m = self.rows;
        let n = other.cols;
        let k = self.cols; // = other.rows


        let alpha: c_float= 1.0;
        let beta: c_float= 0.0;

        let status = 
        unsafe {
            cublasSgemm_v2(
                handle.handle,
                cublasOperation_t::CUBLAS_OP_N,
                cublasOperation_t::CUBLAS_OP_N,
                m as i32,
                n as i32,
                k as i32,
                &alpha,
                self.data_ptr,
                m as i32,
                other.data_ptr,
                n as i32,
                &beta,
                result.data_ptr,
                k as i32,
            );
        };

        free(self.data_ptr).unwrap();
        free(other.data_ptr).unwrap();
        handle.destroy();
        Ok(result)
    }
}

使用例

以下のような感じでCUDAで行列積を計算できるようになった。

use linear_algebra::{Complex, Matrix, Vector};
use linear_algebra::{CPU,GPU};
let n = 3;
let x = Matrix::<f32>::rand([n, n]);
let y = Matrix::<f32>::rand([n, n]);
let cx = x.gpu();
let cy = y.gpu();
let cz = cx * cy;
let z = cz.unwrap().cpu();
println!("z: {}", z);

=> Result
z: 3x3 Array
 1.03856  0.81080  0.78834 
 1.12424  1.01148  0.45698 
 1.52945  1.29110  0.96300 

まとめ

・bindgenでcublasなどのライブラリは簡単に利用できる。
・よく使用されているcライブラリなどは、*-sysという名前でRust用のバインドが作成されている可能性があるので、あればそれを使うのがいいと思う。

Discussion

ログインするとコメントできます