😀

CUDA on WSLの実行速度をGoogle Colaboratoryと比較してみた

2021/10/16に公開

深層学習モデルで動かしてみたいリポジトリがあったのだが，どうにもLinux環境でないと実行できなさそうであった．そのため，ローカル環境にCUDA on WSLを導入してみたものの，ちょっと計算が遅いかなと感じていた．そこで，Google Colaboratoryで動かしてみることにした．せっかくなので，この2つの環境でどの程度計算速度に差が出るのか比較してみた．

実行したリポジトリ

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts

前に紹介したNeural Source Filterの発展版のPytorch実装である．実行がシェルスクリプトで記述されていることもあり，Linuxで動作させたほうが現実的であると判断した．

Google Colaboratoryでの実行方法

リポジトリより

Take cyc-noise-nsf as an example:

# add $PWD to PYTHONPATH 
# you may also activate conda environment
$: bash env.sh 

# cd into the project and run the script
$: cd project/cyc-noise-nsf-4
$: bash 00_demo.sh

と実行するとデータセットのダウンロード，学習済みモデルを用いた波形生成，学習とテストを実行してくれる．今回は

Google DriveにリポジトリをClone
!pip install numpy --upgradeを行う（しないとスクリプト実行時にエラーが出る）

を行い，以下のように実行した．リポジトリの配置はお好きなようにどうぞ

%cd /content/gdrive/My\ Drive/neural_source_filter/neural_source_filter/ #リポジトリのディレクトリ
!bash env.sh
%cd ./project/cyc-noise-nsf-4/
!bash 00_demo.sh

スペック

ローカル環境

ローカル環境のスペックは以下の通り

CPU：Intel(R)Core(TM)i7-8700k 3.70GHz
メモリ：24.0GB
GPU：NVIDIA GeForce RTX 2080 Ti

Google Colaboratory

colaboratory上で以下のコマンドを実行して確認した．

!cat /proc/cpuinfo # CPU
!free -h # メモリ
!nvidia-smi # GPU

実行結果は以下の通り

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) CPU @ 2.00GHz
stepping	: 3
microcode	: 0x1
cpu MHz		: 2000.170
cache size	: 39424 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa
bogomips	: 4000.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) CPU @ 2.00GHz
stepping	: 3
microcode	: 0x1
cpu MHz		: 2000.170
cache size	: 39424 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa
bogomips	: 4000.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

              total        used        free      shared  buff/cache   available
Mem:            12G        571M         10G        944K        2.1G         11G
Swap:            0B          0B          0B
Sun Jul  5 01:59:09 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

実行速度

実行速度は処理に時間を要したため1epoch（5688iter）のみで比較する．以下が1epochにかかった時間，1iterあたりのforward及びbackwardの計算時間である．

	Local	Google Colaboratory
1epochの処理時間（h）	3.5	0.75
forwardの計算時間（s/iter）	0.51	0.04
backwardの計算時間（s/iter）	1.61	0.41

1epochの処理時間が4.6倍ほど処理が遅い…この差はさすがに許容できないなと思った．
また，forwardとbackwardに関しては，colaboratoryのほうはforwardに対してbackwardが10倍ほど計算時間がかかっているのに対して，ローカル環境では3倍程度となっている．これと処理速度の差に何か関係があるのか気になる．

原因の考察

正直なところ原因はわかっていない．しかし，考えられる要素として

ローカルは仮想環境上で実行している
GPUの性能差
CUDA on WSLがパフォーマンスに関して保証していない

あたりは原因として考えられる．ここは今後調べて行きたいところではある．

まとめ

現状，Linux上で深層学習をやるならGoogle Colaboratoryに軍配が上がる．しかし，Colaboratoryは12時間でセッションが切れてしまうため，大規模なネットワークの学習をするにはネットワークの保存や学習データの用意等を色々検討しなければならないのがネックである．
今後やっていきたいこととしては

Windows上で実行した場合と上記2つの性能比較
原因の調査

を行っていきたい．それでもColaboratoryのほうが性能良かったら課金も視野に入れようかと思案している．

GitHubで編集を提案

実行したリポジトリ

Google Colaboratoryでの実行方法

スペック

ローカル環境

Google Colaboratory

実行速度

原因の考察

まとめ

Discussion