Limiting NVIDIA GPU power by both wattage and clock speed

On a dual-CPU machine with a 200 V, 2000 W power supply and four RTX 3090s, running training with the CPUs, GPUs, and I/O fully loaded makes the machine crash and reboot.

As a first step, following the reference below, try applying only a wattage limit.
Check the current power readings:
nvidia-smi -q -d POWER
==============NVSMI LOG==============
Timestamp : Thu Aug 15 10:50:28 2024
Driver Version : 560.28.03
CUDA Version : 12.6
Attached GPUs : 4
GPU 00000000:02:00.0
GPU Power Readings
Power Draw : 231.52 W
Current Power Limit : 250.00 W
Requested Power Limit : 250.00 W
Default Power Limit : 350.00 W
Min Power Limit : 100.00 W
Max Power Limit : 350.00 W
Power Samples
Duration : 2.36 sec
Number of Samples : 119
Max : 353.47 W
Min : 123.39 W
Avg : 231.04 W
GPU Memory Power Readings
Power Draw : N/A
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
...
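The same readings can also be pulled in a compact CSV form, which makes it easier to watch the total GPU draw across all cards. The following is a minimal sketch, not part of the original notes: `sum_power_draw` is a hypothetical helper name, and it assumes the `power.draw` and `power.limit` query fields are available on the installed driver.

```shell
# Sum the per-GPU power draw from CSV lines of the form
# "index, draw_watts, limit_watts" read on stdin.
# (sum_power_draw is a hypothetical helper, not an nvidia-smi feature.)
sum_power_draw() {
    awk -F', *' '{ total += $2 } END { printf "%.2f\n", total }'
}

# Only query real hardware when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,power.draw,power.limit \
        --format=csv,noheader,nounits | sum_power_draw
fi
```

Running this repeatedly during training (for example under `watch`) gives a rough view of how close the four GPUs together get to the power supply budget.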
Try setting the power limit to 250 W for now:
sudo nvidia-smi -pl 250
Power limit for GPU 00000000:09:00.0 was set to 200.00 W from 350.00 W.
Warning: persistence mode is disabled on device 00000000:09:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
...
All done.
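The warning above is about persistence mode being disabled. A minimal sketch of one way to handle it, assuming the Min/Max Power Limit values from the earlier output (100 W and 350 W): enable persistence mode with `-pm 1` before applying `-pl`, and reject values outside the supported range. The function names are hypothetical helpers, not part of nvidia-smi.

```shell
# Succeed only if the requested limit lies within [min, max] watts.
# (power_limit_in_range is a hypothetical helper.)
power_limit_in_range() {
    awk -v w="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(w >= lo && w <= hi) }'
}

# Enable persistence mode, then apply the limit to all GPUs.
# Arguments: requested watts, min watts, max watts.
apply_power_limit() {
    if ! power_limit_in_range "$1" "$2" "$3"; then
        echo "requested limit $1 W is outside $2-$3 W" >&2
        return 1
    fi
    sudo nvidia-smi -pm 1 && sudo nvidia-smi -pl "$1"
}

# Usage on the GPU host (not run here):
#   apply_power_limit 250 100 350
```

The range check is only a convenience; nvidia-smi itself also refuses values outside the range it reports.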

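The machine stays up longer than before, but after a while it crashes again.

Even at 200 W it crashes after a while.
With a 2000 W supply, the load is 200 W x 4 + CPUs + everything else, so it should be enough...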
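The budget arithmetic can be written out as a small sketch. Note that the Power Samples block earlier shows a Max of 353.47 W even though the limit at that time was 250 W, so the wattage limit may not cap short spikes. This is only one possible explanation for the crashes, not a confirmed cause. `estimate_peak_watts` is a hypothetical helper, and the CPU and other loads were not measured in these notes, so they are passed as 0 here.

```shell
# Estimate total draw in watts: gpu_count * watts_per_gpu + other_watts.
# (estimate_peak_watts is a hypothetical helper; other_watts is whatever
# the CPUs, drives, and fans draw, which was not measured here.)
estimate_peak_watts() {
    awk -v n="$1" -v gpu="$2" -v other="$3" \
        'BEGIN { printf "%.2f\n", n * gpu + other }'
}

estimate_peak_watts 4 200 0      # GPU share at the nominal 200 W cap
estimate_peak_watts 4 353.47 0   # GPU share if all four hit the sampled Max together
```

The first call gives 800.00 W and the second 1413.88 W, so simultaneous spikes leave far less headroom for the two CPUs and the rest of the system than the nominal figure suggests.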

Apparently the clock frequency can be limited as well.

man nvidia-smi
Running it turned up the following options:
-lgc, --lock-gpu-clocks=MIN_GPU_CLOCK,MAX_GPU_CLOCK
Specifies <minGpuClock,maxGpuClock> clocks as a pair (e.g. 1500,1500) that defines closest desired locked GPU clock speed in MHz. Input can also use be a singular desired
clock value (e.g. <GpuClockValue>). Optionally, --mode can be supplied to specify the clock locking modes. Supported on Volta+. Requires root
--mode=0 (Default)
This mode is the default clock locking mode and provides the highest possible frequency accuracies supported by the hardware.
--mode=1 The clock locking algorithm leverages close loop controllers to achieve frequency accuracies with improved perf per watt for certain class of applications.
Due to convergence latency of close loop controllers, the frequency accuracies may be slightly lower than default mode 0.
-rgc, --reset-gpu-clocks
Resets the GPU clocks to the default value. Supported on Volta+. Requires root.
-ac, --applications-clocks=MEM_CLOCK,GRAPHICS_CLOCK
Specifies maximum <memory,graphics> clocks as a pair (e.g. 2000,800) that defines GPU's speed while running applications on a GPU. Supported on Maxwell-based GeForce
and from the Kepler+ family in Tesla/Quadro/Titan devices. Requires root.
-rac, --reset-applications-clocks
Resets the applications clocks to the default value. Supported on Maxwell-based GeForce and from the Kepler+ family in Tesla/Quadro/Titan devices. Requires root.
The RTX 3090 is Ampere, which is newer than Volta, so -lgc looks like the right option here.

Check the supported clocks:
nvidia-smi -q -d SUPPORTED_CLOCKS
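The supported-clocks listing is long, so a small filter can help pick a value. This is a sketch under assumptions: `highest_clock_at_or_below` is a hypothetical helper, and the usage line assumes the `--query-supported-clocks=gr` CSV query is available on the installed driver.

```shell
# Print the highest clock (MHz) read from stdin, one value per line,
# that does not exceed the target. Prints nothing if none qualifies.
# (highest_clock_at_or_below is a hypothetical helper.)
highest_clock_at_or_below() {
    awk -v target="$1" \
        '$1 + 0 <= target + 0 && $1 + 0 > best { best = $1 + 0 }
         END { if (best) print best }'
}

# Usage on the GPU host (-i 0 selects the first GPU; not run here):
#   nvidia-smi -i 0 --query-supported-clocks=gr \
#       --format=csv,noheader,nounits | highest_clock_at_or_below 1600
```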
Try a maximum of 1600 MHz and a minimum of 210 MHz:
sudo nvidia-smi -lgc 210,1600
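To confirm the lock is taking effect, the current SM clock can be compared against the cap. A minimal sketch, with hypothetical helper names and assuming the `clocks.sm` query field is available; `-rgc` is the reset option from the man page excerpt above.

```shell
# Succeed only if every "index, sm_clock_mhz" CSV line on stdin
# is at or below the cap. (clocks_within_cap is a hypothetical helper.)
clocks_within_cap() {
    awk -F', *' -v cap="$1" \
        '$2 + 0 > cap + 0 { bad = 1 } END { exit bad + 0 }'
}

# Query the current SM clock of every GPU and check it against the cap.
check_clock_lock() {
    nvidia-smi --query-gpu=index,clocks.sm \
        --format=csv,noheader,nounits | clocks_within_cap "$1"
}

# Usage on the GPU host (not run here):
#   check_clock_lock 1600 && echo "all GPUs at or below 1600 MHz"
# To undo the lock:
#   sudo nvidia-smi -rgc
```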

Stable so far. Will keep an eye on it.
This scrap was closed on 2024/09/22.