
Limiting NVIDIA GPU power by both wattage and clock speed

もりりん

On a dual-CPU machine with a 200 V, 2000 W power supply and four RTX 3090s, running training with the CPUs, GPUs, and I/O all under full load makes the machine crash and reboot.


As a first step, I applied only a wattage limit, following the article below.

https://blog.amedama.jp/entry/nvidia-smi-gpu-power-limit

Check the current power readings:
nvidia-smi -q -d POWER

==============NVSMI LOG==============

Timestamp                                 : Thu Aug 15 10:50:28 2024
Driver Version                            : 560.28.03
CUDA Version                              : 12.6

Attached GPUs                             : 4
GPU 00000000:02:00.0
    GPU Power Readings
        Power Draw                        : 231.52 W
        Current Power Limit               : 250.00 W
        Requested Power Limit             : 250.00 W
        Default Power Limit               : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 350.00 W
    Power Samples
        Duration                          : 2.36 sec
        Number of Samples                 : 119
        Max                               : 353.47 W
        Min                               : 123.39 W
        Avg                               : 231.04 W
    GPU Memory Power Readings
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
...
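The verbose report is awkward to total across four GPUs, so a small helper can add up the per-GPU draw instead. This is only a sketch: `sum_watts` is a hypothetical helper name, and it assumes the input is one wattage value per line, which is what the `--query-gpu=power.draw` query with `csv,noheader,nounits` formatting should produce.

```shell
# Hypothetical helper: sum wattage values (one number per line) read from stdin.
sum_watts() {
  awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# Usage on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits | sum_watts
```

Running the pipeline repeatedly under load gives a rough view of combined GPU draw, although it will not catch spikes shorter than the sampling interval.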

Try setting the power limit to 250 W for now:

sudo nvidia-smi -pl 250

Power limit for GPU 00000000:09:00.0 was set to 200.00 W from 350.00 W.

Warning: persistence mode is disabled on device 00000000:09:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.

...

All done.
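The warning above points at persistence mode, and as far as I know a limit set with `-pl` does not survive a reboot, so it has to be reapplied after every crash. A minimal sketch of a wrapper that does both steps follows. The `run` and `set_power_limit` names and the `DRY_RUN` variable are my own assumptions, not part of nvidia-smi; by default the wrapper only prints the commands it would run.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Enable persistence mode, then apply the same power limit (in watts) to all GPUs.
set_power_limit() {
  run sudo nvidia-smi -pm 1
  run sudo nvidia-smi -pl "$1"
}

# Usage: DRY_RUN=0 set_power_limit 200
```

Keeping dry-run as the default makes it easy to check the exact commands before running them as root.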

It stays up longer than before, but after a while it crashes again.


Even at 200 W it crashes after a while.

Against a 2000 W supply, the load is 200 W x 4 + CPU + everything else, so it should fit within the budget...
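The budget can be worked through with a quick calculation. This is a rough sketch under stated assumptions: the 600 W figure for the two CPUs plus everything else is a hypothetical placeholder, not a measurement, and `headroom` is a made-up helper name. The second call uses the 353 W peak from the Power Samples output above in place of the 200 W limit, on the guess that all four GPUs could briefly spike at the same time.

```shell
# Hypothetical helper: remaining PSU headroom in watts.
# Arguments: PSU_WATTS GPU_COUNT WATTS_PER_GPU OTHER_WATTS
headroom() {
  echo $(( $1 - $2 * $3 - $4 ))
}

# Sustained case: four GPUs at the 200 W limit, 600 W assumed for everything else.
headroom 2000 4 200 600   # prints 600

# Transient case: four GPUs momentarily at the 353 W sampled peak.
headroom 2000 4 353 600   # prints -12
```

If short spikes like this are what trips the supply, an average power limit alone would not be enough, which is one possible reason to look at clock limits as well.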


Running man nvidia-smi turned up the following options:

   -lgc, --lock-gpu-clocks=MIN_GPU_CLOCK,MAX_GPU_CLOCK
       Specifies <minGpuClock,maxGpuClock> clocks as a pair (e.g. 1500,1500) that defines closest desired locked GPU clock speed in MHz.  Input can also use be a singular
       desired clock value (e.g. <GpuClockValue>).  Optionally, --mode can be supplied to specify the clock locking modes.  Supported on Volta+.  Requires root

       --mode=0 (Default)
                      This mode is the default clock locking mode and provides the highest possible frequency accuracies supported by the hardware.

       --mode=1       The clock locking algorithm leverages close loop controllers to achieve frequency accuracies with improved perf per watt for certain class of
                      applications. Due to convergence latency of close loop controllers, the frequency accuracies may be slightly lower than default mode 0.

   -rgc, --reset-gpu-clocks
       Resets the GPU clocks to the default value.  Supported on Volta+.  Requires root.

   -ac, --applications-clocks=MEM_CLOCK,GRAPHICS_CLOCK
       Specifies  maximum  <memory,graphics>  clocks as a pair (e.g. 2000,800) that defines GPU's speed while running applications on a GPU.  Supported on Maxwell-based GeForce
       and from the Kepler+ family in Tesla/Quadro/Titan devices.  Requires root.

   -rac, --reset-applications-clocks
       Resets the applications clocks to the default value.  Supported on Maxwell-based GeForce and from the Kepler+ family in Tesla/Quadro/Titan devices.  Requires root.

The RTX 3090 is Ampere generation, which is newer than Volta, so -lgc looks like the right option.


Check the supported clocks:

nvidia-smi -q -d SUPPORTED_CLOCKS

Try a maximum of 1600 MHz and a minimum of 210 MHz:

sudo nvidia-smi -lgc 210,1600
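For repeated experiments it helps to wrap the lock and reset steps so the range is checked before it is applied. This is a minimal sketch: `run`, `lock_gpu_clocks`, `reset_gpu_clocks`, and `DRY_RUN` are hypothetical names of my own, and only the `-lgc` and `-rgc` flags come from the man page quoted above. By default the wrapper only prints the commands.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Lock GPU clocks to the range MIN,MAX (MHz) after a basic sanity check.
lock_gpu_clocks() {
  if [ "$1" -gt "$2" ]; then
    echo "min clock must not exceed max clock" >&2
    return 1
  fi
  run sudo nvidia-smi -lgc "$1,$2"
}

# Undo the lock and return to the default clocks.
reset_gpu_clocks() {
  run sudo nvidia-smi -rgc
}

# Usage: DRY_RUN=0 lock_gpu_clocks 210 1600
# To check the result: nvidia-smi --query-gpu=clocks.sm --format=csv
```

Values passed to `-lgc` should be taken from the SUPPORTED_CLOCKS listing, and `reset_gpu_clocks` restores the defaults if the lock turns out to be too aggressive.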
This scrap was closed on 2024/09/22.