
Trying out NVIDIA SHARP

nariaki3551

https://docs.nvidia.com/networking/display/sharpv370/setting+up+nvidia+sharp+environment

Setup Requirements

Prior to installing and using NVIDIA SHARP, make sure the following requirements are met.
  • Run Aggregation Manager using a "root" user as a trusted entity.

  • Make sure onboard Subnet Manager is disabled in the managed switches. (Aggregation Manager is a central entity running on a dedicated server with a master Subnet Manager. This dedicated server cannot serve as a compute node.)

In addition to the compute nodes, one more server is needed to run the Subnet Manager (SM) and the Aggregation Manager (AM).

  • Configure TCP/IP before running, since NVIDIA SHARP and Aggregation Manager communicate over TCP/IP.
  • Run NVIDIA Switch-IB 2/NVIDIA Quantum/NVIDIA Quantum-2 switches with the supported firmware versions as specified in the Prerequisites section in the Release Notes (use ibdiagnet utility to check the installed firmware version on the switches).
  • Enable the IPoIB interface on the compute servers in order to use UD multicast for result distribution in SHARP (a quick check is sketched after this list).

SHARP uses Unreliable Datagram (UD) multicast to distribute results.

  • Unreliable Datagram is an unreliable transport: packets can be lost, but it offers low latency and efficient multicast. It is typically used where real-time behavior matters more than guaranteed delivery.
  • Here UD is used for the multicast distribution of results, so a result can be delivered efficiently to many receivers at once.
  • Make sure SHARP Aggregation Manager out-of-the-box subnets are configured with SM using the following routing engines:
    • Tree based topologies: updn, ar_updn, ftree, ar_ftree
    • DragonFly+ topology: dfp
    • Hypercube topologies: dor routing engine with dor_hyper_cube_mode enabled
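
A quick sanity check for the IPoIB requirement on a compute node (hedged: the interface name ibs4 and the HCA name mlx5_1 are simply the ones that show up later in this scrap; they may differ on other systems):

$ ip -br addr show ibs4    # the IPoIB interface should be UP with an IP address
$ ibstat mlx5_1            # the port should report State: Active / Physical state: LinkUp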
nariaki3551
ibdiagnet output
$ sudo ibdiagnet
Running version:   "IBDIAGNET 2.12.0.MLNX.1214769","IBDIAG 2.1.1.1214769","IBDM 2.1.1.1214769","IBIS 9.0.0.a9099cb"
Running command:   ibdiagnet 
Running timestamp: 2024-09-21 05:07:11 UTC +0000

Switch label port numbering explanation:
  Quantum2 switch split mode: ASIC/Cage/Port/Split, e.g 1/1/1/1
  Quantum2 switch no split mode: ASIC/Cage/Port
  Quantum switch split mode: Port/Split
  Quantum switch no split mode: Port


----------
Load Plugins from:
/usr/share/ibdiagnet2.1.1/plugins/
(You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH" env variable)

Plugin Name                                   Result     Comment
libibdiagnet_cable_diag_plugin-2.1.1          Succeeded  Plugin loaded
libibdiagnet_phy_diag_plugin-2.1.1            Succeeded  Plugin loaded

---------------------------------------------
Discovery
-I- Using local IB device mlx5_1:1
-I- Start Fabric Discover
-I- Discovering ... 18 Nodes (1 Switches & 17 CAs) discovered.
-I- Fill NodeDesc data
-I- Retrieving... 18/18 Request Port Nodes (1/1 Switches & 17/17 CAs) retrieved.
-I- NodeDesc finished successfully 
-I- Fabric Discover finished successfully


-I- Fill PortInfo data
-I- Retrieving... 33/33 Request Port MADs (17/17 Switch Ports & 16/16 CAs Ports) retrieved.
-I- PortInfo finished successfully

-I- No scope/unhealthy ports files. Total switches/ports [1/41], CAs/ports [17/17]

-I- Build VS Capability SMP
-I- Build VS Capability FW Info SMP
-I- Retrieving... 18/18 Request Port Nodes (1/1 Switches & 17/17 CAs) retrieved.
-I- Build VS Capability Mask SMP
-I- Retrieving... 18/18 Request Port Nodes (1/1 Switches & 17/17 CAs) retrieved.
-I- VS Capability SMP finished successfully

-I- Build VS Extended Port Info
-I- Retrieving... 34/34 Request Port MADs (17/17 Switch Ports & 17/17 CAs Ports) retrieved.
-I- VS ExtendedPortInfo finished successfully

-I- Build VS Capability GMP
-I- Retrieving... 17/17 Request Port Nodes (1/1 Switches & 16/16 CAs) retrieved.
-I- VS Capability GMP finished successfully

-I- Build VS Port Info Extended
-I- Retrieving... 0/0 Request Port MADs (0/0 Switch Ports & 0/0 CAs Ports) retrieved.
-I- Port Info Extended finished successfully

-I- Build Switch Info
-I- Retrieving... 1/1 Request Port Nodes (1/1 Switches & 0/0 CAs) retrieved.
-I- Switch Info retrieving finished successfully

-I- Build Hierarchy Info
-I- Retrieving... 32/32 Request Port MADs (0/0 Switch Ports & 16/16 CAs Ports) retrieved.
-I- Hierarchy Info retrieving finished successfully

-I- Build AR Info
-I- Retrieving... 1/1 Request Port Nodes (1/1 Switches & 0/0 CAs) retrieved.
-I- AR Info retrieving finished successfully

-I- Duplicated GUIDs detection finished successfully

-I- Duplicated Node Description detection finished successfully

-I- Port Hierarchy Info finished successfully

---------------------------------------------
Lids Check
-I- Lids Check finished successfully

---------------------------------------------
Links Check
-I- Links Check finished successfully

---------------------------------------------
Subnet Manager
-I- SM Info retrieving finished successfully

-I- Subnet Manager Check finished successfully

---------------------------------------------
Port Counters
-I- Build PMClassPortInfo
-I- Retrieving... 18/18 Request Port Nodes (1/1 Switches & 17/17 CAs) retrieved.

-I- Build PMPortSampleControl
-I- Retrieving... 34/34 Request Port MADs (17/17 Switch Ports & 17/17 CAs Ports) retrieved.

-I- Build Port Counters
-I- Retrieving... 153/153 Request Port MADs (17/17 Switch Ports & 17/17 CAs Ports) retrieved.

-I- Ports counters retrieving finished successfully

-I- Retrieving... 0/0 Request Port MADs (0/0 Switch Ports & 0/0 CAs Ports) retrieved.
-I- RN counters retrieving finished successfully

-I- Retrieving... 0/0 Request Port MADs (0/0 Switch Ports & 0/0 CAs Ports) retrieved.
-I- HBF counters retrieving finished successfully

-I- Going to sleep for 1 seconds until next counters sample
-I- Time left to sleep ... 1 seconds.


-I- Build Port Counters
-I- Retrieving... 153/153 Request Port MADs (17/17 Switch Ports & 17/17 CAs Ports) retrieved.

-I- Ports counters retrieving (second time) finished successfully

-I- Ports counters value Check finished successfully

-I- Ports counters overflow value Check finished successfully

-I- pFRN Received Error check finished successfully

-I- Ports counters Difference Check (during run) finished successfully

-I- Ports counters delta check finished successfully

---------------------------------------------
Nodes Information
-I- Devid: 4123(0x101b), PSID: MT_0000000223, Latest FW Version:20.39.2048
-I- Devid: 54000(0xd2f0), PSID: MT_0000000062, Latest FW Version:27.2012.1010
-I- FW Check finished successfully

---------------------------------------------
Speed / Width checks
-I- Link Speed Check (Compare to supported link speed)
-I- Links Speed Check finished successfully

-I- Link Width Check (Compare to supported link width)
-I- Links Width Check finished successfully

---------------------------------------------
Virtualization
-I- Build Virtualization Info DB
-I- Retrieving... 0/0 Request Port MADs (0/0 Switch Ports & 0/0 CAs Ports) retrieved.

-I- Build VPort Info DB
-I- Retrieving... 0/0 Request Port MADs (0/0 Switch Ports & 0/0 CAs Ports) retrieved.

-I- Build VPort Info DB
-I- Retrieving... 0/0 Request Port MADs (0/0 Switch Ports & 0/0 CAs Ports) retrieved.

-I- Build VPort GUID Info DB
-I- Retrieving... 0/0 Request Port MADs (0/0 Switch Ports & 0/0 CAs Ports) retrieved.

-I- Build VNode Info DB
-I- Retrieving... 0/0 Request Port MADs (0/0 Switch Ports & 0/0 CAs Ports) retrieved.

-I- Build VPort PKey Table DB
-I- Retrieving... 0/0 Request Port MADs (0/0 Switch Ports & 0/0 CAs Ports) retrieved.

-I- Build Node Description DB
-I- Retrieving... 0/0 Request Port MADs (0/0 Switch Ports & 0/0 CAs Ports) retrieved.

-I- Virtualization finished successfully

-I- Virtual ports retrieving finished successfully

-I- Virtual ports retrieving finished successfully

---------------------------------------------
Partition Keys
-I- Retrieving... 90/90 Request Port MADs (18/18 Switch Ports & 17/17 CAs Ports) retrieved.
-I- Partition Keys retrieving finished successfully

-I- Partition Keys finished successfully

---------------------------------------------
Temperature Sensing
-I- Retrieving... 18/18 Request Port Nodes (1/1 Switches & 17/17 CAs) retrieved.
-I- Temperature Sensing finished successfully

---------------------------------------------
Routers
-I- Retrieving... 0/0 Request Port Nodes (0/0 Switches & 0/0 CAs) retrieved.
-I- Build Routers Info DB finished successfully

-I- Retrieving... 0/0 Request Port Nodes (0/0 Switches & 0/0 CAs) retrieved.
-I- Build Routers Tables finished successfully


-I- Adjacent subnets FLID Table retrieving finished successfully


-I- Routers FLID Table retrieving finished successfully

-I- Local subnet FLID verification finished successfully

-I- Skipping FLID verification
---------------------------------------------
Post Reports Generation
-I- Writing of IBNetdDscover file finished successfully

---------------------------------------------
Fabric Summary

Total Nodes             : 18
IB Switches             : 1
IB Channel Adapters     : 16
IB Aggregation Nodes    : 1
IB Routers              : 0

Adaptive Routing is enabled on 1 switches.

Total number of links   : 17
Links at 4x50           : 17

Master SM: Port=1 LID=25 GUID=0xa088c2030025eb2c devid=4123 Priority:14 Node_Type=CA Node_Description=snail99 HCA-2
Standby SM : No Standby SM

---------------------------------------------
Summary
-I- Stage                               Warnings   Errors     Comment   
-I- Discovery                           0          0         
-I- Lids Check                          0          0         
-I- Links Check                         0          0         
-I- Subnet Manager                      0          0         
-I- Port Counters                       0          0         
-I- Nodes Information                   0          0         
-I- Speed / Width checks                0          0         
-I- Virtualization                      0          0         
-I- Partition Keys                      0          0         
-I- Temperature Sensing                 0          0         
-I- Routers                             0          0         
-I- Post Reports Generation             0          0         

-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log


-I- Database                            : /var/tmp/ibdiagnet2/ibdiagnet2.db_csv
-I- LST                                 : /var/tmp/ibdiagnet2/ibdiagnet2.lst
-I- Network dump                        : /var/tmp/ibdiagnet2/ibdiagnet2.net_dump
-I- Subnet Manager                      : /var/tmp/ibdiagnet2/ibdiagnet2.sm
-I- Ports Counters                      : /var/tmp/ibdiagnet2/ibdiagnet2.pm
-I- RN counters 2                       : /var/tmp/ibdiagnet2/ibdiagnet2.rnc2
-I- Nodes Information                   : /var/tmp/ibdiagnet2/ibdiagnet2.nodes_info
-I- VPorts                              : /var/tmp/ibdiagnet2/ibdiagnet2.vports
-I- VPorts Pkey                         : /var/tmp/ibdiagnet2/ibdiagnet2.vports_pkey
-I- Partition keys                      : /var/tmp/ibdiagnet2/ibdiagnet2.pkey
-I- IBNetDiscover                       : /var/tmp/ibdiagnet2/ibdiagnet2.ibnetdiscover
-I- IBLinkInfo                          : /var/tmp/ibdiagnet2/ibdiagnet2.iblinkinfo
nariaki3551

https://docs.nvidia.com/networking/display/sharpv370/setting+up+nvidia+sharp+environment

On the capabilities and limitations of NVIDIA Quantum

Supports both SHARP low latency and streaming aggregation operations
Supports up to 126 aggregation trees in the subnet (63 low latency trees, and 63 streaming aggregation trees)

As this states, multiple aggregation trees can be created, but streaming aggregation can only be enabled for one tree per switch.


https://docs.nvidia.com/networking/display/sharpv370/using+nvidia+sharp+with+nvidia+nccl

NCCL SHARP Streaming aggregation is supported on a single NCCL communicator/process group (PG). Applications can selectively enable SHARP on specific Process Group (PG) by setting this variable in the application before creating the PG

In other words, SHARP streaming aggregation is supported on only one NCCL communicator/process group at a time; SHARP can be enabled selectively for a given PG by setting the variable in the application before that PG is created.
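
The variable referred to here is presumably NCCL_COLLNET_ENABLE, the knob NCCL uses for its CollNet (SHARP) path. A minimal sketch of enabling it for a whole job, reusing the hosts and the nccl-tests binary that appear later in this scrap (in an application it would instead be set in the environment before the PG is created):

# hedged sketch: turn on NCCL's CollNet/SHARP path for this run
mpirun -np 8 -H snail01:2,snail02:2,snail03:2,snail04:2 \
    -x LD_LIBRARY_PATH -x NCCL_COLLNET_ENABLE=1 \
    /data/fsdp_overlap/nccl-tests/build/all_reduce_perf -g 1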

nariaki3551

Running the SHARP tests

https://docs.nvidia.com/networking/display/sharpv370/testing+nvidia+sharp+setup

$ /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_hello 
[snail02:0:1042245 - context.c:670][2024-09-21 05:31:34] INFO job (ID: 9370539465109193031) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[snail02:0:1042245 - context.c:867][2024-09-21 05:31:34] INFO sharp_job_id:1    resv_key: tree_type:LLT tree_idx:0  treeID:0 caps:0x6 quota:(osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[snail02:0:1042245 - comm.c:400][2024-09-21 05:31:34] INFO [group#:0] job_id:1 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
Test Passed.
$ SHARP_COLL_ENABLE_SAT=1 /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_hello 
[snail02:0:1043911 - context.c:670][2024-09-21 05:35:21] INFO job (ID: 9370539465901257116) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[snail02:0:1043911 - context.c:867][2024-09-21 05:35:22] INFO sharp_job_id:1    resv_key: tree_type:LLT tree_idx:0  treeID:0 caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[snail02:0:1043911 - context.c:882][2024-09-21 05:35:22] INFO sharp_job_id:1    tree_type:SAT tree_idx:1  treeID:64 caps:0x16
[snail02:0:1043911 - comm.c:400][2024-09-21 05:35:22] INFO [group#:0] job_id:1 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[snail02:0:1043911 - comm.c:400][2024-09-21 05:35:22] INFO [group#:1] job_id:1 group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:1 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
Test Passed.
nariaki3551

There is also sharp_coll_test, so let's run it. Judging from Avg BW, roughly 90% of the bandwidth is being used.

$ /data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/bin/sharp_coll_test -c allreduce -d mlx5_1,mlx5_2 -f 4

Allreduce perf test. comm_size:1
   #size(bytes)     Avg lat(us)     Min lat(us)     Max lat(us)    Avg BW(Gb/s)      iters
              4            2.10            1.98            2.43            0.02        100
             16            2.15            1.92           11.53            0.06        100
             64            2.16            1.98           13.23            0.24        100
            256            2.50            2.40            2.90            0.82        100
           1024            3.05            2.90            6.88            2.69        100
           4096            4.78            4.51           13.85            6.85        100
          16384            7.15            6.55           17.65           18.33        100
          65536           12.53           11.85           20.61           41.86        100
         262144           28.81           27.41           40.30           72.79        100
        1048576           96.86           92.75          121.21           86.61        100
        4194304          371.38          362.78          423.78           90.35        100
       16777216         1477.61         1434.89         1598.56           90.83        100
nariaki3551

Running

$HPCX_SHARP_DIR/bin/sharp_coll_dump_config

shows the list of environment variables that can be set.
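
For example, to pull out only the SAT-related knobs (a minimal sketch; assumes HPCX_SHARP_DIR points at the sharp/ directory of the HPC-X tree used above):

$ export HPCX_SHARP_DIR=/data/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp
$ $HPCX_SHARP_DIR/bin/sharp_coll_dump_config | grep -i sat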

nariaki3551

Apparently SHARP can also be used with Open MPI... but it is not getting any faster.
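
For reference, a sketch of the flags involved (hedged: HCOLL_ENABLE_SHARP is the hcoll knob used in the run below; adding SHARP_COLL_ENABLE_SAT=1 would also request the streaming-aggregation tree, as in the sharp_hello test above, though it is not set in the actual runs below):

# hedged sketch, not the exact command used below
mpirun -n 8 -H snail01:2,snail02:2,snail03:2,snail04:2 \
    -x LD_LIBRARY_PATH -x HCOLL_ENABLE_SHARP=4 -x SHARP_COLL_ENABLE_SAT=1 \
    /data/fsdp_overlap/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/ompi/tests/osu-micro-benchmarks/osu_allreduce -m 1024:1073741824 -z -f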

Without SHARP
$ mpirun -n 8 -x LD_LIBRARY_PATH -H snail01:2,snail02:2,snail03:2,snail04:2 /data/fsdp_overlap/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/ompi/tests/osu-micro-benchmarks/osu_allreduce -m 1024:1073741824 -z -f
Warning! Limiting max message size to: 1073741824Increase -M, --mem-limit for higher message sizes.
# OSU MPI Allreduce Latency Test v7.4
# Datatype: MPI_FLOAT.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
1024                    8.45              8.23              8.90        1000              7.20             10.82             27.34
2048                   10.35              9.94             10.78        1000              8.79             14.61             25.94
4096                   15.44             15.08             15.84        1000             13.51             22.59             35.23
8192                   23.88             23.49             24.16        1000             19.41             33.59             58.69
16384                  31.24             30.87             31.65         100             28.63             39.48             53.10
32768                 195.29            193.21            197.76         100            192.60            210.84            232.45
65536                 224.67            219.85            228.97         100            215.67            260.00            304.06
131072                303.78            297.86            308.78         100            301.34            331.08            362.32
262144                429.41            418.18            436.64         100            420.65            476.29            554.43
524288                626.94            612.24            641.88         100            576.22            658.33            790.28
1048576               971.54            925.95            999.60         100            863.50            959.97           3243.55
2097152              1603.99           1583.96           1621.63         100           1416.29           1576.36           5274.27
4194304              2991.15           2917.57           3058.35         100           2633.46           3397.81           7639.12
8388608              4951.27           4896.73           4992.19         100           4768.67           5262.97           9166.15
16777216            10420.22          10325.19          10512.91         100          10085.35          11319.31          15075.35
33554432            23290.01          23070.09          23571.07         100          21960.80          26685.07          30998.52
67108864            49323.81          48339.28          49989.66         100          46279.67          58870.96          76369.06
134217728           97479.00          96315.22          98869.88         100          92695.67         114288.77         137050.02
268435456          188560.61         185910.55         190802.95         100         179097.65         218939.04         273986.66
536870912          383287.45         378003.60         388555.53         100         355080.02         401523.75        1346058.70
With SHARP
$ SHARP_COLL_LOG_L=5 mpirun -n 8 -x HCOLL_ENABLE_SHARP=4 -x LD_LIBRARY_PATH -H snail01:2,snail02:2,snail03:2,snail04:2 /data/fsdp_overlap/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/ompi/tests/osu-micro-benchmarks/osu_allreduce -m 1024:1073741824 -z -f
[snail02:0:1057126 - context.c:153][2024-09-21 06:03:09] TRACE init status:0 world_local_rank:0 group_channel_idx:0 
[snail02:1:1057127 - context.c:153][2024-09-21 06:03:09] TRACE init status:0 world_local_rank:1 group_channel_idx:0 
[snail02:0:1057126 - dev.c:265][2024-09-21 06:03:09] DEBUG num_ports:1 max_channels:1 num_trees:1
[snail02:0:1057126 - dev.c:267][2024-09-21 06:03:09] DEBUG [PORT:0]  name:mlx5_1  port_num:1
[snail02:0:1057126 unique id 3876323329][2024-09-21 06:03:10] DEBUG collect_ports_data: found valid device (device mlx5_1 port 1) in at index 0

[snail02:1:1057127 unique id 3876323329][2024-09-21 06:03:10] DEBUG collect_ports_data: found valid device (device mlx5_1 port 1) in at index 0

[snail02:0:1057126 - context.c:670][2024-09-21 06:03:10] INFO job (ID: 3876323329) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:2 max_group_channels:1, num_trees:1)
[snail02:0:1057126 - context.c:742][2024-09-21 06:03:10] DEBUG SHArP job info: sharp_job_id:1 job_data_len:2456 num_trees:1 max_group_channels:1
[snail02:0:1057126 - context.c:867][2024-09-21 06:03:10] INFO sharp_job_id:1    resv_key: tree_type:LLT tree_idx:0  treeID:0 caps:0x6 quota:(osts:167 user_data_per_ost:1024 max_groups:167 max_qps:2 max_group_channels:1)
[snail02:0:1057126 - dev.c:125][2024-09-21 06:03:10] DEBUG device multi path force cap is enabled
[snail02:1:1057127 - dev.c:125][2024-09-21 06:03:10] DEBUG device multi path force cap is enabled
[snail02:0:1057126 - dev.c:323][2024-09-21 06:03:10] DEBUG SHARP-RAIL[0]  device_name:mlx5_1, port:1
[snail02:0:1057126 - dev.c:499][2024-09-21 06:03:10] DEBUG sharp tree endpoint init. rail_idx :0 tree_idx:0 

[snail02:1:1057127 - dev.c:323][2024-09-21 06:03:10] DEBUG SHARP-RAIL[0]  device_name:mlx5_1, port:1
[snail02:1:1057127 - dev.c:499][2024-09-21 06:03:10] DEBUG sharp tree endpoint init. rail_idx :0 tree_idx:0 

[snail02:0:1057126 - context.c:1106][2024-09-21 06:03:10] DEBUG tree_idx:0 rail_idx:0 endpoint created on device :mlx5_1 port:1
[snail02:0:1057126 - utils/mpool.c:119][2024-09-21 06:03:10] DEBUG mpool sharp_buffer_mpool: align 128, maxelems 4294967295, elemsize 1616
[snail02:0:1057126 - utils/mpool.c:119][2024-09-21 06:03:10] DEBUG mpool sharp_coll_reqs: align 128, maxelems 4294967295, elemsize 176
[snail02:0:1057126 - utils/mpool.c:119][2024-09-21 06:03:10] DEBUG mpool sharp_coll_handles: align 128, maxelems 4294967295, elemsize 336
[snail02:0:1057126 - context.c:1198][2024-09-21 06:03:10] DEBUG PCI RELAXED ORDERING is disabled
[snail02:0:1057126 - dev.c:1141][2024-09-21 06:03:10] DEBUG NULL mr created key:700 device: mlx5_1
[snail02:0:1057126 - shared_utils.c:119][2024-09-21 06:03:10] DEBUG SHARP_COLL_LIB_PATH=/data/fsdp_overlap/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/lib
[snail02:1:1057127 - context.c:1106][2024-09-21 06:03:10] DEBUG tree_idx:0 rail_idx:0 endpoint created on device :mlx5_1 port:1
[snail02:1:1057127 - utils/mpool.c:119][2024-09-21 06:03:10] DEBUG mpool sharp_buffer_mpool: align 128, maxelems 4294967295, elemsize 1616
[snail02:1:1057127 - utils/mpool.c:119][2024-09-21 06:03:10] DEBUG mpool sharp_coll_reqs: align 128, maxelems 4294967295, elemsize 176
[snail02:1:1057127 - utils/mpool.c:119][2024-09-21 06:03:10] DEBUG mpool sharp_coll_handles: align 128, maxelems 4294967295, elemsize 336
[snail02:1:1057127 - context.c:1198][2024-09-21 06:03:10] DEBUG PCI RELAXED ORDERING is disabled
[snail02:1:1057127 - dev.c:1141][2024-09-21 06:03:10] DEBUG NULL mr created key:700 device: mlx5_1
[snail02:1:1057127 - shared_utils.c:119][2024-09-21 06:03:10] DEBUG SHARP_COLL_LIB_PATH=/data/fsdp_overlap/hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/sharp/lib
[snail02:0:1057126 - cuda_util.c:357][2024-09-21 06:03:10] DEBUG GPUDirect RDMA is disabled
[snail02:0:1057126 - utils/mpool.c:119][2024-09-21 06:03:10] DEBUG mpool CUDA Event objects: align 128, maxelems 128, elemsize 16
[snail02:0:1057126 - utils/mpool.c:119][2024-09-21 06:03:10] DEBUG mpool CUDA Stream objects: align 128, maxelems 16, elemsize 16
[snail02:0:1057126 - cuda_util.c:414][2024-09-21 06:03:10] DEBUG GDRCOPY wrapper lib not found. GDRCOPY is disabled. ret:2 
[snail02:0:1057126 - context.c:329][2024-09-21 06:03:10] DEBUG Cannot enable ROCm when CUDA is already enabled. Leaving ROCm disabled
[snail02:1:1057127 - cuda_util.c:357][2024-09-21 06:03:10] DEBUG GPUDirect RDMA is disabled
[snail02:1:1057127 - utils/mpool.c:119][2024-09-21 06:03:10] DEBUG mpool CUDA Event objects: align 128, maxelems 128, elemsize 16
[snail02:1:1057127 - utils/mpool.c:119][2024-09-21 06:03:10] DEBUG mpool CUDA Stream objects: align 128, maxelems 16, elemsize 16
[snail02:0:1057126 - context.c:349][2024-09-21 06:03:10] DEBUG sharp_coll initialized. job_id: 3876323329 init_time: 518582.531
[snail02:0:1057126 - context.c:1498][2024-09-21 06:03:10] DEBUG CAPS: pkt_version:1 dtypes:0x1ff tag_dtypes:0x1ff reduce_ops:0xff7feature_mask:0x1
[snail02:1:1057127 - cuda_util.c:414][2024-09-21 06:03:10] DEBUG GDRCOPY wrapper lib not found. GDRCOPY is disabled. ret:2 
[snail02:1:1057127 - context.c:329][2024-09-21 06:03:10] DEBUG Cannot enable ROCm when CUDA is already enabled. Leaving ROCm disabled
[snail02:1:1057127 - context.c:349][2024-09-21 06:03:10] DEBUG sharp_coll initialized. job_id: 3876323329 init_time: 518625.035
[snail02:1:1057127 - context.c:1498][2024-09-21 06:03:10] DEBUG CAPS: pkt_version:1 dtypes:0x1ff tag_dtypes:0x1ff reduce_ops:0xff7feature_mask:0x1
[snail02:1:1057127 - context.c:1349][2024-09-21 06:03:10] DEBUG External memory register, addr:0x7f04639f8000 len:20971520 device:mlx5_1
[snail02:0:1057126 - context.c:1349][2024-09-21 06:03:10] DEBUG External memory register, addr:0x7f88a49fa000 len:20971520 device:mlx5_1
Warning! Limiting max message size to: 1073741824Increase -M, --mem-limit for higher message sizes.
# OSU MPI Allreduce Latency Test v7.4
# Datatype: MPI_FLOAT.
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations  P50 Tail Lat(us)  P90 Tail Lat(us)  P99 Tail Lat(us)
1024                    8.22              8.00              8.38        1000              7.13              9.06             25.23
2048                   10.01              9.23             10.69        1000              8.59             12.72             24.60
4096                   15.85             15.45             16.47        1000             13.71             24.12             34.09
8192                   23.97             23.73             24.23        1000             19.50             34.47             49.84
16384                  32.01             31.44             32.60         100             28.17             40.84             63.18
32768                 194.92            187.03            199.81         100            188.56            222.48            250.73
65536                 220.45            213.23            225.10         100            215.28            243.54            293.20
131072                304.48            298.21            312.66         100            300.08            337.68            381.62
262144                524.34            513.44            541.02         100            429.61            488.68           4571.81
524288                621.23            609.32            646.78         100            577.45            648.20           2013.19
1048576               955.97            934.45            986.61         100            868.54            971.38           3151.71
2097152              1706.40           1680.68           1751.22         100           1478.48           1888.39           4973.19
4194304              3013.00           2941.97           3056.06         100           2622.79           3626.93           6967.23
8388608              5099.35           5021.64           5157.13         100           4878.14           5574.81           9242.37
16777216            10481.36          10321.47          10671.05         100          10263.71          11322.64          13768.48
33554432            22292.59          21875.13          22599.18         100          22167.36          23274.35          28474.58
67108864            49227.90          48272.39          50312.05         100          46516.84          60475.30          72910.10
134217728          103431.18         101660.72         104785.20         100          92676.84         120259.76         243521.45
268435456          186759.74         184071.52         188923.51         100         179776.49         216161.38         243806.73
536870912          368087.90         362324.00         372837.31         100         355567.76         388817.24         426731.73
[snail02:0:1057126 - context.c:1417][2024-09-21 06:04:38] DEBUG External memory deregister, addr:0x7f88a49fa000 len:20971520 device:mlx5_1
[snail02:1:1057127 - context.c:1417][2024-09-21 06:04:38] DEBUG External memory deregister, addr:0x7f04639f8000 len:20971520 device:mlx5_1
[snail02:0:1057126 - context.c:1417][2024-09-21 06:04:38] DEBUG External memory deregister, addr:(nil) len:18446744073709551615 device:mlx5_1
[snail02:0:1057126 - utils/mpool.c:173][2024-09-21 06:04:38] DEBUG mpool sharp_coll_reqs destroyed
[snail02:0:1057126 - utils/mpool.c:173][2024-09-21 06:04:38] DEBUG mpool sharp_coll_handles destroyed
[snail02:0:1057126 - utils/mpool.c:173][2024-09-21 06:04:38] DEBUG mpool sharp_buffer_mpool destroyed
[snail02:0:1057126 - utils/mpool.c:173][2024-09-21 06:04:38] DEBUG mpool CUDA Event objects destroyed
[snail02:0:1057126 - utils/mpool.c:173][2024-09-21 06:04:38] DEBUG mpool CUDA Stream objects destroyed
[snail02:1:1057127 - context.c:1417][2024-09-21 06:04:38] DEBUG External memory deregister, addr:(nil) len:18446744073709551615 device:mlx5_1
[snail02:1:1057127 - utils/mpool.c:173][2024-09-21 06:04:38] DEBUG mpool sharp_coll_reqs destroyed
[snail02:1:1057127 - utils/mpool.c:173][2024-09-21 06:04:38] DEBUG mpool sharp_coll_handles destroyed
[snail02:1:1057127 - utils/mpool.c:173][2024-09-21 06:04:38] DEBUG mpool sharp_buffer_mpool destroyed
[snail02:1:1057127 - utils/mpool.c:173][2024-09-21 06:04:38] DEBUG mpool CUDA Event objects destroyed
[snail02:1:1057127 - utils/mpool.c:173][2024-09-21 06:04:38] DEBUG mpool CUDA Stream objects destroyed
[snail02:0:1057126 - context.c:1269][2024-09-21 06:04:38] DEBUG SHArP job end
[snail02:1:1057127 - context.c:1293][2024-09-21 06:04:38] DEBUG sharp_coll finalized. job_id: 3876323329
[snail02:0:1057126 - context.c:1293][2024-09-21 06:04:38] DEBUG sharp_coll finalized. job_id: 3876323329
nariaki3551

To enable SHARP with NCCL, it looks like GPUDirect RDMA needs to be enabled.
https://docs.nvidia.com/networking/display/sharpv351/using+nvidia+sharp+with+nvidia+nccl


Download it by following these pages:

https://docs.nvidia.com/holoscan/archive/holoscan-0.4.0/additional_setup.html
https://network.nvidia.com/products/GPUDirect-RDMA/

wget https://www.mellanox.com/downloads/ofed/nvidia-peer-memory_1.1.tar.gz
mv nvidia-peer-memory_1.1.tar.gz nvidia-peer-memory_1.1.orig.tar.gz
tar -xvf nvidia-peer-memory_1.1.orig.tar.gz
cd nvidia-peer-memory-1.1
dpkg-buildpackage -us -uc
sudo dpkg -i ../nvidia-peer-memory_1.1-0_all.deb
sudo dpkg -i ../nvidia-peer-memory-dkms_1.1-0_all.deb
sudo service nv_peer_mem start
sudo service nv_peer_mem status
sudo systemctl enable nv_peer_mem
sudo /lib/systemd/systemd-sysv-install enable nv_peer_mem

Check:

$ /usr/local/cuda/gds/tools/gdscheck -p
nariaki3551

However, nvidia_peermem is now recommended instead of nvidia-peer-memory.

https://docs.nvidia.com/networking/display/gpudirectrdmav18/installing+gpudirect+rdma
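
For reference, a minimal sketch of the nvidia_peermem route (hedged: assumes the module ships with the installed NVIDIA driver / MLNX_OFED, in which case no separate build is needed):

sudo modprobe nvidia_peermem
lsmod | grep nvidia_peermem   # should list the module once it is loaded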


git clone https://github.com/Mellanox/nv_peer_memory.git
cd nv_peer_memory
sudo make
sudo make install
sudo /sbin/modprobe nv_peer_mem
sudo echo "nv_peer_mem" | sudo tee -a /etc/modules

Check:

$ lsmod | grep nv_peer_mem
nv_peer_mem            16384  0
ib_core               348160  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              65785856  4 nvidia_uvm,nv_peer_mem,nvidia_fs,nvidia_modeset
$ /usr/local/cuda/gds/tools/gdscheck -p | grep PeerDirect -A 3
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0

The nccl-tests log also shows that GPUDirect RDMA is enabled.

 NCCL_DEBUG=INFO \
NCCL_DEBUG_SUBSYS=ALL \
mpirun -np 2 -H AAA:1,BBB:1 -x LD_LIBRARY_PATH /data/fsdp_overlap/nccl-tests/build/all_reduce_perf -g 1
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 411554 on    AAA device  0 [0x08] NVIDIA A100-PCIE-40GB
#  Rank  1 Group  0 Pid 1400897 on    BBB device  0 [0x84] Tesla V100-PCIE-16GB
AAA:411554:412264 [0] NCCL INFO P2P plugin v8 IBext_v8
AAA:411554:412264 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB ibs4:192.168.100.217<0>
AAA:411554:412264 [0] NCCL INFO NET/IB : GPU Direct RDMA (nvidia-peermem) enabled for HCA 0 'mlx5_0

nariaki3551

Output with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL

snail01:2226502:2226502 [0] NCCL INFO AllReduce: opCount 39 sendbuff 0x7fb7e2200000 recvbuff 0x7fb7ef800000 count 8388608 datatype 7 op 0 root 0 comm 0x556e536a6770 [nranks=4] stream 0x556e56921160
snail01:2226502:2226502 [0] NCCL INFO 33554432 Bytes -> Algo 3 proto 2 time 3385.443115

The algorithm numbers are defined in the following:

https://github.com/NVIDIA/nccl/blob/2ea4ee94bfb04c886c79ccae60ac9961000fdee2/ext-tuner/example/nccl/tuner.h#L30-L38

https://github.com/NVIDIA/nccl/blob/2ea4ee94bfb04c886c79ccae60ac9961000fdee2/ext-net/example/nccl/types.h#L8-L19

https://github.com/NVIDIA/nccl/blob/2ea4ee94bfb04c886c79ccae60ac9961000fdee2/ext-tuner/example/nccl/tuner.h#L40-L44
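
If I read those headers right (hedged), Algo 3 with proto 2 corresponds to COLLNET_CHAIN with the SIMPLE protocol, i.e. the CollNet (SHARP) path. The enum values can also be checked in a local checkout instead of on GitHub:

# hedged: grep the example tuner header for the algorithm/protocol defines
git clone https://github.com/NVIDIA/nccl.git
grep -n "NCCL_ALGO_\|NCCL_PROTO_" nccl/ext-tuner/example/nccl/tuner.h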

This scrap was closed on 2024/11/18.