IBM Cloud Docs
NVIDIA Hopper 1 cluster network profile

The Hopper 1 cluster network profile (hopper-1) provides isolated networks for NVIDIA Hopper HGX instances running workloads that require high-bandwidth, low-latency interconnectivity, such as AI training and large-scale simulations.

The Hopper 1 cluster network profile, which supports both NVIDIA H100 and H200 instance profiles, replaces the deprecated H100 cluster network profile.

Overview

The Hopper 1 cluster network profile supports Remote Direct Memory Access (RDMA) over the Converged Ethernet version 2 (RoCEv2) network protocol for increased throughput, reduced latency, and improved system performance.
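As a quick sanity check (a sketch, not an official verification step; the sysfs path is the standard location exposed by RDMA-capable Linux kernels), you can confirm that the RoCE-capable NICs are visible inside the instance:

```shell
# List RDMA devices exposed by the kernel. On a Hopper 1 instance with an
# 8-subnet cluster network you would expect mlx5_0 through mlx5_7.
if [ -d /sys/class/infiniband ] && [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
    ls /sys/class/infiniband
else
    echo "no RDMA devices visible"
fi
```

If no devices appear, verify that the cluster network attachments were created for the instance before troubleshooting NCCL itself.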

For a list of supported instance profiles with this cluster network, see Hopper HGX instance profiles.

Availability

The Hopper 1 cluster network profile is currently available in the following regions and zones:

Supported regions and zones for Hopper 1
Region                  | Zone      | Universal zone name
Frankfurt (eu-de)       | eu-de-2   | eu-de-fra04-a
Washington DC (us-east) | us-east-3 | us-east-wdc07-a

To understand how various regions correspond to zones, see zone mapping per account.

Capabilities and restrictions

The Hopper 1 cluster network profile has the following capabilities and restrictions:

  • Type: Dedicated
  • Bandwidth: 3.2 Tbps (8x 400 Gbps)
  • Custom routing tables: No
  • Dynamic route servers: No
  • Endpoint gateways: No
  • Floating IPs: No
  • Flow logs: No
  • LBaaS: No
  • NVLink: Yes (900 GB/s)
  • Private Path: No
  • Public gateway: No
  • Reserved IPs: Yes
  • Secondary IPs: No
  • VPN: No

Tested NCCL configuration

The NVIDIA Collective Communications Library (NCCL) can determine the optimal paths between system components, including GPUs and NICs, by referencing PCI topology information provided by the virtual server instance (VSI). The topology file referenced here was tested on an H100 VSI with eight cluster subnets, and can be supplied to NCCL through the NCCL_TOPO_FILE environment variable.

The following information provides tested NCCL tunings for an H100 VM profile with an 8-subnet cluster network. All testing was done on NCCL version 2.22.3. For more information, see the NVIDIA NCCL documentation.

export NCCL_IB_PCI_RELAXED_ORDERING=2
export NCCL_IB_QPS_PER_CONNECTION=16
export NCCL_IB_ADAPTIVE_ROUTING=1
export NCCL_IB_TIMEOUT=22
export NCCL_IB_RETRY_CNT=10
export NCCL_CHECKS_DISABLE=1
export NCCL_CHECK_POINTERS=0
export NCCL_CROSS_NIC=2
export NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 # valid for an 8-subnet cluster network
export NCCL_TOPO_FILE=<path-to-xml-topology-file> # Sample file provided below; valid for a gx3d-160x1792x8h100 profile VSI with an 8-subnet cluster network
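One way to apply these tunings consistently is to collect them into a single environment file that every rank sources before launch. The sketch below is illustrative: the file path is an assumption, and only a subset of the variables above is shown.

```shell
# Collect the tested tunings in one file so every rank uses identical values.
cat > /tmp/nccl_env.sh <<'EOF'
export NCCL_IB_QPS_PER_CONNECTION=16
export NCCL_IB_TIMEOUT=22
export NCCL_IB_RETRY_CNT=10
export NCCL_CROSS_NIC=2
EOF

# Source the file and confirm the values are in effect for this shell.
. /tmp/nccl_env.sh
echo "QPS=$NCCL_IB_QPS_PER_CONNECTION TIMEOUT=$NCCL_IB_TIMEOUT"
# Prints: QPS=16 TIMEOUT=22
```

A launcher such as mpirun or torchrun would then need to source this file (or propagate the variables through its own environment-forwarding mechanism) on every node before starting the training job.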