IBM Cloud Docs
NVIDIA Hopper 1 cluster network profile

The Hopper 1 cluster network profile (hopper-1) provides isolated networks for NVIDIA Hopper HGX instances running workloads that require high-bandwidth, low-latency interconnectivity, such as AI training and large-scale simulations.

The Hopper 1 cluster network profile, which supports both NVIDIA H100 and H200 instance profiles, replaces the deprecated H100 cluster network profile.

Overview

The Hopper 1 cluster network profile supports Remote Direct Memory Access (RDMA) over the Converged Ethernet version 2 (RoCEv2) network protocol for increased throughput, reduced latency, and improved system performance.
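As a quick sanity check (a sketch, not an official verification step; the sysfs path is the standard location exposed by RDMA-capable Linux kernels), you can confirm that the RoCE-capable NICs are visible inside the instance:

```shell
# List RDMA devices exposed by the kernel. On a Hopper 1 instance with an
# 8-subnet cluster network you would expect mlx5_0 through mlx5_7.
if [ -d /sys/class/infiniband ] && [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
    ls /sys/class/infiniband
else
    echo "no RDMA devices visible"
fi
```

If no devices appear, verify that the cluster network attachments were created for the instance before troubleshooting NCCL itself.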

For a list of supported instance profiles with this cluster network, see Hopper HGX instance profiles.

Availability

The Hopper 1 cluster network profile is currently available in the following regions and zones:

Supported regions and zones for Hopper 1
Region                  | Zone      | Universal zone name
Frankfurt (eu-de)       | eu-de-2   | eu-de-fra04-a
Washington DC (us-east) | us-east-3 | us-east-wdc07-a

To understand how various regions correspond to zones, see zone mapping per account.

Capabilities and restrictions

The Hopper 1 cluster network profile has the following capabilities and restrictions:

  • Type: Dedicated
  • Bandwidth: 3.2 Tbps (8x 400 Gbps)
  • Custom routing tables: No
  • Dynamic route servers: No
  • Endpoint gateways: No
  • Floating IPs: No
  • Flow logs: No
  • LBaaS: No
  • NVLink: Yes (900 GB/s)
  • Private Path: No
  • Public gateway: No
  • Reserved IPs: Yes
  • Secondary IPs: No
  • VPN: No

Tested NCCL configuration

The NVIDIA Collective Communications Library (NCCL) can determine the optimal paths between system components, including GPUs and NICs, by referencing PCI topology information provided by the virtual server instance (VSI). The topology file referenced here was tested on an H100 VSI with eight cluster subnets, and can be supplied to NCCL through the NCCL_TOPO_FILE environment variable.

The following information provides tested NCCL tunings for an H100 VM profile with an 8-subnet cluster network. All testing was done on NCCL version 2.22.3. For more information, see the NVIDIA NCCL documentation.

export NCCL_IB_PCI_RELAXED_ORDERING=2
export NCCL_IB_QPS_PER_CONNECTION=16
export NCCL_IB_ADAPTIVE_ROUTING=1
export NCCL_IB_TIMEOUT=22
export NCCL_IB_RETRY_CNT=10
export NCCL_CHECKS_DISABLE=1
export NCCL_CHECK_POINTERS=0
export NCCL_CROSS_NIC=2
export NCCL_ASYNC_ERROR_HANDLING=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 # valid for an 8-subnet cluster network
export NCCL_TOPO_FILE=<path-to-xml-topology-file> # Sample file provided below; valid for a gx3d-160x1792x8h100 profile VSI with an 8-subnet cluster network
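One way to apply these tunings consistently is to collect them into a single environment file that every rank sources before launch. The sketch below is illustrative: the file path is an assumption, and only a subset of the variables above is shown.

```shell
# Collect the tested tunings in one file so every rank uses identical values.
cat > /tmp/nccl_env.sh <<'EOF'
export NCCL_IB_QPS_PER_CONNECTION=16
export NCCL_IB_TIMEOUT=22
export NCCL_IB_RETRY_CNT=10
export NCCL_CROSS_NIC=2
EOF

# Source the file and confirm the values are in effect for this shell.
. /tmp/nccl_env.sh
echo "QPS=$NCCL_IB_QPS_PER_CONNECTION TIMEOUT=$NCCL_IB_TIMEOUT"
# Prints: QPS=16 TIMEOUT=22
```

A launcher such as mpirun or torchrun would then need to source this file (or propagate the variables through its own environment-forwarding mechanism) on every node before starting the training job.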