Benchmarking artificial intelligence / machine learning (AI/ML) cluster fabrics with realistic workloads typically requires investment in computing systems with GPUs and remote direct memory access (RDMA) network interface controllers (NICs), which are costly and time-consuming to build and operate. Yet deploying and operating these systems for large-scale validation and experimentation in the lab is necessary to fully optimize AI networks. Proper benchmarking and testing of AI networks requires control over parameters such as cluster configuration, congestion control, workload algorithms, job data size, traffic profile, and NIC performance.
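The sketch below illustrates how these parameters might be grouped into a single benchmark configuration. The class, field names, and default values are hypothetical, chosen only to make the list of knobs above concrete; they do not reflect any specific tool's API.

```python
# A minimal sketch of the parameters an AI fabric benchmark run typically exposes.
# All names and values are illustrative assumptions, not a real product's settings.
from dataclasses import dataclass

@dataclass
class AIBenchmarkConfig:
    cluster_size: int = 128              # number of emulated GPU/NIC endpoints
    congestion_control: str = "dcqcn"    # congestion control scheme for the RoCEv2 fabric
    collective: str = "all_reduce"       # workload algorithm under test
    job_data_size: int = 1 << 30         # bytes moved per collective (1 GiB)
    traffic_profile: str = "uniform"     # spatial/temporal pattern of emulated flows
    nic_speed_gbps: int = 400            # per-endpoint line rate

config = AIBenchmarkConfig()
print(config)
```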
Generating realistic, high-scale AI workload traffic for network benchmarking requires RDMA / RDMA over Converged Ethernet (RoCEv2) endpoint emulators and software with prepackaged methodologies that support collective communication patterns such as all-to-all, all-reduce, and all-gather. The software generates the data workloads specific to AI networks and measures key metrics such as job completion time and algorithm and bus bandwidth, providing insight into network fabric performance.
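As a point of reference, the sketch below shows how algorithm and bus bandwidth are commonly derived from job completion time, following the convention popularized by nccl-tests: algorithm bandwidth is the data size divided by completion time, and bus bandwidth scales it by a per-collective factor reflecting how much data each rank actually places on the wire. The function name and factor table are illustrative of that convention, not a particular tool's API.

```python
# Illustrative sketch (assumed names), following the nccl-tests bandwidth convention.

def bandwidths(collective: str, data_bytes: int, completion_time_s: float, ranks: int):
    """Return (algorithm_bw, bus_bw) in GB/s for one collective operation."""
    factors = {
        "all_reduce": 2 * (ranks - 1) / ranks,   # each rank sends and receives all but its own shard, twice
        "all_gather": (ranks - 1) / ranks,
        "reduce_scatter": (ranks - 1) / ranks,
        "all_to_all": (ranks - 1) / ranks,
    }
    alg_bw = data_bytes / completion_time_s / 1e9
    return alg_bw, alg_bw * factors[collective]

# Example: a 1 GiB all-reduce across 64 emulated endpoints completing in 25 ms
alg, bus = bandwidths("all_reduce", 1 << 30, 0.025, 64)
print(f"algorithm bw = {alg:.1f} GB/s, bus bw = {bus:.1f} GB/s")
```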