Benchmarking artificial intelligence / machine learning (AI/ML) cluster fabrics with realistic workloads typically requires investment in computing systems with GPUs and remote direct memory access (RDMA) network interface controllers (NICs), which are costly and time-consuming to build and operate. Yet deploying and operating these systems for large-scale validation and experimentation in the lab is necessary to fully optimize AI networks. Proper benchmarking and testing of AI networks requires control over parameters such as cluster configuration, congestion control, workload algorithms, job data size, traffic profile, and NIC performance.
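The sketch below illustrates how these parameters might be grouped into a single benchmark configuration. The class, field names, and default values are hypothetical, chosen only to make the list of knobs above concrete; they do not reflect any specific tool's API.

```python
# A minimal sketch of the parameters an AI fabric benchmark run typically exposes.
# All names and values are illustrative assumptions, not a real product's settings.
from dataclasses import dataclass

@dataclass
class AIBenchmarkConfig:
    cluster_size: int = 128              # number of emulated GPU/NIC endpoints
    congestion_control: str = "dcqcn"    # congestion control scheme for the RoCEv2 fabric
    collective: str = "all_reduce"       # workload algorithm under test
    job_data_size: int = 1 << 30         # bytes moved per collective (1 GiB)
    traffic_profile: str = "uniform"     # spatial/temporal pattern of emulated flows
    nic_speed_gbps: int = 400            # per-endpoint line rate

config = AIBenchmarkConfig()
print(config)
```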
Generating realistic, high-scale AI workload traffic for network benchmarking requires RDMA / RDMA over Converged Ethernet (RoCEv2) endpoint emulators and software with prepackaged methodologies that support collective communication patterns such as all-to-all, all-reduce, and all-gather. The software generates the data workloads specific to AI networks and measures key metrics such as job completion time and algorithm and bus bandwidth, providing insight into network fabric performance.
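As a point of reference, the sketch below shows how algorithm and bus bandwidth are commonly derived from job completion time, following the convention popularized by nccl-tests: algorithm bandwidth is the data size divided by completion time, and bus bandwidth scales it by a per-collective factor reflecting how much data each rank actually places on the wire. The function name and factor table are illustrative of that convention, not a particular tool's API.

```python
# Illustrative sketch (assumed names), following the nccl-tests bandwidth convention.

def bandwidths(collective: str, data_bytes: int, completion_time_s: float, ranks: int):
    """Return (algorithm_bw, bus_bw) in GB/s for one collective operation."""
    factors = {
        "all_reduce": 2 * (ranks - 1) / ranks,   # each rank sends and receives all but its own shard, twice
        "all_gather": (ranks - 1) / ranks,
        "reduce_scatter": (ranks - 1) / ranks,
        "all_to_all": (ranks - 1) / ranks,
    }
    alg_bw = data_bytes / completion_time_s / 1e9
    return alg_bw, alg_bw * factors[collective]

# Example: a 1 GiB all-reduce across 64 emulated endpoints completing in 25 ms
alg, bus = bandwidths("all_reduce", 1 << 30, 0.025, 64)
print(f"algorithm bw = {alg:.1f} GB/s, bus bw = {bus:.1f} GB/s")
```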