AI in Intelligent and Lossless Data Center Network

The AI era is accelerating. AI is no longer just a data model in a lab; the industry is actively exploring how to bring AI applications into production. The compound annual growth rate (CAGR) of AI adoption in the government, finance, Internet, new retail, new manufacturing, and healthcare industries is expected to exceed 30% over the next three years. With AI arriving so quickly, is the underlying network infrastructure that provides key support for AI development ready?

Algorithms, computing power, and data are the three driving forces of AI development. Today, breakthroughs have been made in deep learning algorithms. However, algorithm-driven intelligence relies heavily on enormous volumes of sample data and high-performance computing. To improve the data processing efficiency of AI, revolutionary changes have taken place in both the storage and computing fields.

Storage media have evolved from hard disk drives (HDDs) to solid-state drives (SSDs) to meet real-time data access requirements, reducing media latency by more than 100 times. In terms of computing power, the industry has turned to GPU servers and even dedicated AI chips, improving data processing capability more than 100-fold.

As a result, network communication latency has become the bottleneck for further performance improvement. The share of communication latency in the end-to-end (E2E) storage latency has grown from 10% to 60%; in other words, the storage medium sits idle waiting for the network for more than half of each storage access. Computing faces a similar bottleneck. For example, in one speech recognition training job, each iteration takes 650 ms to 700 ms, of which communication accounts for about 400 ms. The expensive processors therefore spend more than half of each iteration waiting for model parameters to be synchronized over the network, as the calculation below illustrates.
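A quick back-of-the-envelope calculation makes the imbalance concrete. The minimal Python sketch below simply recomputes the communication share from the figures quoted above; the timing numbers come from the article, while the function name is illustrative only.

    def communication_share(total_ms: float, comm_ms: float) -> float:
        """Fraction of an end-to-end task spent waiting on network communication."""
        return comm_ms / total_ms

    # Speech recognition training: 650-700 ms per iteration, ~400 ms of it communication.
    for total_ms in (650, 700):
        share = communication_share(total_ms, 400)
        print(f"iteration {total_ms} ms -> communication share {share:.0%}")
    # iteration 650 ms -> communication share 62%
    # iteration 700 ms -> communication share 57%

Whether the medium is an SSD or a GPU, making it faster yields little benefit while roughly 60% of every operation is spent waiting on the network.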

So the answer to whether the underlying network infrastructure that supports AI development is ready is, at best, only half yes.

Figure: Network communication becomes the bottleneck of system performance

RDMA Migration and New Requirements on AI Networks

Replacing TCP/IP with Remote Direct Memory Access (RDMA) has become the trend as AI computing and SSD-based distributed storage pursue ultimate performance. Two network solutions can carry RDMA: the dedicated InfiniBand network and the traditional Ethernet network.

  • InfiniBand is a network communication standard designed for high-performance computing. Unlike the traditional TCP/IP stack, InfiniBand defines its own network- and transport-layer protocols. Because most live networks are IP Ethernet networks, a dedicated InfiniBand network cannot easily meet the large-scale interconnection needs of AI computing and distributed storage systems. In addition, as a dedicated network technology, InfiniBand cannot reuse the O&M experience and platforms accumulated on IP networks.
  • RDMA over a traditional IP Ethernet network lacks a complete packet-loss protection mechanism. Once the packet loss ratio exceeds 10⁻³, RDMA throughput decreases sharply (the sketch after this section illustrates why). Moreover, the existing RDMA congestion control and scheduling algorithms easily cause queue congestion on network devices, which may introduce system-level risks.

Therefore, RDMA must be carried over an open Ethernet network that delivers zero packet loss and high throughput.
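To see why a seemingly small loss ratio is so damaging, note that commodity RoCE NICs have traditionally recovered from loss with go-back-N retransmission, in which a single dropped packet forces the sender to resend the whole remaining window. The following is a minimal first-order goodput model under that assumption; the window size and loss ratios are illustrative values, not measurements from the article.

    def gbn_goodput(loss_ratio: float, window: int) -> float:
        """Approximate goodput fraction under go-back-N loss recovery.

        Each loss triggers retransmission of roughly one full window, so the
        expected sends per delivered packet is ~1 + loss_ratio * window.
        First-order model: ignores timeouts and repeated losses of a packet.
        """
        return 1.0 / (1.0 + loss_ratio * window)

    # Illustrative window of 1,000 in-flight packets (a high bandwidth-delay product).
    for p in (1e-5, 1e-4, 1e-3, 1e-2):
        print(f"loss ratio {p:.0e} -> goodput {gbn_goodput(p, 1000):.0%}")
    # loss ratio 1e-05 -> goodput 99%
    # loss ratio 1e-04 -> goodput 91%
    # loss ratio 1e-03 -> goodput 50%
    # loss ratio 1e-02 -> goodput 9%

Under this model, a loss ratio of 10⁻³, which would barely register for TCP, already halves RDMA goodput. This is why a lossless Ethernet fabric is a precondition rather than an optimization.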

AI Fabric Practice and Future

Huawei's AI Fabric intelligent and lossless data center network solution won the "Best of Show Award" at Interop Tokyo 2018 and has passed the strict testing and verification of the European Advanced Networking Test Center (EANTC). In all test cases for high-performance computing and distributed storage, AI Fabric achieved high throughput with zero packet loss, shortened inter-node HPC communication time by 40% through network latency optimization, and greatly improved the efficiency of innovative services such as AI training.

AI Fabric has already been deployed in the Internet and finance industries. At one retail bank, the intelligent congestion scheduling of AI Fabric accelerated network communication: in an on-site test, the IOPS of the storage cluster improved by 20%, with a single volume reaching 350,000 IOPS. AI Fabric thus accelerates the bank's branch cloud and gives users the same experience as accessing local disks.

Author's Bio

George Zhao

Director, OSS & Ecosystem, America Research Center, Huawei Technologies Co., Ltd.