The AI era is accelerating: AI is no longer just a model in a research lab, and industries are actively exploring how to put AI applications into production. The compound annual growth rate (CAGR) of AI adoption in the government, finance, Internet, new retail, new manufacturing, and healthcare industries is expected to exceed 30% over the next three years. With AI arriving so quickly, is the underlying network infrastructure that provides key support for AI development actually ready?
Algorithms, computing power, and data are the three driving forces of AI development. Deep learning algorithms have achieved genuine breakthroughs, but algorithm-driven intelligence depends heavily on enormous sample data sets and high-performance computing. To improve AI data processing efficiency, revolutionary changes have taken place in both the storage and computing fields.
Storage media have evolved from hard disk drives (HDDs) to solid state drives (SSDs) to meet real-time data access requirements, cutting media latency by more than 100 times. On the computing side, the industry has moved to GPU servers and even dedicated AI chips, improving data processing capability more than 100-fold.
Network communication latency has therefore become the bottleneck for further performance gains. The share of communication latency in the end-to-end storage latency has grown from 10% to 60%; in other words, the storage medium sits idle waiting on the network for more than half of each storage access. Computing faces a similar bottleneck. In one voice recognition training workload, each iteration takes 650 ms to 700 ms, of which communication accounts for 400 ms: the expensive processors spend more than half of every iteration waiting for model parameters to synchronize.
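To make these numbers concrete, here is a minimal back-of-the-envelope sketch in Python (my own illustration, not from the source) that computes the communication share of an iteration and the best-case overall speedup, via Amdahl's law, if only the network part were accelerated. The 650-700 ms and 400 ms figures are the ones quoted above; the 2x communication-reduction factor is a hypothetical.

```python
def comm_share(total_ms: float, comm_ms: float) -> float:
    """Fraction of total time spent waiting on communication."""
    return comm_ms / total_ms


def amdahl_speedup(comm_fraction: float, comm_reduction: float) -> float:
    """Overall speedup when only the communication portion is reduced
    by a factor of comm_reduction (Amdahl's law)."""
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / comm_reduction)


if __name__ == "__main__":
    # Figures quoted in the text: 650-700 ms per iteration, 400 ms of it communication.
    for iteration_ms in (650.0, 700.0):
        share = comm_share(iteration_ms, 400.0)
        print(f"iteration {iteration_ms:.0f} ms -> communication share {share:.0%}")
        # Hypothetical scenario: the network halves communication time.
        print(f"  halving comm time -> overall speedup {amdahl_speedup(share, 2.0):.2f}x")
```

At a roughly 60% communication share, even halving communication time speeds up the whole iteration by about 1.4x, which is why network optimization pays off so directly.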
So, to the question of whether the underlying network infrastructure that supports AI development is ready, the answer is: only half ready.
Figure: Network communication becomes the weak link in system performance
The Migration to RDMA and New Requirements for AI Networks
Replacing TCP/IP with RDMA has become the trend as AI computing and SSD-based distributed storage pursue ultimate performance. There are two network bearer solutions for RDMA: a dedicated InfiniBand network and the traditional Ethernet network. A dedicated InfiniBand network is a closed ecosystem that is costly to build and operate alongside the existing network, while RDMA on traditional Ethernet is extremely sensitive to packet loss, which collapses its throughput. Therefore, RDMA must be carried over an open Ethernet network that delivers zero packet loss and high throughput.
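Why is zero packet loss non-negotiable? The sketch below is a deliberately simplified model (my own assumption, not part of the solution described here): early RoCE NICs recover losses with go-back-N, so one dropped packet forces retransmission of much of the in-flight window, and even tiny loss rates waste a large share of the link.

```python
def go_back_n_goodput(loss_rate: float, window_pkts: int) -> float:
    """Approximate useful fraction of link capacity under go-back-N recovery.

    Assumption: each lost packet triggers retransmission of, on average,
    half of the in-flight window, so the expected number of packets sent
    per useful packet is 1 + loss_rate * window_pkts / 2.
    """
    return 1.0 / (1.0 + loss_rate * window_pkts / 2.0)


if __name__ == "__main__":
    WINDOW = 1000  # assumed number of packets in flight on a high-bandwidth link
    for loss in (0.0, 0.0001, 0.001, 0.01):
        print(f"loss rate {loss:.2%}: goodput ~ {go_back_n_goodput(loss, WINDOW):.0%} of line rate")
```

Under these assumptions, a loss rate of just 0.1% already costs about a third of the line rate, which is why RDMA over Ethernet depends on a lossless fabric rather than on faster retransmission.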
AI Fabric Practice and Future
The AI Fabric intelligent and lossless data center network solution won the “Best of Show Award” at Interop Tokyo 2018, and has passed rigorous testing and verification by EANTC. In all test cases for high-performance computing and distributed storage, AI Fabric achieved high throughput with zero packet loss, shortened inter-node HPC communication time by 40% through network latency optimization, and markedly improved the efficiency of innovative services such as AI training.
AI Fabric is already deployed in the Internet and finance industries. At one retail bank, AI Fabric's intelligent congestion scheduling accelerated network communication: in an on-site test, the storage cluster's IOPS improved by 20% and single-volume performance reached 350,000 IOPS. AI Fabric accelerates the bank's branch cloud and gives users the same experience as accessing local disks.