The evolution of artificial intelligence, particularly the rise of large language models (LLMs) and deep learning, has created an immense demand for computational power, supplied primarily by GPUs. However, maximizing GPU efficiency requires a robust networking infrastructure that delivers minimal latency and zero packet loss. This post highlights the critical role of Data Center Bridging (DCB), with a focus on its two key components, Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN), in achieving the lossless, high-performance networks that large-scale GPU deployments for AI applications require.
The Background: Network Challenges in Large GPU Clusters
AI workloads generate significant east-west traffic due to frequent exchanges of parameters and activations among GPUs. This synchronized “incast” traffic can overwhelm standard Ethernet networks, causing buffer overflows and packet drops. Such packet loss degrades AI training and inference by stalling pipelines, extending job completion times, and potentially affecting model convergence.
When large models span multiple GPUs, inference requires sequential data transfers between GPUs. Network congestion or packet loss stalls the pipeline, increasing latency, reducing throughput, and lowering queries per second (QPS). The lossless design of DCB ensures smooth data flow, safeguarding the return on investment in GPU hardware.
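To make the QPS impact concrete, here is a minimal sketch of a pipeline-parallel inference path. The stage count, per-stage compute time, and stall duration are illustrative assumptions, not measured values.

```python
# Toy model of a pipeline-parallel inference path: steady-state QPS
# is bounded by the slowest stage (compute time plus any network
# stall on the hop to the next GPU). All numbers are hypothetical.

def pipeline_metrics(stages: int, compute_ms: float, stall_ms: float):
    per_stage_ms = compute_ms + stall_ms
    latency_ms = stages * per_stage_ms   # time for one query to traverse all GPUs
    qps = 1000.0 / per_stage_ms          # throughput once the pipeline is full
    return latency_ms, qps

for stall in (0.0, 5.0):
    latency, qps = pipeline_metrics(stages=8, compute_ms=5.0, stall_ms=stall)
    print(f"stall={stall} ms -> latency {latency:.0f} ms, {qps:.0f} QPS")
```

Even in this simplified model, a 5 ms network stall per hop halves QPS and doubles end-to-end latency.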
During fine-tuning across hundreds of GPUs, collective communication operations like All-Reduce demand continuous data exchange. Network congestion causes GPUs to idle while waiting for data, wasting compute cycles and increasing costs.
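For a rough sense of the traffic an All-Reduce generates, the sketch below computes the bytes each GPU puts on the wire during a ring All-Reduce. The model size, precision, and GPU count are illustrative assumptions.

```python
# Per-GPU network traffic for a ring All-Reduce of gradients.
# Each GPU transmits 2*(N-1)/N of the payload across the
# reduce-scatter and all-gather phases. Numbers are illustrative.

def ring_allreduce_bytes_per_gpu(num_gpus: int, payload_bytes: int) -> int:
    return int(2 * (num_gpus - 1) / num_gpus * payload_bytes)

params = 7_000_000_000            # hypothetical 7B-parameter model
grad_bytes = params * 2           # fp16 gradients, 2 bytes each
per_gpu = ring_allreduce_bytes_per_gpu(num_gpus=256, payload_bytes=grad_bytes)
print(f"{per_gpu / 1e9:.1f} GB sent per GPU per All-Reduce step")
```

At this scale, every GPU moves tens of gigabytes per synchronization step, so even brief congestion translates directly into idle compute.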
The Solution: Data Center Bridging (DCB) to Support AI Networks
DCB, a suite of IEEE 802.1 standards, enhances Ethernet to support the low-latency, lossless communication essential for AI workloads. Its core technologies, PFC and ECN, keep GPU traffic flowing without interruption: PFC prevents packet loss, and ECN manages flow rates proactively to minimize idle time and maintain high GPU utilization.
DCB includes two key components:
1. Priority-based Flow Control (PFC)
PFC (IEEE 802.1Qbb) refines traditional Ethernet PAUSE frames by selectively pausing traffic based on priority classes rather than halting all traffic. The mechanism works as follows:
- Traffic is mapped to one of eight priority classes using the 802.1p Priority Code Point (PCP) in the VLAN tag.
- When a receiver's ingress buffer for a class crosses a high (XOFF) threshold, it sends a PFC pause frame identifying that class and a pause duration.
- The upstream sender pauses only the indicated class; traffic in other classes continues flowing.
- Once the buffer drains below a resume (XON) threshold, the receiver lets transmission resume.
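A minimal sketch of this pause logic for a single priority class follows; the XOFF/XON thresholds here are illustrative, since real switches derive them from port speed, cable length, and the headroom needed to absorb frames already in flight.

```python
# Minimal model of PFC XOFF/XON signaling for one priority class.
# Thresholds (in buffer cells) are illustrative; real values depend
# on link speed and cable length, with headroom reserved to absorb
# in-flight frames after XOFF is sent.

class PfcIngressQueue:
    def __init__(self, xoff_cells: int = 800, xon_cells: int = 600):
        self.depth = 0
        self.xoff = xoff_cells
        self.xon = xon_cells
        self.paused = False

    def enqueue(self, cells: int) -> None:
        self.depth += cells
        if not self.paused and self.depth >= self.xoff:
            self.paused = True
            print("-> send PFC pause frame for this priority class")

    def dequeue(self, cells: int) -> None:
        self.depth = max(0, self.depth - cells)
        if self.paused and self.depth <= self.xon:
            self.paused = False
            print("-> send PFC resume (zero-quanta pause frame)")
```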
PFC benefits AI workloads by preventing packet loss, isolating high-priority GPU traffic from other flows, and delivering predictable performance with reduced jitter and improved job completion times. However, PFC alone may cause “pause storms” if not finely tuned, necessitating the complementary use of ECN.
2. Explicit Congestion Notification (ECN)
ECN (RFC 3168) acts as a proactive congestion control mechanism by marking packets instead of dropping them:
- Endpoints mark their packets as ECN-Capable Transport (ECT) in the IP header.
- When a switch queue exceeds a configured threshold, the switch sets the Congestion Experienced (CE) codepoint on passing packets rather than dropping them.
- The receiver echoes the congestion signal back to the sender; for RoCEv2 this is done with a Congestion Notification Packet (CNP).
- The sender reduces its transmission rate before buffers overflow.
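Switches commonly apply WRED-style marking: no packets are marked below a minimum queue depth, all packets are marked above a maximum, and the probability ramps linearly in between. The sketch below illustrates this; the thresholds and maximum marking probability are illustrative assumptions.

```python
import random

# WRED-style ECN marking: below k_min nothing is marked, above k_max
# every packet is marked, and the probability ramps linearly between
# the two. Thresholds (KB of queue depth) are illustrative.

def ecn_mark(queue_kb: float, k_min: float = 150.0,
             k_max: float = 1500.0, p_max: float = 0.1) -> bool:
    if queue_kb <= k_min:
        return False
    if queue_kb >= k_max:
        return True
    p = p_max * (queue_kb - k_min) / (k_max - k_min)
    return random.random() < p
```

Tuning k_min and k_max trades off how early senders are throttled against how much buffer is allowed to fill before PFC would need to intervene.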
Combined with Data Center Quantized Congestion Notification (DCQCN), ECN enables RoCEv2 traffic to adapt dynamically to network conditions, maintaining low latency and high throughput.
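The sketch below captures the shape of DCQCN's sender-side reaction, multiplicative decrease when a CNP arrives and gradual recovery otherwise, with simplified constants; the full algorithm (alpha-decay timers, fast recovery, and hyper-increase stages) is omitted.

```python
# Simplified DCQCN reaction point: the RoCEv2 sender cuts its rate
# multiplicatively when a CNP arrives and recovers toward its prior
# rate when none do. Constants and update cadence are illustrative;
# the real algorithm runs timers and staged recovery phases.

class DcqcnSender:
    def __init__(self, line_rate_gbps: float = 100.0, g: float = 1 / 256):
        self.rate = line_rate_gbps    # current sending rate
        self.target = line_rate_gbps  # rate to recover toward
        self.alpha = 1.0              # congestion estimate in [0, 1]
        self.g = g                    # alpha smoothing gain

    def on_cnp(self) -> None:
        """Congestion Notification Packet received: multiplicative decrease."""
        self.target = self.rate
        self.rate *= (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNPs for an interval: decay alpha, recover toward target."""
        self.alpha *= (1 - self.g)
        self.rate = (self.rate + self.target) / 2
```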
The Take-Away: Synergy of PFC and ECN in Large-Scale GPU Deployments
Together, PFC and ECN provide the robust, low-latency lossless network fabric needed for AI:
- ECN acts first, signaling senders to slow down as queues build, so congestion is handled end to end before buffers fill.
- PFC acts as the safety net, briefly pausing a priority class if buffers approach overflow despite ECN, guaranteeing zero loss.
- Setting ECN marking thresholds below PFC thresholds keeps pauses rare, combining losslessness with high utilization.
Conclusion
In enterprise GPU clusters supporting multiple teams and diverse workloads, DCB enables network administrators to assign different Classes of Service (CoS) and ECN thresholds. This ensures that high-priority inference traffic remains unaffected by large training jobs while guaranteeing throughput for all workloads. Examples include:
- Mapping latency-sensitive RoCEv2 inference traffic to a high-priority lossless class with a low ECN marking threshold, so senders back off before queues grow deep.
- Placing bulk training traffic in a separate lossless class with higher ECN thresholds, letting it fill available bandwidth without crowding out other tenants.
- Leaving storage and management traffic in best-effort classes where occasional loss is acceptable and TCP recovers it.
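As an illustration, such a policy can be thought of as a table mapping traffic classes to queue priority, PFC enablement, and ECN thresholds. The class names and values below are hypothetical; real deployments configure this through the switch operating system, not application code.

```python
from dataclasses import dataclass

# Hypothetical per-class QoS policy. Values are illustrative; actual
# configuration lives in the switch OS (queue maps, PFC priorities,
# WRED/ECN profiles) rather than in application code.

@dataclass
class TrafficClassPolicy:
    cos: int            # 802.1p priority code point (0-7)
    pfc_enabled: bool   # lossless class?
    ecn_kmin_kb: int    # start marking at this queue depth
    ecn_kmax_kb: int    # mark every packet beyond this depth

POLICY = {
    "inference_roce": TrafficClassPolicy(cos=5, pfc_enabled=True,  ecn_kmin_kb=100, ecn_kmax_kb=800),
    "training_roce":  TrafficClassPolicy(cos=3, pfc_enabled=True,  ecn_kmin_kb=150, ecn_kmax_kb=1500),
    "storage_tcp":    TrafficClassPolicy(cos=1, pfc_enabled=False, ecn_kmin_kb=300, ecn_kmax_kb=3000),
}
```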
As AI models continue to scale, network performance becomes as critical as GPU capability. Implementing DCB with PFC and ECN transforms Ethernet into a lossless, congestion-aware fabric that maintains full GPU utilization and keeps AI workloads on schedule. These industry-standard technologies protect GPU investments, support diverse service-level agreements (SLAs), and accelerate AI innovation. Without PFC and ECN, AI networks stall; with them, AI scales.