Data Center Bridging is a Critical OPEN Technology for AI Data Centers

The evolution of artificial intelligence, particularly the rise of large language models (LLMs) and deep learning, has created an immense demand for computational power, driven primarily by GPUs. However, maximizing GPU efficiency requires a robust networking infrastructure that ensures minimal latency and zero packet loss. This post highlights the critical role of Data Center Bridging (DCB), with a focus on its two key components, Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN), in achieving the lossless, high-performance networks that large-scale GPU deployments for AI require.

The Background: Network Challenges in Large GPU Clusters

AI workloads generate significant east-west traffic due to frequent exchanges of parameters and activations among GPUs. This synchronized “incast” traffic can overwhelm standard Ethernet networks, causing buffer overflows and packet drops. Such packet loss negatively impacts AI training and inference by slowing down performance, stalling pipelines, and potentially affecting model convergence.

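To put rough numbers on this, the back-of-the-envelope sketch below estimates how quickly an incast burst can exhaust a switch buffer. All figures (fan-in, link speed, buffer size) are illustrative assumptions, not measurements of any particular switch.

```python
# Back-of-the-envelope incast estimate. All figures are illustrative
# assumptions, not measurements from any particular switch.

senders = 8                       # GPUs sending simultaneously (incast fan-in)
link_gbps = 400                   # per-link rate, Gb/s
egress_gbps = 400                 # capacity of the congested egress port, Gb/s
buffer_mb = 64                    # shared packet buffer, MB

offered_gbps = senders * link_gbps            # 3200 Gb/s offered
excess_gbps = offered_gbps - egress_gbps      # 2800 Gb/s must be buffered
excess_bytes_per_us = excess_gbps * 1e9 / 8 / 1e6   # bytes absorbed per microsecond

fill_time_us = buffer_mb * 1e6 / excess_bytes_per_us
print(f"Buffer exhausted in ~{fill_time_us:.0f} us")   # ~183 us
```
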
When large models span multiple GPUs, inference requires sequential data transfers between GPUs. Network congestion or packet loss stalls the pipeline, increasing latency, reducing throughput, and lowering queries per second (QPS). The lossless design of DCB ensures smooth data flow, safeguarding the return on investment in GPU hardware.

During fine-tuning across hundreds of GPUs, collective communication operations like All-Reduce demand continuous data exchange. Network congestion causes GPUs to idle while waiting for data, wasting compute cycles and increasing costs. 

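To illustrate the volume of traffic involved, the sketch below uses the textbook ring All-Reduce cost, in which each of the N participating GPUs transmits 2 * (N - 1) / N times the gradient payload per step. The model size, GPU count, and line rate are assumed purely for illustration.

```python
# Per-GPU network traffic for one ring All-Reduce step.
# In a ring All-Reduce, each of the N GPUs sends 2 * (N - 1) / N
# times the payload size. Model size, GPU count, and line rate
# below are illustrative assumptions.

num_gpus = 256
gradient_bytes = 14e9            # e.g., a 7B-parameter model in fp16 (~14 GB)

sent_per_gpu = 2 * (num_gpus - 1) / num_gpus * gradient_bytes
print(f"Each GPU sends ~{sent_per_gpu / 1e9:.1f} GB per All-Reduce step")  # ~27.9 GB

# At a 400 Gb/s (50 GB/s) line rate, that transfer alone takes:
line_rate_gbytes = 50
print(f"Best-case wire time: ~{sent_per_gpu / 1e9 / line_rate_gbytes * 1000:.0f} ms")  # ~558 ms
```
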
The Solution: Data Center Bridging (DCB) to Support AI Networks

DCB, standardized in the IEEE 802.1 family (including 802.1Qbb and 802.1Qaz), enhances Ethernet to support low-latency, lossless communication, which is essential for AI workloads. Its core technologies, PFC and ECN, ensure uninterrupted GPU traffic flow: PFC prevents packet loss, while ECN manages flow rates proactively to minimize idle time and maintain high GPU utilization.

DCB includes two key components:

1. Priority-based Flow Control (PFC)

PFC (IEEE 802.1Qbb) refines traditional Ethernet PAUSE frames by selectively pausing traffic based on priority classes rather than halting all traffic. The mechanism works as follows:

  • AI traffic (e.g., RoCEv2) is classified into a high-priority class.
  • When a switch’s buffer for that class fills, it sends a PFC pause frame upstream.
  • Only the high-priority traffic pauses; other traffic continues.
  • Transmission resumes once the buffer clears.

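As a rough mental model of this behavior, the toy Python sketch below mimics per-priority XOFF/XON pause and resume: when one priority class crosses its XOFF threshold the switch pauses only that class, and traffic resumes once the queue drains below XON. The thresholds and queue sizes are illustrative assumptions, not recommended values.

```python
# Toy model of per-priority PFC pause/resume (IEEE 802.1Qbb).
# Thresholds and queue sizes are illustrative assumptions.

XOFF_KB = 512   # pause the class when its queue exceeds this
XON_KB = 256    # resume once the queue drains below this

class PriorityQueueState:
    def __init__(self, priority):
        self.priority = priority
        self.depth_kb = 0
        self.paused = False

    def enqueue(self, kb):
        self.depth_kb += kb
        if not self.paused and self.depth_kb >= XOFF_KB:
            self.paused = True
            print(f"PFC PAUSE sent upstream for priority {self.priority}")

    def dequeue(self, kb):
        self.depth_kb = max(0, self.depth_kb - kb)
        if self.paused and self.depth_kb <= XON_KB:
            self.paused = False
            print(f"PFC RESUME for priority {self.priority}")

# Priority 3 (e.g., RoCEv2) backs up while other classes keep flowing.
roce = PriorityQueueState(priority=3)
for _ in range(6):
    roce.enqueue(128)     # bursty incast fills the queue -> PAUSE at 512 KB
for _ in range(4):
    roce.dequeue(128)     # queue drains -> RESUME at 256 KB
```
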
PFC benefits AI workloads by preventing packet loss, isolating high-priority GPU traffic from other flows, and delivering predictable performance with reduced jitter and improved job completion times. However, PFC alone may cause “pause storms” if not finely tuned, necessitating the complementary use of ECN.

2. Explicit Congestion Notification (ECN)

ECN (RFC 3168) acts as a proactive congestion control mechanism by marking packets instead of dropping them:

  • Switches mark packets with “congestion experienced” when queues reach a threshold.
  • Receivers notify senders to reduce transmission rates.
  • This soft rate-limiting smooths traffic flow, preventing hard stops triggered by PFC.

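A minimal sketch of how such threshold-based marking might look, assuming RED-style minimum and maximum thresholds (the K_MIN_KB, K_MAX_KB, and P_MAX values are invented for illustration): below the minimum nothing is marked, above the maximum every packet is marked Congestion Experienced, and in between the marking probability ramps up linearly.

```python
import random

# Minimal RED-style ECN marking sketch (RFC 3168 semantics).
# K_MIN_KB / K_MAX_KB / P_MAX are assumed, illustrative values.

K_MIN_KB = 100   # below this queue depth, never mark
K_MAX_KB = 400   # above this, mark every ECN-capable packet
P_MAX = 0.2      # marking probability at K_MAX_KB

def maybe_mark_ce(queue_depth_kb: float) -> bool:
    """Return True if the packet should be marked Congestion Experienced."""
    if queue_depth_kb <= K_MIN_KB:
        return False
    if queue_depth_kb >= K_MAX_KB:
        return True
    ramp = (queue_depth_kb - K_MIN_KB) / (K_MAX_KB - K_MIN_KB)
    return random.random() < ramp * P_MAX

for depth in (50, 150, 300, 450):
    print(depth, "KB ->", "CE" if maybe_mark_ce(depth) else "no mark")
```
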
Combined with Data Center Quantized Congestion Notification (DCQCN), ECN enables RoCEv2 traffic to dynamically adapt to network conditions, maintaining low latency and high throughput.

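The sketch below caricatures the sender side of DCQCN: a multiplicative rate cut when a congestion notification packet (CNP) arrives, and gradual recovery during quiet periods. It follows the published algorithm in spirit only; the constants and the simple additive recovery here are assumptions.

```python
# Simplified DCQCN-style sender rate control. The multiplicative
# decrease on CNP arrival mirrors the published algorithm in spirit;
# the constants and the additive recovery are illustrative assumptions.

LINE_RATE_GBPS = 400
G = 1 / 256          # gain used to update the congestion estimate alpha

rate = LINE_RATE_GBPS
alpha = 1.0

def on_cnp_received():
    """Receiver saw CE-marked packets and echoed a CNP: cut the rate."""
    global rate, alpha
    alpha = (1 - G) * alpha + G          # congestion estimate rises
    rate = rate * (1 - alpha / 2)        # multiplicative decrease

def on_quiet_timer():
    """No CNPs for a while: decay alpha and recover toward line rate."""
    global rate, alpha
    alpha = (1 - G) * alpha              # congestion estimate decays
    rate = min(LINE_RATE_GBPS, rate + 0.05 * LINE_RATE_GBPS)  # additive recovery

on_cnp_received()
print(f"After CNP: {rate:.0f} Gb/s")     # sharp cut keeps queues shallow
for _ in range(5):
    on_quiet_timer()
print(f"After recovery: {rate:.0f} Gb/s")
```
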
The Take-Away: Synergy of PFC and ECN in Large-Scale GPU Deployments

Together, PFC and ECN provide the robust, low-latency lossless network fabric needed for AI:

  • ECN serves as the first line of defense by signaling congestion early to reduce traffic rates.
  • PFC acts as a last-resort mechanism to prevent packet loss for critical GPU data.
  • The combination optimizes RoCEv2 traffic for both latency-sensitive “mice flows” (inference) and bulk “elephant flows” (training).
  • It balances workloads effectively, ensuring both high throughput and low latency.

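In practice, this synergy comes from threshold ordering: the ECN marking threshold is configured well below the PFC XOFF threshold, so senders slow down before the switch ever needs to pause a class. The sketch below (with assumed queue depths and thresholds) checks that ordering and shows which mechanism fires first as a queue grows.

```python
# Threshold ordering that makes ECN the first line of defense and
# PFC the last resort. All queue depths and thresholds are illustrative.

ECN_MARK_KB = 150    # start marking CE here (senders back off)
PFC_XOFF_KB = 512    # only if marking fails to tame the queue, pause

assert ECN_MARK_KB < PFC_XOFF_KB, "ECN must trigger before PFC"

def first_responder(queue_depth_kb: int) -> str:
    if queue_depth_kb >= PFC_XOFF_KB:
        return "PFC pause (lossless backstop)"
    if queue_depth_kb >= ECN_MARK_KB:
        return "ECN marking (senders reduce rate)"
    return "no action"

for depth in (100, 200, 600):
    print(f"{depth} KB -> {first_responder(depth)}")
```
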
Conclusion

In enterprise GPU clusters supporting multiple teams and diverse workloads, DCB enables network administrators to assign different Classes of Service (CoS) and ECN thresholds. This ensures that high-priority inference traffic remains unaffected by large training jobs while guaranteeing throughput for all workloads. Examples include:

  • R&D teams requiring consistent throughput for large fine-tuning jobs.
  • Quality control demanding sub-10ms latency and near-zero loss for real-time defect detection.
  • Financial risk analysis needing high throughput for batch inference to meet compliance deadlines.

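As a rough illustration of how such a policy could be expressed, the sketch below maps workload classes to hypothetical CoS values, PFC settings, and ECN thresholds. The class names and every number are invented for illustration; this is not any vendor's configuration syntax.

```python
# Hypothetical per-workload DCB policy table. Class names, CoS values,
# and thresholds are illustrative assumptions, not vendor syntax.

dcb_policy = {
    "realtime-inference": {"cos": 5, "pfc": True,  "ecn_min_kb": 50,  "ecn_max_kb": 200},
    "training-allreduce": {"cos": 3, "pfc": True,  "ecn_min_kb": 150, "ecn_max_kb": 600},
    "batch-inference":    {"cos": 2, "pfc": False, "ecn_min_kb": 300, "ecn_max_kb": 1200},
    "best-effort":        {"cos": 0, "pfc": False, "ecn_min_kb": None, "ecn_max_kb": None},
}

for name, cfg in dcb_policy.items():
    mode = "lossless (PFC)" if cfg["pfc"] else "lossy"
    print(f"{name:20s} CoS {cfg['cos']}  {mode:15s} ECN min {cfg['ecn_min_kb']}")
```
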
As AI models continue to scale, network performance becomes as critical as GPU capability. Implementing DCB with PFC and ECN transforms Ethernet into a lossless, congestion-aware fabric that maintains full GPU utilization and keeps AI workloads on schedule. These industry-standard technologies protect GPU investments, support diverse service level agreements (SLAs), and accelerate AI innovation. Without PFC and ECN, AI networks stall; with them, AI scales effectively.

Author's Bio

Mark Harris

Global Head of Marketing, Edgecore, an Accton Company

Infrastructure industry veteran with vast hands-on experience across all aspects of digital infrastructure. Recognized for building and managing high-performing sales, business development, channel and marketing teams, fostering business growth and delivering impactful programs. Relationship builder. Previously held executive-level positions with Digital, Extreme Networks, Citrix, Vertiv, Legrand, Uptime Institute and NetBrain.