Data Center Fast CNP Congestion Notification Technology White Paper
1 Background
In data center and high-performance network environments, network congestion is a critical issue affecting data transmission efficiency and service quality.
As shown in Figure 1, traditional congestion control requires a network device to detect congestion, mark the ECN field in packets, and forward them to the traffic receiver. After receiving the marked packets, the receiver sends a CNP (Congestion Notification Packet)[1] to the traffic sender through an upper-layer protocol (such as RoCEv2), and the sender reduces its transmission rate when the CNP arrives. This long feedback path can delay feedback by up to half an RTT (Round-Trip Time)[2], so sender servers cannot reduce traffic in time. Buffer occupancy in forwarding devices keeps growing, congestion worsens, and PFC flow control may be triggered, potentially suspending traffic across the network.
Figure 1: Traditional CNP Congestion Feedback Path
To address the slow feedback issue in traditional congestion control mechanisms, the industry has introduced Fast CNP (Fast Congestion Notification Packet) technology. By optimizing congestion marking and feedback paths, this technology significantly improves the real-time responsiveness and effectiveness of network congestion control, making it a core technology for modern data center network optimization.
Figure 2: Fast CNP Congestion Feedback Path
2 Operating Principles
2.1 Basic Concepts
Fast CNP technology incorporates the following concepts:
Table 1: Fast CNP Related Terms and Definitions
Term | Definition |
Flow | A group of packets sharing common attributes (typically IP 5-tuple) |
Flow Table | A collection of information entries recording sender and receiver IP addresses and QP numbers from packets |
Session | A network communication connection established based on the RoCEv2 protocol for data exchange |
2.2 Flow Table Maintenance
Fast CNP technology actively learns information from packets passing through the device and establishes relevant flow tables on switches, thereby obtaining information about traffic senders and receivers. When congestion occurs, the switch directly constructs corresponding CNPs for flows on the congested path and sends them to senders to reduce the transmission rate of relevant flows, achieving rapid congestion feedback. Flow entries in the flow table support aging mechanisms based on either entry capacity or time. When disconnect request packets are detected, corresponding flow entries are removed from the flow table.
2.2.1 Flow Table Establishment
Figure 3: RoCEv2 Session Establishment Process
As shown in Figure 3, when a sender and receiver communicate over the RoCEv2 protocol, session establishment is completed through a four-way CM (Communication Management) message exchange.
By capturing the CM messages, the switch can extract key information, including the source/destination IP addresses and source/destination QP numbers, and determine whether the corresponding RoCEv2 session has been successfully established. When a session is successfully established, the corresponding flow entry is added to the flow table.
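The learning step described above can be sketched as follows. This is a minimal illustration, not the switch implementation: `CmMessage`, `FlowLearner`, and the `step` field are hypothetical names, and the sketch assumes a parser has already correlated each captured CM message with its session and its position in the four-way exchange.

```python
from dataclasses import dataclass

HANDSHAKE_STEPS = 4  # the text describes a four-way CM exchange


@dataclass(frozen=True)
class CmMessage:
    """Hypothetical, already-parsed view of one captured CM message."""
    initiator_ip: str
    responder_ip: str
    initiator_qp: int
    responder_qp: int
    step: int  # position in the exchange, 1..HANDSHAKE_STEPS


class FlowLearner:
    def __init__(self):
        self.progress = {}    # session id -> last handshake step seen
        self.flow_table = {}  # (src_ip, dst_ip, dst_qp) -> QP of the packet source

    def on_cm_message(self, msg):
        """Return True once the session is established and entries are installed."""
        session = (msg.initiator_ip, msg.responder_ip,
                   msg.initiator_qp, msg.responder_qp)
        expected = self.progress.get(session, 0) + 1
        if msg.step != expected:
            # Out-of-order or repeated message: forget the partial handshake.
            self.progress.pop(session, None)
            return False
        if msg.step < HANDSHAKE_STEPS:
            self.progress[session] = msg.step
            return False
        # Final step seen: install one entry per traffic direction.
        self.progress.pop(session, None)
        self.flow_table[(msg.initiator_ip, msg.responder_ip,
                         msg.responder_qp)] = msg.initiator_qp
        self.flow_table[(msg.responder_ip, msg.initiator_ip,
                         msg.initiator_qp)] = msg.responder_qp
        return True
```

Keying each entry on (source IP, destination IP, destination QP) is a design choice made for this sketch: data and ACK packets expose only the destination QP, so this key lets later packets find the entry, while the stored value is the QP a CNP toward the packet source would have to target.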
2.2.2 Flow Table Updates
Figure 4: RoCEv2 Data Interaction Process
After connection establishment, the sender and receiver complete data exchange within the session through RC Send/Write/Read and RC ACK messages.
RC Send/Write/Read messages contain only the receiver’s QP number, while RC ACK messages contain only the sender’s QP number. During data exchange, the switch continuously captures Send and ACK messages in the RoCEv2 flow, extracting source/destination IPs and destination QP numbers. It then queries the flow table and updates the flow expiration time, ensuring that active entries in the flow table do not age out while the switch is carrying service traffic.
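The refresh step can be illustrated with a short sketch. It assumes the flow table is a plain mapping from (source IP, destination IP, destination QP) to an expiration timestamp; the function name and the `idle_timeout` parameter are hypothetical.

```python
def refresh_flow(flow_table, src_ip, dst_ip, dst_qp, now, idle_timeout):
    """Push back the expiration time of a known flow.

    Returns True if the flow was found and refreshed. Unknown flows are
    left alone, because in this sketch entries are only created when a
    session is established, not by data or ACK packets.
    """
    key = (src_ip, dst_ip, dst_qp)
    if key not in flow_table:
        return False
    flow_table[key] = now + idle_timeout
    return True


# Example: a data packet seen at t=100 keeps the entry alive until t=400.
table = {("10.0.0.1", "10.0.0.2", 0x22): 150}
refresh_flow(table, "10.0.0.1", "10.0.0.2", 0x22, now=100, idle_timeout=300)
```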
2.2.3 Flow Table Aging
Figure 5: RoCEv2 Session Disconnection Process
After data exchange is complete, the sender and receiver tear down the session through a two-way CM exchange, as shown in Figure 5.
By capturing RoCEv2 CM messages, the switch can extract source/destination IPs and destination QP numbers, and determine whether the corresponding RoCEv2 session has been disconnected. When a session is disconnected, the flow table is queried and the corresponding flow entry is removed, achieving a session-state-based flow table aging mechanism.
Additionally, aging mechanisms based on entry capacity or time are supported. If the number of flow entries in the flow table reaches the user-configured flow table size, newly added flow entries will replace the least active flow entries in the table, preventing entry resource overflow. When a RoCEv2 session has no data exchange for an extended period and the idle time exceeds the user-configured threshold, the switch considers the session expired and removes the corresponding flow entry from the flow table.
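The three removal paths (session disconnect, capacity, idle time) can be sketched together. This is an illustration under stated assumptions: `AgingFlowTable` is a hypothetical name, and "least active" is interpreted here as "least recently active", which the text does not specify.

```python
from collections import OrderedDict


class AgingFlowTable:
    def __init__(self, max_entries, idle_timeout):
        self.max_entries = max_entries    # user-configured flow table size
        self.idle_timeout = idle_timeout  # user-configured idle threshold
        self.entries = OrderedDict()      # key -> last-active time, oldest first

    def touch(self, key, now):
        """Add a flow, or mark an existing flow as active."""
        if key not in self.entries and len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently active
        self.entries[key] = now
        self.entries.move_to_end(key)

    def remove(self, key):
        """Session-state aging: called when a disconnect is observed."""
        self.entries.pop(key, None)

    def expire(self, now):
        """Time-based aging: drop flows idle longer than the threshold."""
        expired = [k for k, last in self.entries.items()
                   if now - last > self.idle_timeout]
        for k in expired:
            del self.entries[k]
        return expired
```

An ordered mapping keeps eviction of the oldest entry cheap. The idle sweep here scans every entry for simplicity, which a hardware table would likely avoid.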
2.3 Congestion Feedback
2.3.1 Congestion Detection
Through forwarding delay monitoring technology, the switch can capture packets whose forwarding delay exceeds the user-configured threshold and record their forwarding delay. Since forwarding delay is strongly correlated with queue depth, the switch linearly converts the recorded delay value to queue depth and compares it with the real-time available buffer of the queue to confirm whether congestion has occurred. If congestion is detected, a CNP is constructed to notify the sender to reduce its transmission rate.
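A minimal sketch of this check follows. The text does not give the conversion factor or the exact comparison rule, so two assumptions are made here: the linear factor is the egress drain rate of the queue, and congestion is declared when the estimated depth reaches the currently available buffer. All function and parameter names are hypothetical.

```python
def estimate_queue_depth_bytes(delay_ns, drain_rate_gbps):
    """Linear delay-to-depth conversion (assumed factor: egress drain rate).

    One Gbit/s drains one bit per nanosecond, so delay_ns * drain_rate_gbps
    is a bit count; dividing by 8 gives bytes.
    """
    return delay_ns * drain_rate_gbps // 8


def is_congested(delay_ns, delay_threshold_ns, drain_rate_gbps,
                 available_buffer_bytes):
    """Return True if a captured packet indicates congestion."""
    if delay_ns <= delay_threshold_ns:
        return False  # packet would not have been captured at all
    depth = estimate_queue_depth_bytes(delay_ns, drain_rate_gbps)
    return depth >= available_buffer_bytes


# Example: 10 microseconds of delay on a 100 Gbit/s port is about 125 KB queued.
depth = estimate_queue_depth_bytes(delay_ns=10_000, drain_rate_gbps=100)
```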
2.3.2 Congestion Notification
When congestion is confirmed on a particular path, the switch constructs a corresponding CNP for flows on that path. A properly formatted CNP that can be accepted and processed by NICs must contain the following information:
- Source MAC address / Destination MAC address
- Source IP address / Destination IP address / IP-DSCP value
- Destination UDP port
- Opcode, Destination QP number
The source MAC address/destination MAC address, source IP address/destination IP address, and destination QP number can be obtained by querying the flow table. The destination UDP port and CNP Opcode can be determined through RoCEv2 protocol specifications. The IP-DSCP value is associated with endpoint NIC configuration and is typically manually configured by users.
After constructing the appropriate CNP, the switch directly sends it to the sender, enabling timely rate reduction.
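The field list above can be illustrated with a short sketch that assembles the CNP fields from a flow entry. Several points are assumptions rather than statements from this paper: the `FlowEntry` layout and function names are hypothetical, the CNP is addressed to the traffic sender with the receiver's addresses as its source, the default partition key 0xFFFF is used, and the BTH is followed by 16 reserved bytes. The UDP destination port 4791 and the CNP opcode 0x81 come from the RoCEv2 specification. The ICRC trailer and the Ethernet/IP/UDP header encoding are omitted.

```python
import struct
from dataclasses import dataclass

ROCEV2_UDP_PORT = 4791  # RoCEv2 destination UDP port
CNP_OPCODE = 0x81       # BTH opcode for a Congestion Notification Packet


@dataclass
class FlowEntry:
    """Hypothetical flow table entry for one direction of a session."""
    sender_mac: str
    receiver_mac: str
    sender_ip: str
    receiver_ip: str
    sender_qp: int


def build_cnp_bth(dest_qp, pkey=0xFFFF):
    """Pack the 12-byte Base Transport Header of a CNP."""
    word0 = struct.pack("!BBH", CNP_OPCODE, 0, pkey)  # opcode, flags, P_Key
    word1 = struct.pack("!I", dest_qp & 0xFFFFFF)     # reserved byte + 24-bit QP
    word2 = struct.pack("!I", 0)                      # AckReq/reserved + PSN
    return word0 + word1 + word2


def build_cnp_fields(flow, dscp):
    """Collect the fields a switch-generated CNP would carry (sketch only)."""
    return {
        "src_mac": flow.receiver_mac,
        "dst_mac": flow.sender_mac,
        "src_ip": flow.receiver_ip,
        "dst_ip": flow.sender_ip,
        "dscp": dscp,  # user-configured to match the NIC setting
        "udp_dst_port": ROCEV2_UDP_PORT,
        "bth": build_cnp_bth(flow.sender_qp),
        "payload": bytes(16),  # assumed reserved bytes after the BTH
    }
```

The destination QP in the BTH is the sender's QP, because the sender's NIC uses it to identify which flow to slow down.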
3 Typical Application Scenarios
Figure 6: Traditional CNP Feedback Path in High-Bandwidth RoCEv2 Data Center Networks
In high-bandwidth networks, many links and flows transmit simultaneously, and link bandwidth has often grown far faster than the buffer capacity of forwarding devices.
The traditional congestion feedback mechanism (ECN marking by the switch plus CNP feedback from the endpoint device) has a relatively long path. As shown in Figure 6, when multiple servers in POD#1 send traffic to Server65 in POD#2 and congestion occurs at Leaf1, the switch marks packets of the congested flows with ECN. These ECN-marked packets pass through Spine1 and Leaf9 before reaching Server65. Until Server65 has sent CNPs and the servers in POD#1 have reduced their transmission rates, congestion at Leaf1 continues to intensify, potentially causing buffer overflow and triggering PFC flow control.
Figure 7: Fast CNP Feedback Path in High-Bandwidth RoCEv2 Data Center Networks
As shown in Figure 7, after enabling Fast CNP functionality on Leaf1, when congestion occurs at Leaf1, the switch directly sends CNPs to multiple servers in POD#1, effectively shortening the CNP feedback path. This allows senders to reduce transmission rates in time before forwarding device buffers overflow, significantly reducing PFC trigger probability and improving overall network bandwidth utilization while ensuring network-wide traffic stability.
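A rough calculation illustrates the difference between the two paths. The numbers are hypothetical: every link is assumed to add the same one-way latency, NIC and switch processing time is ignored, and the link counts are read off the topology described above (Leaf1 to Spine1 to Leaf9 to Server65 and back to the sender for the traditional path, Leaf1 directly to the sender for Fast CNP).

```python
PER_LINK_NS = 1_000    # hypothetical one-way latency per link
TRADITIONAL_LINKS = 7  # 3 links to Server65, then 4 links back to a sender
FAST_CNP_LINKS = 1     # Leaf1 straight back to the sender


def feedback_delay_ns(link_count, per_link_ns=PER_LINK_NS):
    """Time from congestion detection at Leaf1 until the sender is notified."""
    return link_count * per_link_ns


def bytes_sent_before_notice(delay_ns, rate_gbps, sender_count=1):
    """Traffic the senders keep injecting while the feedback is in transit."""
    return delay_ns * rate_gbps * sender_count // 8


traditional = feedback_delay_ns(TRADITIONAL_LINKS)
fast = feedback_delay_ns(FAST_CNP_LINKS)
extra_traditional = bytes_sent_before_notice(traditional, rate_gbps=100)
extra_fast = bytes_sent_before_notice(fast, rate_gbps=100)
```

Under these assumptions a single 100 Gbit/s sender injects 87,500 bytes before a traditional CNP arrives, compared with 12,500 bytes with Fast CNP. The gap scales with the number of senders.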
[1] CNP (Congestion Notification Packet): A protocol control packet sent by forwarding devices or receivers to notify senders to reduce their transmission rate.
[2] RTT (Round-Trip Time): The total time required for a data packet to travel from sender to receiver and back to sender, serving as a key metric for measuring network latency.