IPT (In-band Path Telemetry) Technology Whitepaper: Network Quality Monitoring for AI Data Centers
1 Overview
With the rapid development of high-performance applications such as AI large model training and distributed computing, AI computing networks face increasingly stringent requirements for end-to-end path quality monitoring. Traditional network monitoring technologies (such as SNMP) are limited by their “pull-only” collection mode and insufficient granularity, making them inadequate for monitoring micro-burst anomalies in network-wide path latency and packet loss.
INT (In-band Network Telemetry) represents the next generation of network quality analysis technology. Through active “push-mode” data collection by network devices, INT achieves millisecond-level data acquisition and precisely captures network anomalies. INT technology encompasses three solutions: BDC (Buffer Drop Capture), HDC (High Delay Capture), and IPT (In-band Path Telemetry). BDC and HDC solutions have been introduced in previous whitepapers and will not be elaborated upon here.
IPT is one of the standard solutions within INT technology. By replicating packets from specific traffic flows and carrying path statistics information, IPT enables precise end-to-end path quality monitoring. IPT technology configures ingress nodes, egress nodes, and transit nodes within a telemetry domain, utilizing an 8-byte Probe Marker to uniquely identify the telemetry domain. It generates probe packets along the original path, collects statistics from each node, and ultimately encapsulates the data for transmission to a collector, providing network operations with multi-dimensional analysis capabilities for network-wide path quality.
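To make the telemetry-domain pieces concrete, the sketch below models one domain (role assignment, Probe Marker, collector) as a plain Python structure. This is purely illustrative: the node names, addresses, and marker value are hypothetical and do not correspond to any device configuration syntax.

```python
# Hypothetical model of one IPT telemetry domain (illustrative only).
telemetry_domain = {
    "probe_marker": 0x0000_1234_5678_9ABC,   # 8-byte Probe Marker identifying the domain
    "ingress_nodes": ["leaf1"],              # generate probe packets from sampled flows
    "transit_nodes": ["spine1", "spine2"],   # append per-hop statistics
    "egress_nodes": ["leaf4"],               # encapsulate and send to the collector
    "collector": {"ip": "192.0.2.10"},       # user-configured collector address
}
```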
The following table compares the three solutions across different dimensions:
Table 1: INT Technology Solution Comparison
1.1 Functional Scenarios
IPT is well-suited for end-to-end path monitoring scenarios in AI computing networks, particularly playing a critical role in the following areas:
- Network-wide Path Quality Analysis: By collecting latency, queue status, and other information from each node, IPT identifies performance bottlenecks along the path.
- Dynamic Path Optimization: Combined with path quality data, IPT assists in adjusting intelligent routing strategies to improve data transmission efficiency.
- Rapid Fault Troubleshooting: Through node information carried in probe packets, IPT quickly pinpoints anomalous nodes or links.
1.2 Basic Concepts
1.2.1 IPT Packet Format
As illustrated in Figure 1, an IPT packet consists of multiple header layers, including outer L2/L3 encapsulation, GRE header, IPT Shim header, Probe Marker, and per-node statistics information fields.
Figure 1: IPT Packet Format
- L2/IPv4
Users specify the outer encapsulation Layer 2 and IPv4 packet headers in the IPT configuration.
- GRE Header
Figure 2 shows the GRE Header packet format, with Table 2 containing descriptions of each field.
Figure 2: GRE Header
Table 2: IPT GRE Header Information
- IPT Shim Header
Figure 3 shows the IPT Shim Header packet format, with Table 3 containing descriptions of each field.
Figure 3: IPT Shim Header
Table 3: IPT Shim Header Information
- IPT Probe Marker
The IPT Probe Marker is a 64-bit user-specified value used to identify IPT packets. The most significant 2 bytes of the IPT Probe Marker must be 0.
Table 4: IPT Probe Marker Information
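As a quick sanity check on this constraint, the small helper below verifies that a candidate 64-bit Probe Marker has its most significant 2 bytes set to 0. The example marker values are hypothetical.

```python
def is_valid_probe_marker(marker: int) -> bool:
    """Check a 64-bit IPT Probe Marker: the most significant 2 bytes must be 0."""
    if not 0 <= marker < 2**64:
        return False             # must fit in 64 bits
    return (marker >> 48) == 0   # top 16 bits (2 bytes) must be zero

# Hypothetical marker values chosen by the operator
assert is_valid_probe_marker(0x0000_1234_5678_9ABC)
assert not is_valid_probe_marker(0xFFFF_0000_0000_0001)
```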
- IPT Base Header
Following the IPT Probe Marker is the IPT Base Header (4 bytes), which is used to identify the version and hop count. Figure 4 shows the IPT Base Header packet format, with Table 5 containing descriptions of each field.
Figure 4: IPT Base Header
Table 5: IPT Base Header Information
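As a minimal parsing sketch, the code below reads the 4-byte Base Header under an assumed layout (version in the first byte, hop count in the second, remaining two bytes reserved). The real bit positions are those defined in Table 5 and may differ.

```python
import struct
from typing import NamedTuple

class IptBaseHeader(NamedTuple):
    version: int
    hop_count: int

def parse_base_header(data: bytes) -> IptBaseHeader:
    """Parse the 4-byte IPT Base Header.

    Assumed layout (hypothetical): byte 0 = version, byte 1 = hop count,
    bytes 2-3 = reserved. Replace with the actual field layout from Table 5.
    """
    version, hop_count, _reserved = struct.unpack("!BBH", data[:4])
    return IptBaseHeader(version, hop_count)
```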
- IPT Hop Information
Between each switching node in the telemetry domain (including ingress and egress nodes), per-hop statistics information is inserted into the transmitted IPT packet. Figure 5 shows the packet format for per-hop information, with Table 6 listing descriptions of each field.
Figure 5: IPT Hop Information Header
Table 6: IPT Hop Information
*Note: Decoding the corresponding real values from raw data depends on Dev Class.
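Because the width and meaning of each per-hop record depend on the reporting device's Dev Class, the sketch below only slices the per-hop region of a probe packet into raw records using a hypothetical fixed record length, leaving field decoding to device-specific logic.

```python
HOP_RECORD_LEN = 16  # hypothetical per-hop record size in bytes; see Table 6

def split_hop_records(hop_region: bytes, hop_count: int) -> list[bytes]:
    """Split the raw per-hop area of a probe packet into one blob per node.

    Decoding the fields inside each blob (latency, queue depth, ...) depends
    on the Dev Class of the reporting device and is not attempted here.
    """
    records = []
    for i in range(hop_count):
        start = i * HOP_RECORD_LEN
        records.append(hop_region[start:start + HOP_RECORD_LEN])
    return records
```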
2 Working Principles
2.1 Workflow
Figure 6: IPT Workflow Diagram
Figure 6 illustrates the overall workflow of IPT: the ingress node generates probe packets, transit nodes collect information, and the egress node encapsulates and sends packets, achieving end-to-end path information collection. Probe packets are clones of original packets (with payload truncated), transmitted along the same path as the original packets, with statistics information inserted at each node, and ultimately sent to a user-configured collector.
2.2 Process Breakdown
2.2.1 Ingress Node
After the IPT function is enabled, the ingress node identifies the traffic flows to be monitored through one of two methods: sampling, or configuring DSCP values to specify the target queues. It replicates the original packet and truncates the payload, then inserts the Probe Marker, Base Header, and ingress-node statistics after the first sixteen bytes of UDP or TCP, generating a probe packet that is forwarded along the original packet’s forwarding path.
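A minimal sketch of this ingress behaviour, in Python rather than switch hardware, is shown below. The marker value, the treatment of the 16-byte insertion offset, and the content of the ingress statistics blob are all assumptions for illustration.

```python
PROBE_MARKER = 0x0000_1234_5678_9ABC  # hypothetical operator-chosen marker
L4_KEEP_BYTES = 16                    # telemetry is inserted after the first 16 bytes of UDP/TCP

def build_probe_packet(original_l4: bytes, base_header: bytes, ingress_stats: bytes) -> bytes:
    """Clone the L4 portion of the original packet, truncate the payload, and
    insert the Probe Marker, Base Header, and ingress-node statistics after
    the first sixteen bytes (simplified: a real device does this in the
    forwarding ASIC and also keeps the outer L2/L3 headers for forwarding)."""
    head = original_l4[:L4_KEEP_BYTES]        # first 16 bytes of UDP or TCP
    marker = PROBE_MARKER.to_bytes(8, "big")  # 8-byte Probe Marker
    return head + marker + base_header + ingress_stats
```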
2.2.2 Transit Node
The transit node identifies probe packets carrying the Probe Marker, collects local node statistics information, and inserts it after the probe packet’s Base Header. The modified probe packet is then forwarded along the original packet’s forwarding path.
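Continuing the same simplified layout used in the ingress sketch above, the code below shows the transit step: match the configured Probe Marker, then splice the local statistics in immediately after the Base Header. Real offsets depend on the actual header sizes.

```python
def transit_insert_stats(probe: bytes, local_stats: bytes, marker: bytes,
                         l4_keep: int = 16, base_header_len: int = 4) -> bytes:
    """If the packet carries the configured Probe Marker, insert this node's
    statistics immediately after the Base Header; otherwise forward unchanged."""
    marker_off = l4_keep                      # marker sits after the first 16 L4 bytes
    if probe[marker_off:marker_off + len(marker)] != marker:
        return probe                          # not an IPT probe packet
    insert_at = marker_off + len(marker) + base_header_len
    return probe[:insert_at] + local_stats + probe[insert_at:]
```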
2.2.3 Egress Node
The egress node identifies probe packets carrying the Probe Marker, collects local node statistics information and inserts it after the probe packet’s Base Header, then adds outer encapsulation. Based on the outer MAC and IP addresses, it performs a forwarding table lookup and forwards the probe packet to the user-configured collector.
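The sketch below illustrates only the final encapsulation step, using scapy to wrap a finished probe packet in outer Ethernet/IPv4/GRE headers addressed to the collector. The addresses and interface name are placeholders, and on a real switch this encapsulation and forwarding-table lookup are performed by the forwarding hardware.

```python
from scapy.all import Ether, IP, GRE, Raw, sendp  # pip install scapy

def send_to_collector(probe: bytes, collector_ip: str, collector_mac: str,
                      egress_ip: str, egress_mac: str, iface: str) -> None:
    """Wrap the finished probe packet in the user-configured outer L2/IPv4/GRE
    headers and forward it toward the collector."""
    outer = (Ether(src=egress_mac, dst=collector_mac) /
             IP(src=egress_ip, dst=collector_ip) /
             GRE() /                 # GRE fields per Table 2 (device-specific)
             Raw(load=probe))
    sendp(outer, iface=iface, verbose=False)
```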
3 Typical Application Scenarios
3.1 End-to-End Path Optimization for AI Computing Networks
In a large model training scenario involving a GPU cluster with over a thousand cards, the cluster relies on a high-performance network for inter-node data synchronization (such as AllReduce operations), so path quality directly impacts training efficiency. IPT technology can optimize path performance in the following areas:
1. End-to-End Path Latency Monitoring
As shown in Figure 7, during the training process, gradient data must be forwarded through multiple Leaf/Spine switches. IPT collects forwarding latency from each node through probe packets and, combined with the total latency from ingress to egress, pinpoints high-latency nodes (such as a Spine switch with abnormally elevated forwarding latency). This assists in adjusting traffic forwarding paths to avoid overall training efficiency degradation caused by single-node delays.
Figure 7: Identifying High-Latency Nodes
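As a simple example of how this analysis can be applied to decoded probe data, the snippet below sums per-node forwarding latencies and flags nodes above a threshold. The node names, latency values, and threshold are illustrative only.

```python
def find_high_latency_hops(per_hop_latency_us: dict[str, float],
                           threshold_us: float = 50.0) -> list[str]:
    """Given per-node forwarding latency decoded from one probe packet,
    return the nodes whose latency exceeds the threshold."""
    total = sum(per_hop_latency_us.values())
    suspects = [node for node, lat in per_hop_latency_us.items() if lat > threshold_us]
    print(f"end-to-end latency: {total:.1f} us, suspect nodes: {suspects}")
    return suspects

# Example: a Spine switch with abnormally high forwarding latency
find_high_latency_hops({"leaf1": 3.2, "spine2": 120.5, "leaf3": 2.9})
```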
2. Dynamic Queue State Awareness
As shown in Figure 8, when multiple GPU servers send data through the same switch port, the egress queue may experience congestion due to traffic surges. IPT probe packets carry information such as queue occupancy size and QP (Queue Pair). Operations personnel can quickly identify congested queues and adjust buffer allocation strategies (such as increasing burst traffic handling capacity) to ensure data synchronization stability.
Figure 8: Multiple GPU Servers Sending Data Through the Same Switch Port
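Following the same idea, the sketch below scans decoded hop reports for egress queues whose occupancy exceeds a threshold. The field names and threshold value are illustrative; the real per-hop fields are those listed in Table 6.

```python
CONGESTION_BYTES = 512 * 1024  # hypothetical occupancy threshold (512 KiB)

def congested_queues(hop_reports: list[dict]) -> list[tuple[str, int, int]]:
    """Return (node, queue id, occupancy in bytes) for every reported egress
    queue whose occupancy exceeds the threshold."""
    return [(r["node"], r["queue"], r["occupancy_bytes"])
            for r in hop_reports
            if r["occupancy_bytes"] > CONGESTION_BYTES]

# Example reports decoded from one probe packet
print(congested_queues([
    {"node": "leaf1", "queue": 3, "occupancy_bytes": 820_000},
    {"node": "spine2", "queue": 3, "occupancy_bytes": 40_000},
]))
```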