
IPT (In-band Path Telemetry) Technology Whitepaper: Network Quality Monitoring for AI Data Centers

1 Overview

With the rapid development of high-performance applications such as AI large model training and distributed computing, AI computing networks face increasingly stringent requirements for end-to-end path quality monitoring. Traditional network monitoring technologies (such as SNMP) are limited by their “pull-only” collection mode and insufficient granularity, making them inadequate for monitoring micro-burst anomalies in network-wide path latency and packet loss.

INT (In-band Network Telemetry) represents the next generation of network quality analysis technology. Through active “push-mode” data collection by network devices, INT achieves millisecond-level data acquisition and precisely captures network anomalies. INT technology encompasses three solutions: BDC (Buffer Drop Capture), HDC (High Delay Capture), and IPT (In-band Path Telemetry). BDC and HDC solutions have been introduced in previous whitepapers and will not be elaborated upon here.

IPT is one of the standard solutions within INT technology. By replicating packets from specific traffic flows and carrying path statistics information, IPT enables precise end-to-end path quality monitoring. IPT technology configures ingress nodes, egress nodes, and transit nodes within a telemetry domain, utilizing an 8-byte Probe Marker to uniquely identify the telemetry domain. It generates probe packets along the original path, collects statistics from each node, and ultimately encapsulates the data for transmission to a collector, providing network operations with multi-dimensional analysis capabilities for network-wide path quality.

The following table compares the three solutions across different dimensions:

| Solution | BDC | HDC | IPT |
| --- | --- | --- | --- |
| Trigger Condition | Queue buffer overflow causing packet drop | Queue forwarding latency reaches the configured threshold | None |
| Telemetry Information | Queue occupancy status | Forwarding latency | Queue depth and forwarding latency |
| Sampling Mechanism | Probabilistic capture, micro-burst capture | Probabilistic capture, micro-burst capture | Probabilistic capture |
| Focus Scenario | Buffer packet drop capture and reporting | High-latency anomaly diagnosis in lossless networks | Problem localization in large-scale networks, full-path quality monitoring |

Table 1: INT Technology Solution Comparison

1.1 Functional Scenarios

IPT is well-suited for end-to-end path monitoring scenarios in AI computing networks, particularly playing a critical role in the following areas:

  • Network-wide Path Quality Analysis: By collecting latency, queue status, and other information from each node, IPT identifies performance bottlenecks along the path.
  • Dynamic Path Optimization: Combined with path quality data, IPT assists in adjusting intelligent routing strategies to improve data transmission efficiency.
  • Rapid Fault Troubleshooting: Through node information carried in probe packets, IPT quickly pinpoints anomalous nodes or links.

1.2 Basic Concepts

1.2.1 IPT Packet Format
As illustrated in Figure 1, an IPT packet consists of multiple header layers, including the outer L2/L3 encapsulation, GRE header, IPT Shim Header, Probe Marker, IPT Base Header, and per-node statistics information fields.


Figure 1: IPT Packet Format

  • L2/IPv4
    Users specify the outer encapsulation Layer 2 and IPv4 packet headers in the IPT configuration.
  • GRE Header
    Figure 2 shows the GRE Header packet format, with Table 2 containing descriptions of each field.
    Figure 2: GRE Header
| Field | Length (bits) | Description |
| --- | --- | --- |
| C | 1 | Flag indicating whether a Checksum is present |
| Reserved | 12 | Reserved bits |
| Version | 3 | Version information |
| Protocol Type | 16 | IPT Shim Header protocol type |

Table 2: IPT GRE Header Information
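
To make the field layout in Table 2 concrete, the following Python sketch packs and unpacks the 4-byte GRE header. The Protocol Type value used here is a placeholder; the actual value that identifies the IPT Shim Header is not specified in this document.

```python
import struct

# Placeholder protocol type for the IPT Shim Header; the real value is an assumption here.
IPT_SHIM_PROTOCOL_TYPE = 0x0000

def pack_gre_header(checksum_present: bool, version: int, protocol_type: int) -> bytes:
    """Pack the 4-byte GRE header from Table 2:
    C (1 bit) | Reserved (12 bits) | Version (3 bits) | Protocol Type (16 bits)."""
    first16 = (int(checksum_present) << 15) | (version & 0x7)  # Reserved bits stay 0
    return struct.pack("!HH", first16, protocol_type & 0xFFFF)

def unpack_gre_header(data: bytes) -> dict:
    first16, protocol_type = struct.unpack("!HH", data[:4])
    return {
        "C": (first16 >> 15) & 0x1,
        "Reserved": (first16 >> 3) & 0xFFF,
        "Version": first16 & 0x7,
        "Protocol Type": protocol_type,
    }

hdr = pack_gre_header(checksum_present=False, version=0, protocol_type=IPT_SHIM_PROTOCOL_TYPE)
print(unpack_gre_header(hdr))
```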

  • IPT Shim Header
Figure 3 shows the IPT Shim Header packet format, with Table 3 containing descriptions of each field.


Figure 3: IPT Shim Header

| Field | Length (bits) | Description |
| --- | --- | --- |
| Next Header | 8 | Indicates the next packet header. For Ethernet II, the value is 3. |
| Length | 4 | Shim Header length (in 4-byte units). For IPT, this value is 4 (i.e., 4 × 4 = 16 bytes). |
| Switch ID | 16 | Identifies the Switch ID of the egress node device |
| Extension Header | 6 | Type of extension header. For IPT, this value is 7. |

Table 3: IPT Shim Header Information
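
As a quick sanity check against Table 3, the sketch below compares already-decoded Shim Header fields with the values an IPT packet is expected to carry. The bit-level extraction of these fields from the 16-byte header is device specific and is not shown here.

```python
# Expected IPT Shim Header field values taken from Table 3.
IPT_SHIM_EXPECTED = {
    "next_header": 3,        # Ethernet II
    "length": 4,             # in 4-byte units, i.e. 16 bytes
    "extension_header": 7,   # IPT extension header type
}

def check_shim_header(fields: dict) -> list:
    """Return a list of mismatches between already-decoded Shim Header fields
    and the values an IPT packet is expected to carry."""
    problems = []
    for name, expected in IPT_SHIM_EXPECTED.items():
        if fields.get(name) != expected:
            problems.append(f"{name}: expected {expected}, got {fields.get(name)}")
    return problems

decoded = {"next_header": 3, "length": 4, "switch_id": 0x0102, "extension_header": 7}
print(check_shim_header(decoded))  # an empty list means the header matches Table 3
```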

  • IPT Probe Marker
The IPT Probe Marker is a 64-bit user-specified value used to identify IPT packets. The most significant 2 bytes of the IPT Probe Marker must be 0.

| Field | Length (bits) | Description |
| --- | --- | --- |
| Probe Marker | 64 | A 64-bit user-specified value used to uniquely identify the telemetry domain |

Table 4: IPT Probe Marker Information
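
The constraint on the Probe Marker (a 64-bit value whose most significant 2 bytes are zero) can be checked with a short helper such as this sketch.

```python
def is_valid_probe_marker(marker: int) -> bool:
    """Check the Probe Marker constraints described above: a 64-bit value whose
    most significant 2 bytes (16 bits) must be zero."""
    return 0 <= marker < (1 << 64) and (marker >> 48) == 0

print(is_valid_probe_marker(0x0000123456789ABC))  # True
print(is_valid_probe_marker(0xFFFF000000000001))  # False: top 16 bits are not zero
```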

  • IPT Base Header
Following the IPT Probe Marker is the IPT Base Header (4 bytes), which is used to identify the version and hop count. Figure 4 shows the IPT Base Header packet format, with Table 5 containing descriptions of each field.

Figure 4: IPT Base Header

| Field | Length (bits) | Description |
| --- | --- | --- |
| Version | 5 | Version of the IPT Base Header |
| Hop Count | 8 | Number of hops for IPT node information |

Table 5: IPT Base Header Information
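
A minimal parsing sketch for the Base Header follows. Table 5 only gives the field widths, so the bit positions assumed here (Version in the top 5 bits, Hop Count immediately after) are illustrative, not authoritative.

```python
import struct

def parse_ipt_base_header(data: bytes) -> dict:
    """Parse the 4-byte IPT Base Header. Table 5 gives only the field widths
    (Version: 5 bits, Hop Count: 8 bits); the bit positions used here are assumptions."""
    (word,) = struct.unpack("!I", data[:4])
    return {
        "version": (word >> 27) & 0x1F,
        "hop_count": (word >> 19) & 0xFF,
    }

# Example: version 1, hop count 3, remaining bits zero (assumed layout).
example = struct.pack("!I", (1 << 27) | (3 << 19))
print(parse_ipt_base_header(example))
```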

  • IPT Hop Information
At each switching node in the telemetry domain (including the ingress and egress nodes), per-hop statistics are inserted into the IPT packet as it is forwarded. Figure 5 shows the packet format of the per-hop information, and Table 6 describes each field.

Figure 5: IPT Hop Information Header

| Field | Length (bits) | Description |
| --- | --- | --- |
| Switch ID | 16 | Switch ID of the node device corresponding to this hop information |
| Dev Class | 6 | Unique encoding identifying the device, used for decoding the information in the packet |
| Queue Size Info* | 20 | Information about queue occupancy size |
| Dinfo 2* | 4 | Egress queue information for the IPT packet forwarded from this hop node |
| Dinfo 1* | 12 | Egress interface information for the IPT packet forwarded from this hop node |
| Egress Timestamp Info* | 20 | Timestamp information for the IPT packet forwarded from this hop node |
| Sinfo* | 12 | Ingress interface information for the IPT packet entering this hop node |
| Ingress Timestamp Info* | 20 | Timestamp information for the IPT packet entering this hop node |

*Note: Decoding the corresponding real values from raw data depends on Dev Class.

Table 6: IPT Hop Information
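
Because decoding the starred fields depends on Dev Class, a collector typically keeps per-device decoding parameters. The sketch below illustrates this with made-up scale factors; the real units and scaling are device specific, as the note above indicates.

```python
# Hypothetical per-Dev-Class decoding parameters (illustrative values only).
DEV_CLASS_DECODE = {
    1: {"queue_cell_bytes": 256, "timestamp_ns_per_tick": 8},
    2: {"queue_cell_bytes": 208, "timestamp_ns_per_tick": 4},
}

def decode_hop_fields(dev_class: int, raw_queue_size: int, raw_timestamp: int) -> dict:
    """Turn raw Queue Size Info / Timestamp Info values into engineering units
    using the (assumed) Dev Class specific parameters above."""
    params = DEV_CLASS_DECODE[dev_class]
    return {
        "queue_bytes": raw_queue_size * params["queue_cell_bytes"],
        "timestamp_ns": raw_timestamp * params["timestamp_ns_per_tick"],
    }

print(decode_hop_fields(dev_class=1, raw_queue_size=120, raw_timestamp=52_431))
```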

2 Working Principles

2.1 Workflow


Figure 6: IPT Workflow Diagram

Figure 6 illustrates the overall workflow of IPT: the ingress node generates probe packets, transit nodes collect information, and the egress node encapsulates and sends packets, achieving end-to-end path information collection. Probe packets are clones of original packets (with payload truncated), transmitted along the same path as the original packets, with statistics information inserted at each node, and ultimately sent to a user-configured collector.

2.2 Process Breakdown

2.2.1 Ingress Node

After the IPT function is enabled, the ingress node identifies specific traffic flows in one of two ways: sampling, or configuring DSCP values to select specific queues. It replicates the original packet, truncates the payload, and inserts the Probe Marker, Base Header, and ingress-node statistics after the first 16 bytes of the UDP or TCP portion, generating a probe packet that is forwarded along the original packet's forwarding path.
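
A simplified sketch of this ingress-node behavior follows, with the marker value, offsets, and header contents as hypothetical placeholders and the inserted fields treated as opaque byte strings.

```python
# Hypothetical Probe Marker for this telemetry domain (top 2 bytes must be zero).
PROBE_MARKER = (0x0000123456789ABC).to_bytes(8, "big")

def build_probe_packet(original_packet: bytes, l4_offset: int,
                       base_header: bytes, ingress_hop_info: bytes) -> bytes:
    """Clone the original packet, keep its headers plus the first 16 bytes of the
    UDP/TCP portion (payload truncated), then insert Probe Marker + Base Header +
    ingress-node statistics after those 16 bytes."""
    keep = original_packet[: l4_offset + 16]
    return keep + PROBE_MARKER + base_header + ingress_hop_info

# Usage with made-up byte strings: 14-byte L2 + 20-byte IPv4 headers, then L4 data.
original = bytes(14) + bytes(20) + bytes(64)
probe = build_probe_packet(original, l4_offset=34, base_header=bytes(4), ingress_hop_info=bytes(14))
print(len(probe))  # 34 + 16 + 8 + 4 + 14 = 76 bytes
```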

2.2.2 Transit Node

The transit node identifies probe packets carrying the Probe Marker, collects local node statistics information, and inserts it after the probe packet’s Base Header. The modified probe packet is then forwarded along the original packet’s forwarding path.
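
A corresponding sketch of the transit-node step; the offsets are illustrative, and updating the Hop Count in the Base Header is an assumption not stated above.

```python
PROBE_MARKER = (0x0000123456789ABC).to_bytes(8, "big")  # same hypothetical marker as above

def transit_process(packet: bytes, marker_offset: int, local_hop_info: bytes) -> bytes:
    """If the packet carries the Probe Marker, insert this node's statistics right
    after the 4-byte Base Header and forward it; otherwise forward it unchanged.
    A real implementation would also update the Hop Count in the Base Header."""
    marker_end = marker_offset + 8       # Probe Marker is 8 bytes
    base_header_end = marker_end + 4     # Base Header is 4 bytes
    if packet[marker_offset:marker_end] != PROBE_MARKER:
        return packet
    return packet[:base_header_end] + local_hop_info + packet[base_header_end:]
```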

2.2.3 Egress Node

The egress node identifies probe packets carrying the Probe Marker, collects local node statistics information and inserts it after the probe packet’s Base Header, then adds outer encapsulation. Based on the outer MAC and IP addresses, it performs a forwarding table lookup and forwards the probe packet to the user-configured collector.
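
And a sketch of the egress-node step, where the byte strings stand in for the user-configured outer L2/IPv4, GRE, and IPT Shim headers.

```python
def egress_encapsulate(probe_with_local_stats: bytes, outer_l2: bytes,
                       outer_ipv4: bytes, gre_header: bytes, shim_header: bytes) -> bytes:
    """After inserting its own statistics (as in the transit step), the egress node
    prepends the outer encapsulation (L2/IPv4 + GRE + IPT Shim Header) and forwards
    the result toward the collector based on the outer MAC/IP addresses.
    All header byte strings here are placeholders."""
    return outer_l2 + outer_ipv4 + gre_header + shim_header + probe_with_local_stats
```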

3 Typical Application Scenarios

3.1 End-to-End Path Optimization for AI Computing Networks

In a large model training scenario involving a GPU cluster with over a thousand cards, the cluster relies on a high-performance network to achieve inter-node data synchronization (such as All Reduce operations). Path quality directly impacts training efficiency. IPT technology can optimize path performance in the following areas:

1. End-to-End Path Latency Monitoring

As shown in Figure 7, during the training process, gradient data must be forwarded through multiple Leaf/Spine switches. IPT collects forwarding latency from each node through probe packets and, combined with the total latency from ingress to egress, pinpoints high-latency nodes (such as a Spine switch with abnormally elevated forwarding latency). This assists in adjusting traffic forwarding paths to avoid overall training efficiency degradation caused by single-node delays.


Figure 7: Identifying High-Latency Nodes
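
As a rough illustration of how a collector could use the per-hop data in this scenario, the sketch below flags a node whose residence time stands out. All node names, latency values, and the threshold are made up, and link propagation delay is ignored.

```python
# Per-node residence time (egress timestamp - ingress timestamp) from one probe.
hop_latency_ns = {
    "leaf-1": 1_800,
    "spine-2": 142_000,      # abnormally high forwarding latency
    "leaf-3": 2_100,
}
total_node_latency_ns = sum(hop_latency_ns.values())

threshold_ns = 50_000        # assumed alerting threshold
slow_nodes = [n for n, lat in hop_latency_ns.items() if lat > threshold_ns]
print(f"sum of node residence times: {total_node_latency_ns} ns, high-latency nodes: {slow_nodes}")
```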

2. Dynamic Queue State Awareness

As shown in Figure 8, when multiple GPU servers send data through the same switch port, the egress queue may experience congestion due to traffic surges. IPT probe packets carry information such as queue occupancy size and QP (Queue Pair). Operations personnel can quickly identify congested queues and adjust buffer allocation strategies (such as increasing burst traffic handling capacity) to ensure data synchronization stability.


Figure 8: Multiple GPU Servers Sending Data Through the Same Switch Port
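
A similar sketch for queue-state awareness, flagging queues whose reported occupancy exceeds an assumed threshold; the switch, port, and queue names and all numbers are illustrative only.

```python
# Decoded queue occupancy reported by IPT probes, keyed by (switch, port, queue).
queue_occupancy_bytes = {
    ("spine-1", "port-49", "queue-3"): 18_432,
    ("spine-1", "port-49", "queue-5"): 3_145_728,   # bursty traffic piling up
    ("leaf-2",  "port-12", "queue-3"): 9_216,
}
congestion_threshold_bytes = 1_048_576   # assumed 1 MiB alert threshold

congested = [k for k, occ in queue_occupancy_bytes.items() if occ > congestion_threshold_bytes]
for switch, port, queue in congested:
    print(f"congestion on {switch} {port} {queue}: consider adjusting buffer allocation")
```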