Why Deploy IS-IS on SONiC with BFD and FRR?

April 15, 2026

Introduction

For IS-IS on SONiC, we have extended the integration of IS-IS with BFD and FRR. This turns a network that merely operates into one that runs in a stable and resilient manner: it reduces fault detection from seconds to milliseconds, and it decouples service failover speed from SPF convergence time.

BFD and IS-IS on SONiC

Integrating BFD with IS-IS on SONiC is primarily used to accelerate link failure detection. It addresses the main limitation of IS-IS hello-based neighbor detection, which is inherently slow because its timers operate at the scale of seconds.

Limitations of Native IS-IS

IS-IS relies on Hello packets (default interval: 3 seconds on CX-M series switches) to detect neighbor liveness. In case of failures, detection can take several seconds or even tens of seconds depending on timers and hold-down behavior.

This delays overall network convergence, increases traffic loss duration, and has a significant impact in backbone or high-availability environments.

BFD Enables Sub-second Failure Detection for IS-IS

BFD (Bidirectional Forwarding Detection) is an independent protocol designed for fast failure detection. It exchanges control packets at millisecond-level intervals (minimum 10 ms) to quickly identify forwarding path failures.

When IS-IS on SONiC is integrated with BFD, BFD immediately notifies IS-IS when a fault is detected. IS-IS then brings down the adjacency, triggers LSP flooding, and initiates SPF recalculation.

Note: Hardware-offloaded BFD for IS-IS is not supported at present. Once supported, CPU overhead will be significantly reduced.

Result of BFD and IS-IS on SONiC

Failure detection time can be reduced from the seconds scale of hello-based IS-IS detection (3-second default hello interval) to the millisecond scale.

For example, with Detection Multiplier = 3 and Transmission Interval = 10 ms, link failure detection can be achieved in approximately 30 ms.
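The arithmetic behind that figure can be sketched in a few lines. This is a minimal illustration, assuming detection time is simply the detection multiplier times the transmission interval and that both ends use the same interval; the function name bfd_detection_time_ms is hypothetical and is not part of any SONiC or FRR API.

```python
def bfd_detection_time_ms(tx_interval_ms: int, detect_multiplier: int) -> int:
    """Approximate BFD detection time as multiplier x transmission interval.

    Simplified model: assumes both peers use the same interval and ignores
    jitter and processing delay.
    """
    return tx_interval_ms * detect_multiplier


# Values from the example above: multiplier 3, interval 10 ms.
print(bfd_detection_time_ms(10, 3))  # 30
```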

Combined with FRR, faster detection further reduces the end-to-end recovery time, measured from fault detection to full network reconvergence.

This approach is suitable for scenarios such as fiber cut events, MSTP transit links, and congested links where IS-IS detection alone may be insufficient.

BFD is not a universal solution.

Under severe congestion, control packets may be dropped, which can lead to false failure detection.

Aggressive parameter tuning also introduces risk. On CX-M series switches, the transmission interval typically ranges from 10 ms to 60000 ms, and the detection multiplier ranges from 2 to 255. A very short transmission interval increases CPU overhead, because more control packets must be sent and processed. If the multiplier is also set too low (e.g., 2) in an unstable network, transient jitter or a few lost packets may cause BFD session oscillations.
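The trade-off can be made concrete with a small helper. This is only a sketch: the ranges are the ones quoted above for CX-M series switches, while the function name check_bfd_params and the returned fields are hypothetical, and the "tolerated losses" figure is a rough model (multiplier minus one) rather than a vendor specification.

```python
# Ranges quoted in the text for CX-M series switches.
TX_INTERVAL_RANGE_MS = (10, 60000)
DETECT_MULT_RANGE = (2, 255)


def check_bfd_params(tx_interval_ms: int, detect_multiplier: int) -> dict:
    """Validate BFD timers and summarize what they imply (rough model)."""
    lo, hi = TX_INTERVAL_RANGE_MS
    if not lo <= tx_interval_ms <= hi:
        raise ValueError(f"tx interval must be {lo}..{hi} ms")
    lo, hi = DETECT_MULT_RANGE
    if not lo <= detect_multiplier <= hi:
        raise ValueError(f"detection multiplier must be {lo}..{hi}")
    return {
        # Time without a received packet before the session is declared down.
        "detection_time_ms": tx_interval_ms * detect_multiplier,
        # Roughly how many back-to-back lost packets the session survives.
        "tolerated_consecutive_losses": detect_multiplier - 1,
        # Control packets sent per second per session (drives CPU load).
        "packets_per_second": 1000 / tx_interval_ms,
    }


print(check_bfd_params(10, 2))
print(check_bfd_params(100, 3))
```

With an interval of 10 ms and a multiplier of 2, the session sends 100 packets per second and survives only one lost packet, which is why such settings are fragile on jittery links.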

FRR and IS-IS on SONiC (LFA & RLFA)

FRR (Fast Reroute) support in IS-IS on SONiC, including LFA (Loop-Free Alternate) and RLFA (Remote LFA), is designed to address slow traditional convergence. It allows traffic to bypass SPF computation during failures and switch to precomputed backup paths, reducing packet loss and service disruption.

Limitations of Traditional Convergence

When an IS-IS link or node failure occurs, the protocol must flood LSP updates and recompute the SPF tree. This process typically takes hundreds of milliseconds to several seconds.

During this period, a significant amount of traffic is dropped, which severely impacts latency-sensitive services such as video, VoIP, and real-time applications.

How FRR Addresses the Problem

Fast Reroute (FRR) precomputes backup next hops before failures occur. The PLR (Point of Local Repair, the upstream node adjacent to the failure) installs Loop-Free Alternate (LFA) or Remote LFA (RLFA) paths in advance.

When a failure is detected, traffic is immediately switched to the backup path at the PLR, without waiting for SPF recalculation. This effectively decouples traffic forwarding from control-plane convergence, minimizing service impact during network events.
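The idea of a preinstalled backup next hop can be sketched as a simple data structure. This is an illustrative model only, not how SONiC or FRR store routes; the names RouteEntry, active_next_hop, and on_next_hop_down, and the sample prefixes and next hops, are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RouteEntry:
    prefix: str
    primary: str                   # primary next hop
    backup: Optional[str] = None   # precomputed LFA/RLFA next hop, if any
    primary_up: bool = True

    def active_next_hop(self) -> Optional[str]:
        # Forwarding uses the backup as soon as the primary is marked down,
        # with no SPF run in between.
        return self.primary if self.primary_up else self.backup


def on_next_hop_down(table: List[RouteEntry], failed: str) -> None:
    """Local repair at the PLR: flip every route using the failed next hop."""
    for entry in table:
        if entry.primary == failed:
            entry.primary_up = False


table = [
    RouteEntry("10.0.1.0/24", primary="E", backup="A"),
    RouteEntry("10.0.2.0/24", primary="A"),
]
on_next_hop_down(table, "E")   # e.g. triggered by a BFD down event
print([r.active_next_hop() for r in table])  # ['A', 'A']
```

The switchover touches only local state, which is why it does not depend on flooding or SPF completing elsewhere in the network.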

Assume a simple topology with 20 nodes connected in series, where traffic is forwarded hop-by-hop across these nodes.

In a pure IS-IS SPF-based convergence model, when a failure occurs, LSP flooding propagates hop by hop across the network. If we assume an average propagation interval of 3 ms per node, the total propagation time across 20 nodes would be approximately 20 × 3 ms = 60 ms.

After all nodes receive the updated LSPs, each device performs SPF computation. Assuming a simplified SPF calculation time of 15 ms, the total convergence time becomes roughly 75 ms.
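The estimate can be reproduced with a short calculation. This is a sketch of the simplified model only, taking the per-hop propagation interval as 3 ms (the value that yields the 60 ms total) and assuming all nodes run SPF in parallel once flooding completes; the function name spf_convergence_ms is hypothetical.

```python
def spf_convergence_ms(nodes: int, per_hop_flood_ms: float, spf_ms: float) -> float:
    """Simplified model: serial hop-by-hop flooding, then one SPF run."""
    flooding_ms = nodes * per_hop_flood_ms
    return flooding_ms + spf_ms


# 20 nodes in series, 3 ms per hop, 15 ms SPF computation.
print(spf_convergence_ms(20, 3, 15))  # 75
```

In this model the flooding term grows with the length of the chain, so doubling the number of nodes to 40 raises the estimate to 135 ms.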

In contrast, with FRR enabled, the PLR (Point of Local Repair) precomputes loop-free backup paths and installs them into the RIB in advance. Once a failure is detected, traffic is immediately switched to the precomputed backup next hop at the PLR. This eliminates the need to wait for full network-wide LSP flooding and SPF convergence, effectively avoiding the 75 ms convergence delay for traffic switchover.

In practice, SPF computation time does not scale linearly with network size. As the network grows, convergence based solely on SPF becomes increasingly slow due to the larger LSDB and more complex topology computation. With FRR, however, the traffic switchover time remains independent of network scale, because it is determined by locally precomputed repair paths rather than by global convergence.

Why the LFA + RLFA Combination?

We choose to support the LFA + RLFA combination because it provides broad coverage of most network scenarios with minimal complexity, enabling efficient FRR protection.

LFA (Loop-Free Alternate): The PLR selects a directly connected neighbor as the backup next hop. The computation is simple, performed locally, and does not require tunnels.
RLFA (Remote LFA): When no valid LFA exists, RLFA is used. It selects a remote PQ node that lies in both the P-space and the Q-space. Traffic is forwarded through an LDP tunnel to the PQ node, which then continues forwarding toward the destination. This covers cases where topology constraints leave the PLR without a usable LFA.

Note: P-space refers to the set of nodes reachable from the PLR without traversing the failed element. Q-space refers to the set of nodes that can reach the destination while avoiding the failure.
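The LFA check and the PQ-node selection can be illustrated on a toy topology. The sketch below is not the AsterNOS or FRR implementation: it assumes link protection only, symmetric link costs, the basic loop-free inequality from RFC 5286, and the basic (non-extended) P-space and Q-space from RFC 7490. The helper names (ring, find_lfas, find_pq_nodes) and node labels are hypothetical.

```python
import heapq


def dijkstra(graph, src):
    """Shortest-path cost from src to every reachable node."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def ring(nodes, cost=1):
    """Build a ring topology with equal, symmetric link costs."""
    graph = {n: {} for n in nodes}
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        graph[a][b] = cost
        graph[b][a] = cost
    return graph


def find_lfas(graph, s, e, dest):
    """Neighbors N of PLR s (other than primary next hop e) that satisfy
    dist(N, dest) < dist(N, s) + dist(s, dest)."""
    dist = {n: dijkstra(graph, n) for n in graph}
    return [n for n in graph[s]
            if n != e and dist[n][dest] < dist[n][s] + dist[s][dest]]


def find_pq_nodes(graph, s, e):
    """PQ nodes for protecting link s-e."""
    dist = {n: dijkstra(graph, n) for n in graph}
    link = graph[s][e]
    candidates = [n for n in graph if n not in (s, e)]
    # P-space: s reaches n on shortest paths that never use link s-e.
    p_space = {n for n in candidates if dist[s][n] < link + dist[e][n]}
    # Q-space: n reaches e on shortest paths that never use link s-e.
    q_space = {n for n in candidates if dist[n][e] < dist[n][s] + link}
    return sorted(p_space & q_space)


five_ring = ring(["S", "E", "C", "B", "A"])
print(find_lfas(five_ring, "S", "E", "E"))   # [] : no plain LFA in the ring
print(find_pq_nodes(five_ring, "S", "E"))    # ['B'] : RLFA tunnels to B

triangle = ring(["S", "E", "A"])
print(find_lfas(triangle, "S", "E", "E"))    # ['A'] : a plain LFA suffices
```

In the five-node ring, neighbor A would send traffic for E back through S, so it fails the LFA inequality. Node B is reachable from S without the protected link and reaches E without it, which makes it the PQ node.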

Our current implementation strategy uses LFA and RLFA to cover the majority of protection cases, achieving a simplified and practical deployment model.

Conclusion

IS-IS on SONiC combined with BFD provides millisecond-level failure detection. Once a fault is detected, FRR leverages preinstalled backup routes in the device RIB to perform fast traffic switchover.

Deploying BFD and FRR with IS-IS on SONiC forms a complete fast protection chain for traffic forwarding, enabling rapid recovery with minimal service impact.
