A Detailed Checklist for Troubleshooting Link Failures in AI Data Centers

Introduction

This guide provides a structured approach to troubleshooting network link failures in AI data centers, specifically cases where the link cannot come up. It walks you from physical-layer validation through device configuration to help isolate the root cause efficiently.

Follow the steps below to verify fiber connectivity, optical power, and protocol settings.

1. Verify Fiber and Transceiver Compatibility

  • Verification Guidelines:
    • Ensure the local and peer optical transceivers use the same fiber transmission mode (single-mode or multi-mode), and that the fiber cable matches that mode. Use a fiber cable whose length is appropriate for the transceiver type.
  • Notes:
    • Single-mode transceivers are typically marked with LR on the label. Multi-mode transceivers are typically marked with SR.
    • Single-mode fiber is usually labeled as SMF, and the jacket color is typically yellow. Multi-mode fiber is usually labeled as MMF, and the jacket color is typically orange or aqua.
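
The labeling notes above can be turned into a quick sanity check. The following Python sketch is only an illustration: the function names `fiber_mode` and `modes_match` and the sample label strings are hypothetical, and the substring heuristic covers only the LR/SR and SMF/MMF markings mentioned in the notes, so other transceiver types are reported as unknown.

```python
def fiber_mode(label: str) -> str:
    """Guess the fiber mode from a transceiver or cable label.

    Heuristic based on the notes above: LR or SMF suggests single-mode,
    SR or MMF suggests multi-mode. Anything else is reported as unknown.
    """
    text = label.upper()
    if "SMF" in text or "LR" in text:
        return "single-mode"
    if "MMF" in text or "SR" in text:
        return "multi-mode"
    return "unknown"


def modes_match(local_label: str, peer_label: str, cable_label: str) -> bool:
    """Return True only if both transceivers and the cable share one known mode."""
    modes = {fiber_mode(local_label), fiber_mode(peer_label), fiber_mode(cable_label)}
    return len(modes) == 1 and "unknown" not in modes


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    print(modes_match("QSFP28-100G-SR4", "QSFP28-100G-SR4", "OM4 MMF"))
    print(modes_match("QSFP28-100G-SR4", "QSFP28-100G-LR4", "OS2 SMF"))
```

A label that the heuristic cannot classify makes `modes_match` return False, which is the safer outcome: check the module datasheet by hand in that case.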

2. Verify Fiber Cable Status

Verification Guidelines:

  • Inspect the fiber for visible physical damage. Ensure the sheath is intact and the fiber is not kinked, twisted, or tightly coiled.
  • Check the fiber connector. Ensure it is intact, the latch is secure, and the end-face is clean and free of dust.
  • Use a known-good fiber cable and connect it to a verified working NIC or switch port for a loopback test. If the link comes up, the original external fiber is likely faulty and should be replaced.

3. Verify Optical Transceiver Status

  • Verification Guidelines:
    • Inspect the optical transceiver gold fingers. Ensure the surface is clean and free of dust.
    • Use a previously verified working transceiver and connect it to a NIC or switch port for a loopback test. If the link comes up, the original transceiver is likely faulty and should be replaced.
  • Notes:
    • ⚠️ Recommendation: Use optical transceivers from the same vendor on both ends to avoid compatibility issues.

4. Verify Optical Power Status

Verification Guidelines:

  • On the switch, run show interface transceiver ethernet interface_num eeprom detail and verify that the values under MonitorData are displayed correctly and remain within the ThresholdData range.
  • On the server, run ethtool -m <NIC name> (for Mellanox NICs, use mlxlink -d <NIC ID> -m). Verify that the Laser Output Power and Receiver Signal Average Optical Power values are within the defined threshold range.

If the optical power on either the switch or the server is out of range, or if one lane is noticeably weaker than the others even though it is still within range, perform the following checks:

  • Verify the fiber connector is properly seated in the transceiver with no looseness.
  • Ensure the fiber is fully inserted into the port.
  • Verify TX/RX polarity is correct on both ends. TX of module A should connect to RX of module B.
  • Replace the optical transceiver if the issue persists.
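
The threshold comparison described in this step can be sketched in Python. This is a minimal illustration and not part of the switch or NIC tooling: the function name `check_lane_power`, the input format (per-lane power values in dBm that you have already read from the command output), and the 3 dB spread used to flag a "weak" lane are all assumptions made for the example. Always use the threshold values reported by the transceiver itself.

```python
def check_lane_power(readings_dbm, low_dbm, high_dbm, max_spread_db=3.0):
    """Return a list of findings for per-lane optical power readings.

    readings_dbm  -- power per lane in dBm, lane 1 first
    low_dbm/high_dbm -- threshold range reported by the transceiver
    max_spread_db -- ASSUMED margin: an in-range lane this much weaker than
                     the strongest lane is flagged as suspicious
    """
    findings = []
    if not readings_dbm:
        return findings

    strongest = max(readings_dbm)
    for lane, value in enumerate(readings_dbm, start=1):
        if not (low_dbm <= value <= high_dbm):
            findings.append(
                f"lane {lane}: {value} dBm is outside [{low_dbm}, {high_dbm}] dBm"
            )
        elif strongest - value > max_spread_db:
            findings.append(
                f"lane {lane}: {value} dBm is in range but much weaker than the strongest lane"
            )
    return findings


if __name__ == "__main__":
    # Hypothetical four-lane RX readings and thresholds, for illustration only.
    for finding in check_lane_power([-1.0, -6.0, -0.9, -1.1], -10.0, 3.0):
        print(finding)
```

An empty result means every lane is within the thresholds and close to its neighbors. Any finding points back to the seating, insertion, polarity, and replacement checks listed above.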

5. Optional – Verify SNR and BER Status

This step is only supported on the CX684E-N.

  • Verification Guidelines:
    • On the switch, run show interface summary to obtain the interface devport ID. The first port number in the "Lanes" column is the devport ID.
    • Check BER by running ivmcmd "ifcs show devport <devport_id>" | grep fec_ber.
    • Check SNR by running ivmcmd "ifcs show devport <devport_id>" | grep snr.
  • Notes:
    • Bit Error Rate (BER) is the ratio of errored bits to total transmitted bits. Lower values indicate higher transmission accuracy. BER should stay below 1e-9. A BER at or above this level (for example, 1e-8) may cause packet loss or bring the link down.
    • Signal-to-Noise Ratio (SNR) is the ratio of useful signal power to noise power, expressed in dB. Higher values indicate a cleaner signal and stronger resistance to interference. SNR should be above 20 dB. Values below this threshold may cause link flapping, instability, or link down.
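
The two thresholds in the notes can be applied mechanically once the values have been read from the command output. The Python sketch below assumes you have already extracted the BER and SNR as numbers; the function name `signal_quality_issues` and the constant names are hypothetical, and the sketch does not parse the ivmcmd output.

```python
BER_LIMIT = 1e-9     # BER should stay below this value (from the notes above)
SNR_MIN_DB = 20.0    # SNR should stay above this value, in dB


def signal_quality_issues(fec_ber: float, snr_db: float) -> list:
    """Compare a BER and SNR reading against the thresholds in this guide."""
    issues = []
    if fec_ber >= BER_LIMIT:
        issues.append(f"BER {fec_ber:.1e} is not below {BER_LIMIT:.0e}")
    if snr_db <= SNR_MIN_DB:
        issues.append(f"SNR {snr_db} dB is not above {SNR_MIN_DB} dB")
    return issues


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    print(signal_quality_issues(1e-12, 25.0))
    print(signal_quality_issues(1e-8, 18.5))
```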

6. Verify Port Administrative Status

Verification Guidelines:

  • Run show interface summary to check the administrative status of the port.
  • Run show monitor-link to verify whether the port is a downstream port in a Monitor-Link group, and whether it is down due to the upstream port being down.
  • Run show interface errdown to check whether the port is in errdown state caused by MC-LAG down status.

7. Verify Speed, FEC, and Auto-Negotiation Settings

  • Verification Guidelines:
    1. Verify that the speed configuration is consistent on both ends.
    2. Run show interface summary or show running-config to check whether interface breakout is configured. Mismatched breakout modes between the two ends will prevent the link from coming up.
    3. Verify auto-negotiation status on both ends:
      • If auto-negotiation is disabled on both ends, ensure speed and FEC configurations match.
      • If it is enabled on one end and disabled on the other, disable auto-negotiation on both ends.
      • If it is enabled on both ends, ensure the advertised speed modes match (on Mellanox NICs, use ethtool <NIC name> to verify). If the link still does not come up, disable auto-negotiation on both ends and manually configure speed and FEC.
  • Notes:
    • Auto-negotiation: Disabled by default on switch ports. NICs typically have auto-negotiation enabled by default. For interconnects of 10G and above, whether switch-to-switch or switch-to-NIC, it is recommended to disable auto-negotiation on both ends.
    • FEC mode: For 25G and higher-speed interfaces, RS-FEC is recommended.
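
The auto-negotiation cases in step 3 form a small decision table, sketched below in Python. The `PortSettings` class, its field names, and the `autoneg_advice` function are hypothetical names introduced for this example; the values would be filled in by hand from the switch and NIC configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortSettings:
    speed: str      # configured speed, e.g. "100G"
    fec: str        # configured FEC mode, e.g. "rs"
    autoneg: bool   # True if auto-negotiation is enabled


def autoneg_advice(local: PortSettings, peer: PortSettings) -> List[str]:
    """Map the three auto-negotiation cases in this step to the suggested action."""
    advice = []
    if local.autoneg != peer.autoneg:
        # Enabled on one end only.
        advice.append("Auto-negotiation differs: disable it on both ends.")
    elif not local.autoneg:
        # Disabled on both ends: speed and FEC must match.
        if local.speed != peer.speed:
            advice.append(f"Speed mismatch: {local.speed} vs {peer.speed}.")
        if local.fec != peer.fec:
            advice.append(f"FEC mismatch: {local.fec} vs {peer.fec}.")
    else:
        # Enabled on both ends.
        advice.append(
            "Confirm the advertised speed modes match; if the link stays down, "
            "disable auto-negotiation on both ends and set speed and FEC manually."
        )
    return advice


if __name__ == "__main__":
    switch_port = PortSettings(speed="100G", fec="rs", autoneg=False)
    nic_port = PortSettings(speed="100G", fec="rs", autoneg=True)
    print(autoneg_advice(switch_port, nic_port))
```

An empty list means auto-negotiation is off on both ends and the speed and FEC settings already agree, so the fault is more likely in an earlier step.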

8. Contact Technical Support Team

If the link still cannot come up, submit a support case and include the following device, interface, and transceiver information.

Required information:

  • Record the switch model, software version, and optical transceiver model.
  • Record the interface number that cannot come up, its configuration (show interface summary and show running-config, including speed, FEC mode, and auto-negotiation status), and the interface hardware status.
    • For Falcon chip-based switches (CX732Q-N-V2, CX532P-N-V2, CX308P-48Y-N-V2), run mrvlcmd -c "show interfaces status all".
    • For Teralynx chip-based switches (such as CX732Q-N), run ivmcmd "ifcs show devport" and ivmcmd "ifcs show devport all".
  • Record the peer device type and model (for switch-to-switch, record the remote switch model; for server connections, record the NIC model).
  • On the server side, collect NIC model and driver information using the following commands:
    • ethtool <NIC name>
    • ethtool -i <NIC name>
    • ethtool -m <NIC name>
    • mlxlink -d <NIC ID> -m
    • mlxlink -d <NIC ID> -e
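
To avoid missing one of the server-side commands when preparing a support case, the list above can be generated for a given NIC. The Python sketch below only builds and prints the command lines and does not execute anything. The function name `server_collection_commands` and the sample values `eth0` and `mlx5_0` are placeholders, not values from this guide.

```python
import shlex


def server_collection_commands(nic_name, mlx_device=None):
    """Build the server-side collection commands listed above.

    nic_name   -- Linux interface name passed to ethtool
    mlx_device -- optional device ID passed to mlxlink -d (Mellanox NICs only)
    """
    commands = [
        ["ethtool", nic_name],         # link settings and advertised modes
        ["ethtool", "-i", nic_name],   # driver and firmware information
        ["ethtool", "-m", nic_name],   # transceiver module information
    ]
    if mlx_device:
        commands.append(["mlxlink", "-d", mlx_device, "-m"])
        commands.append(["mlxlink", "-d", mlx_device, "-e"])
    return commands


if __name__ == "__main__":
    # Placeholder names; substitute the real interface and device IDs.
    for command in server_collection_commands("eth0", "mlx5_0"):
        print(shlex.join(command))
```

Each entry is an argument list, so it can be passed to `subprocess.run` unchanged if you prefer to capture the output into a file for the support case.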