SU540: Chelsio T6 NIC errors cause system shutdown when upgrading from 40G to 100G network switches

Last Updated:
2024/3/22 07:03:14

Summary

[Impact Critical: Possible cluster data outage]

After Chelsio T6-based Ethernet ports are converted from 40GbE to 100GbE, a continuously high number of CRC errors is reported because the ports generate corrupted Ethernet packets. These errors can lead to a system disruption.

Issue Description

  • After T6-based Ethernet ports are converted from 40GbE to 100GbE, a continuously high number of CRC errors is reported because of corrupted Ethernet packets.
  • Link parameters are not cleared after a 40GbE to 100GbE port conversion, resulting in the generation of malformed packets.
  • In some cases, the receipt of these corrupted packets can lead to a system disruption.
  • Port speed changes can occur in the following example scenarios:
    • 40GbE cluster switches or Ethernet data switches are replaced with 100GbE models
    • Cluster ports are temporarily configured at 40GbE for a storage system upgrade, but the final port speed configuration is 100GbE

Symptom

After 40GbE switches are replaced with 100GbE switches, the "CRC errors" and "Long frames" counters in the RECEIVE section of the ifstat output keep incrementing because of malformed Ethernet packets.

cluster1::> system node run -node local -command "ifstat -a"

RECEIVE
Total frames: 292g   | Total bytes: 1485t     | Total errors: 8746
Total discards: 1276 | Multi/broadcast: 4612k | No buffers: 0
CRC errors: 8564     | Runt frames: 0         | Fragment: 0
Long frames: 182     | Jabber: 0              | Alignment errs: 0
over/underruns: 0    | Xon: 0                 | Xoff: 0
Jumbo: 193g
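
To confirm that the counters are still climbing rather than left over from an earlier event, the output can be captured twice and compared. The following Python sketch is illustrative only and is not a NetApp tool: the function names `parse_counters` and `rising_errors` are hypothetical, and the decimal meaning assumed for the k/m/g/t suffixes is an assumption, not something this bulletin states.

```python
import re

# ASSUMPTION: abbreviated counters use decimal multipliers (k=10^3, ... t=10^12).
_SUFFIX = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}

# Sample RECEIVE section copied from the Symptom output above.
SAMPLE = """RECEIVE
Total frames: 292g   | Total bytes: 1485t     | Total errors: 8746
Total discards: 1276 | Multi/broadcast: 4612k | No buffers: 0
CRC errors: 8564     | Runt frames: 0         | Fragment: 0
Long frames: 182     | Jabber: 0              | Alignment errs: 0
over/underruns: 0    | Xon: 0                 | Xoff: 0
Jumbo: 193g
"""


def parse_counters(text):
    """Return a dict mapping each 'name: value' counter to an integer."""
    counters = {}
    for line in text.splitlines():
        for field in line.split("|"):
            match = re.match(r"\s*([^:|]+?):\s*(\d+)([kmgt]?)\s*$", field)
            if match:
                name, digits, suffix = match.groups()
                counters[name.strip()] = int(digits) * _SUFFIX.get(suffix, 1)
    return counters


def rising_errors(before, after, names=("CRC errors", "Long frames")):
    """Return the increase of each named counter between two parsed samples."""
    increases = {}
    for name in names:
        delta = after.get(name, 0) - before.get(name, 0)
        if delta > 0:
            increases[name] = delta
    return increases


if __name__ == "__main__":
    first = parse_counters(SAMPLE)
    print(first["CRC errors"], first["Long frames"])
```

Passing two samples taken a few minutes apart to `rising_errors` returns a non-empty dict when the CRC or long-frame counters are still incrementing, which matches the symptom described here.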

Workaround

Take over and give back (reboot) the affected nodes to correctly reset link parameters on the T6 ports.

Solution

ONTAP 9.14.1, 9.13.1P8, 9.12.1P11, and later releases contain new NIC firmware that resolves BUG ID 1570339.
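
When many clusters need to be checked, the release list above can be encoded as a small comparison. This Python sketch is a hedged illustration: `has_fix` is a hypothetical helper, the version string format `X.Y.Z` or `X.Y.ZPn` is assumed, and "later releases" is interpreted as any release at or above 9.14.1 plus later patch levels of 9.13.1 and 9.12.1. Confirm against the official release notes before relying on it.

```python
import re

# Minimum patch level that contains the fix, per release family
# (taken from the Solution statement above).
FIXED_PATCH = {(9, 12, 1): 11, (9, 13, 1): 8}
# ASSUMPTION: every release at or above this one contains the fix.
FIRST_FIXED_RELEASE = (9, 14, 1)


def has_fix(version):
    """Return True if an ONTAP version string such as '9.13.1P8' has the fix."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:P(\d+))?", version.strip())
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in match.groups()[:3])
    patch = int(match.group(4) or 0)
    if release >= FIRST_FIXED_RELEASE:
        return True
    needed = FIXED_PATCH.get(release)
    return needed is not None and patch >= needed


if __name__ == "__main__":
    for candidate in ("9.12.1P10", "9.12.1P11", "9.13.1P7", "9.14.1"):
        print(candidate, has_fix(candidate))
```

Releases that the Solution statement does not mention, such as anything earlier than 9.12.1, are reported as not fixed by this sketch.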

Additional Information

BUG ID 1570339