SU543: [Impact High] Shelf firmware repeatedly upgrades, system displays wrong disk names, or SAS shelf environmental status updates delayed after upgrade to ONTAP 9.12.1P4 or 9.13.1

Last Updated:
2024/3/30 03:17:23

Summary

[Impact High]

  • A bug introduced in ONTAP 9.12.1P4 and 9.13.1 results in stale information reported to ONTAP regarding SAS storage devices.
  • This can manifest itself in a number of ways, including SAS shelf firmware repeatedly being upgraded, replaced drives reporting incorrect disk information, and delayed SAS shelf environmental information updates reported by ONTAP.
  • This issue only impacts FAS, AFF, or ASA storage systems with SAS storage devices that are running ONTAP 9.12.1P4 or ONTAP 9.13.1.
  • Customers with systems exposed to this issue are advised to upgrade those systems to a release where this issue is fixed (see the Solution section).

This issue is tracked in bug ID 1557006.

Issue Description

Stale SAS environmental shelf data might be processed and displayed by ONTAP versions 9.12.1P4 or 9.13.1. As a result, invalid EMS events might be triggered or shelf maintenance requests might not be properly handled. This can manifest as a number of issues:

  • Issue 1: Shelf firmware updates occur every 30 minutes.
  • Issue 2: A system displays the disk name as <node_name>:<disk_path> after a disk replacement is performed.
  • Issue 3: SAS environmental shelf information updates are delayed in ONTAP 9.12.1P4 and 9.13.1.
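
As a rough illustration of Issue 1, the sketch below scans EMS log lines for sfu.downloadStarted events and flags a run of updates spaced about 30 minutes apart. It assumes each line carries a timestamp in the layout used by the EMS example in this bulletin (for example "Fri Jun 23 15:00:07 +0900"), which has no year, so the caller supplies one. The function names, the 5-minute tolerance, and the requirement of two consecutive matching gaps are illustrative assumptions, not NetApp tooling.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "<Day> <Mon> <DD> <HH:MM:SS> <+ZZZZ> [node: process: event:severity]: ..."
EMS_LINE = re.compile(
    r"(?P<ts>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} [+-]\d{4}) "
    r"\[[^:\]]+: [^:\]]+: (?P<event>[\w.]+):\w+\]"
)


def download_start_times(lines, year):
    """Return sorted timestamps of sfu.downloadStarted events found in lines."""
    times = []
    for line in lines:
        match = EMS_LINE.search(line)
        if match and match.group("event") == "sfu.downloadStarted":
            stamp = f"{year} {match.group('ts')}"
            times.append(datetime.strptime(stamp, "%Y %a %b %d %H:%M:%S %z"))
    return sorted(times)


def looks_like_update_loop(times, period=timedelta(minutes=30),
                           tolerance=timedelta(minutes=5), min_gaps=2):
    """True if at least min_gaps consecutive gaps are within tolerance of period."""
    run = 0
    for earlier, later in zip(times, times[1:]):
        if abs((later - earlier) - period) <= tolerance:
            run += 1
            if run >= min_gaps:
                return True
        else:
            run = 0
    return False
```

A single firmware update after an upgrade is normal; only a repeating pattern points toward this bug, which is why the check requires more than one matching gap.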

Symptom

  • Issue 1 example:

    [node-02: dsa_disc: sfu.firmwareDownrev.shelf:error]: Shelf 0a.shelf0 has downrev firmware.
    [node-02: dsa_disc: sfu.firmwareDownrev.shelf:error]: Shelf 0a.shelf1 has downrev firmware.
    [node-02: dsa_sfu: sfu.firmwareDownrev:error]: Disk shelf firmware needs to be updated on 2 disk shelves.
    [node-02: dsa_sfu: sfu.downloadStarted:info]: Update of disk shelf firmware started on 2 shelves.
    [node-02: dsa_worker1: sfu.ctrllerElmntsPerShelf:info]: [storage download shelf]: 2 ES controller elements can be updated on 0b.shelf0.
    [node-02: dsa_worker1: sfu.ctrllerElmntsPerShelf:info]: [storage download shelf]: 2 ES controller elements can be updated on 0b.shelf1.
    [node-02: dsa_worker1: sfu.downloadingController:info]: [storage download shelf]: Downloading IOM12E.0250.SFW on disk shelf controller module A on 0b.shelf0.
    [node-02: dsa_worker1: sfu.downloadingController:info]: [storage download shelf]: Downloading IOM12A.0310.SFW on disk shelf controller module A on 0b.shelf1.
    [node-02: dsa_sfu: sfu.rebootRequest:info]: Issuing a request to reboot disk shelf 0b.shelf0 module A.
    [node-02: dsa_sfu: sfu.rebootRequest:info]: Issuing a request to reboot disk shelf 0b.shelf1 module A.
    [node-02: dsa_sfu: sfu.adapterSuspendIO.ndu:info]: Suspending SMP to SAS adapter 0b for 35 seconds while shelf firmware is updated.
    [node-02: dsa_sfu: sfu.downloadingController:info]: [storage download shelf]: Downloading IOM12E.0250.SFW on disk shelf controller module B on 0a.shelf0.
    [node-02: dsa_sfu: sfu.downloadingController:info]: [storage download shelf]: Downloading IOM12A.0310.SFW on disk shelf controller module B on 0a.shelf1.
    [node-02: dsa_sfu: sfu.rebootRequest:info]: Issuing a request to reboot disk shelf 0a.shelf0 module B.
    [node-02: dsa_sfu: sfu.rebootRequest:info]: Issuing a request to reboot disk shelf 0a.shelf1 module B.
    [node-02: dsa_sfu: sfu.adapterSuspendIO.ndu:info]: Suspending SMP to SAS adapter 0a for 35 seconds while shelf firmware is updated.
    [node-02: dsa_sfu: sfu.downloadSuccess:info]: [storage download shelf]: Firmware file IOM12A.0310.SFW downloaded on 0a.shelf1.
    [node-02: dsa_sfu: sfu.downloadSuccess:info]: [storage download shelf]: Firmware file IOM12E.0250.SFW downloaded on 0a.shelf0.
    [node-02: dsa_sfu: sfu.downloadSummary:info]: Shelf firmware updated on 2 shelves.
    [node-02: storlog_admin: sla.shelf.message:debug]: params: {'type': 'SEVERITY', 'log': 'Thu Jan 1 00:00:00 1970 ( 0+00:00:00.501); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 3-Internal software reset'}
    [node-02: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module A in shelf: 0b.00.99.0, log: (...) 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 3-Internal software reset
    [node-02: storlog_admin: sla.shelf.message:debug]: params: {'type': 'SEVERITY', 'log': (...) 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 3-Internal software reset'}
    [node-02: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module A in shelf: 0b.02.99.2, log: (...) 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 3-Internal software reset
    [node-02: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module A in shelf: 0b.03.99.3, log: (...) 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 3-Internal software reset
    [node-02: dsa_disc: sfu.firmwareDownrev.shelf:error]: Shelf 0a.shelf0 has downrev firmware.
    [node-02: dsa_disc: sfu.firmwareDownrev.shelf:error]: Shelf 0a.shelf1 has downrev firmware.
  • Issue 2 example:

    node02::> disk show -spare -owner node01

    Info: This cluster has partitioned disks. To get a complete list of spare disk capacity use "storage aggregate show-spare-disks".
    Original Owner: node01
      Checksum Compatibility: block
                                                                       Usable Physical
    Disk            HA Shelf Bay Chan  Pool   Type  Class        RPM     Size     Size Owner
    --------------- ------------ ----  -----  ----  -----------  ---  -------  ------- ------
    node01:9a.00.17 9a     0  17 A     Pool0  SSD   solid-state    -  13.97TB  13.97TB node01

    Device          HA SHELF BAY CHAN  Disk Vital Product Information
    --------------- ------------ ----  ------------------------------
    9a.00.17        9a     ?   ? SA:A  21G0A00LT1JH


    The following may also be reported:
    Incorrect drive label on shelf 0 bay 17 drive node01:9a.00.17

  • Issue 3 example:

    The shelf log reports the failure immediately (time reported in GMT; 05:07 GMT equates to 14:07 local time):

    Fri Jun 23 05:07:07 2023 (0+02:16:17.705); 030B005B; M0; ENC_MGT; power_manager; 04; PCM 2 faults indicate loss of power (913W)
    Fri Jun 23 05:07:07 2023 (0+02:16:17.705); 030B005D; M0; ENC_MGT; power_manager; 04; PCM 2 faults indicate loss of local fan power


    However, the EMS log reports the same failure almost one hour later (local time reported):

    [?] Fri Jun 23 15:00:07 +0900 [ds03n1: dsa_worker3: ses.status.psWarning:error]: DS212-12 (S/N XXXXX245000199) shelf 1 on channel 0b power warning for Power supply 2: warning status; DC undervoltage. This module is on the rear of the shelf at the bottom right.
    [?] Fri Jun 23 15:00:34 +0900 [ds03n1: dsa_worker3: ses.status.psError:alert]: DS212-12 (S/N XXXXX2245000199) shelf 1 on channel 0b power error for Power supply 2: critical status; power supply error. This module is on the rear of the shelf at the bottom right.
    [?] Fri Jun 23 15:00:34 +0900 [ds03n1: dsa_worker3: callhome.shlf.power.intr:error]: Call home for SHELF POWER INTERRUPTED
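
The delay in the Issue 3 example can be worked out by converting both timestamps to a common time zone. The sketch below does this for the first shelf log entry and the first EMS entry shown above, assuming the shelf log is in GMT (as stated) and that the EMS entry is from 2023, since the EMS timestamp does not include a year. The function names are illustrative only.

```python
from datetime import datetime, timezone


def shelf_log_time(text):
    """Parse a shelf log timestamp such as 'Fri Jun 23 05:07:07 2023' (GMT)."""
    parsed = datetime.strptime(text, "%a %b %d %H:%M:%S %Y")
    return parsed.replace(tzinfo=timezone.utc)


def ems_time(text, year):
    """Parse an EMS timestamp such as 'Fri Jun 23 15:00:07 +0900'; year is supplied by the caller."""
    return datetime.strptime(f"{year} {text}", "%Y %a %b %d %H:%M:%S %z")


def reporting_delay(shelf_text, ems_text, year):
    """Time between the shelf logging a fault and ONTAP reporting it in EMS."""
    return ems_time(ems_text, year) - shelf_log_time(shelf_text)


delay = reporting_delay("Fri Jun 23 05:07:07 2023", "Fri Jun 23 15:00:07 +0900", 2023)
print(delay)  # 0:53:00
```

For this example the power supply fault reached EMS 53 minutes after the shelf recorded it.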

Workaround

  • There is no effective workaround.
  • A takeover and giveback of the affected nodes can temporarily mitigate these issues when they occur. However, this does not resolve the underlying problem: subsequent storage issues can still be masked by this bug. To resolve the underlying problem, upgrade to a release where this bug is fixed.

Solution

Upgrade to a version of ONTAP where bug ID 1557006 is fixed.

  • For ONTAP 9.12.1, bug ID 1557006 is fixed in ONTAP 9.12.1P5 and later (9.12.1P7 or later is recommended).
  • For ONTAP 9.13.1, bug ID 1557006 is fixed in ONTAP 9.13.1P1 and later (9.13.1P3 or later is recommended).
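
The version rules above can be expressed as a small check, for example when triaging a list of systems. This is a minimal sketch that encodes only the releases named in this bulletin; the function names and status strings are illustrative assumptions, and version strings are assumed to look like "9.12.1P4" or "9.13.1" (no patch suffix is treated as patch level 0).

```python
import re

# Values taken from this bulletin; releases not listed here are not evaluated.
AFFECTED = {"9.12.1": {4}, "9.13.1": {0}}
FIRST_FIXED = {"9.12.1": 5, "9.13.1": 1}
RECOMMENDED = {"9.12.1": 7, "9.13.1": 3}

VERSION = re.compile(r"^(\d+\.\d+\.\d+)(?:P(\d+))?$")


def parse_version(version):
    """Split '9.12.1P4' into ('9.12.1', 4); a missing patch suffix counts as 0."""
    match = VERSION.match(version.strip())
    if not match:
        raise ValueError(f"unrecognized ONTAP version string: {version!r}")
    return match.group(1), int(match.group(2) or 0)


def assess(version):
    """Classify a version against the affected and fixed releases in this bulletin."""
    release, patch = parse_version(version)
    if patch in AFFECTED.get(release, set()):
        return "affected"
    if release in FIRST_FIXED and patch >= RECOMMENDED[release]:
        return "recommended"
    if release in FIRST_FIXED and patch >= FIRST_FIXED[release]:
        return "fixed"
    return "not listed"
```

For instance, assess("9.12.1P4") returns "affected" and assess("9.13.1P1") returns "fixed". A "not listed" result only means that this bulletin does not name the release.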

Additional Information

Public Report

The following KB article contains more information:

Active IQ System Risk Detection:

For customers who have enabled AutoSupport on their storage systems, the Active IQ Portal provides detailed System Risk reports at the customer, site, and system levels. The reports show systems that have specific risks, along with severity levels and mitigation action plans.

Important: The purpose of this communication is for NetApp to notify its installed base end users about urgent and important product information that may affect product performance or reliability. The information contained herein and the distribution lists are NetApp confidential materials that are subject to restrictions on redistribution and that cannot be shared outside of this e-mail distribution list.

***************************************************
*** NETAPP CONFIDENTIAL – FOR LIMITED USE ONLY ***
***************************************************