SU496: [Impact: High] SSD firmware (X40??S172B) to prevent data unavailability

Last Updated:
4/25/2022, 11:42:24 PM

Summary

[Impact: High = Node disruption and potential loss of data access]

NetApp® has identified that the drive models listed in the table below fail at a higher rate than other drives shipped by NetApp. To mitigate the issue, NetApp has released a drive firmware fix that can be applied non-disruptively. The updated firmware is available from the Disk Drive Firmware Download page on the NetApp Support site.

Update the affected drives, identified by the part numbers and drive identifiers below, to at least the firmware version listed (NA06 or NA56, depending on the model):

Part Number   Drive Identifier    Capacity   New Firmware
X4001A        X4001S172B1T9NTE    1.9TB      NA56
X4002A        X4002S172B3T8NTE    3.8TB      NA56
X4004A        X4004S172B7T6NTE    7.6TB      NA56
X4010A        X4010S172B1T9NTE    1.9TB      NA56
X4011A        X4011S172B3T8NTE    3.8TB      NA56
X4012A        X4012S172B3T8NTE    3.8TB      NA06
X4013A        X4013S172B7T6NTE    7.6TB      NA56
X4014A        X4014S172B15TNTE    15.3TB     NA56
X4016A        X4016S172B3T8NTE    3.8TB      NA56
X4018A        X4018S172B1T9NTE    1.9TB      NA56
X4019A        X4019S172B15TNTE    15.3TB     NA06
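
The table lookup can be sketched in a few lines of Python. This is an illustration only, not a NetApp tool: the names MIN_FIRMWARE and needs_update are hypothetical, and the comparison assumes that firmware revisions take the form "NA" followed by two digits and that a higher number on the same drive model is newer. The bulletin does not state that ordering explicitly.

```python
import re

# Minimum firmware per drive identifier, copied from the table in this bulletin.
MIN_FIRMWARE = {
    "X4001S172B1T9NTE": "NA56",
    "X4002S172B3T8NTE": "NA56",
    "X4004S172B7T6NTE": "NA56",
    "X4010S172B1T9NTE": "NA56",
    "X4011S172B3T8NTE": "NA56",
    "X4012S172B3T8NTE": "NA06",
    "X4013S172B7T6NTE": "NA56",
    "X4014S172B15TNTE": "NA56",
    "X4016S172B3T8NTE": "NA56",
    "X4018S172B1T9NTE": "NA56",
    "X4019S172B15TNTE": "NA06",
}

# ASSUMPTION: revisions look like "NA" + two digits and order numerically.
_FW_PATTERN = re.compile(r"^NA(\d{2})$")


def _fw_number(revision):
    """Extract the numeric part of a revision such as 'NA56'."""
    match = _FW_PATTERN.match(revision)
    if match is None:
        raise ValueError(f"unrecognized firmware revision: {revision!r}")
    return int(match.group(1))


def needs_update(drive_id, current_fw):
    """Return True if a listed drive runs firmware older than its minimum."""
    minimum = MIN_FIRMWARE.get(drive_id)
    if minimum is None:
        return False  # drive identifier is not listed in this bulletin
    return _fw_number(current_fw) < _fw_number(minimum)
```

For example, needs_update("X4001S172B1T9NTE", "NA55") returns True under these assumptions, while a drive identifier that is not in the table always returns False.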

Issue Description

Drives running firmware versions earlier than NA06/NA56 are at risk of a higher-than-expected failure rate because of potential single-bit parity errors, which can lead to data disruption or unavailability.

Symptom

Output similar to the following might indicate this issue:

[NODE1: scsi_cmdblk_strthr_admin: scsi.cmd.checkCondition:error]: Disk device e0d.00.1.1L0: Check Condition: CDB 0x5e:01: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x82)(176000).

[NODE1: scsi_cmdblk_strthr_admin: disk.readReservationFailed:error]: Disk read reservation failed on e0d.00.1.1P3 CDB 0x5e:01 - SCSI:not ready (2 4 1)

[NODE1: sanown_io: diskown.errorDuringIO:error]: error 13 (fatal disk error) on disk e0c.00.2.1P3 (S/N) while reading reservation state
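
As a rough aid for scanning saved logs, the following Python sketch looks for the three event names shown above. The regular expression is inferred only from these sample messages, so real log lines may carry prefixes or fields it does not handle; SYMPTOM_EVENTS, EVENT_RE, and find_symptoms are hypothetical names chosen for illustration.

```python
import re

# Event names taken from the sample messages in this bulletin.
SYMPTOM_EVENTS = (
    "scsi.cmd.checkCondition",
    "disk.readReservationFailed",
    "diskown.errorDuringIO",
)

# ASSUMPTION: messages follow "[node: source: event:severity]: text",
# as in the three samples; other layouts are not matched.
EVENT_RE = re.compile(
    r"\[(?P<node>[^:\]]+):\s*(?P<source>[^:\]]+):\s*"
    r"(?P<event>[\w.]+):(?P<severity>\w+)\]:\s*(?P<message>.*)"
)


def find_symptoms(lines):
    """Return (node, event, message) tuples for lines matching a symptom event."""
    hits = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match and match.group("event") in SYMPTOM_EVENTS:
            hits.append(
                (match.group("node"), match.group("event"), match.group("message"))
            )
    return hits
```

A match is only a hint that the issue may be present; the same events can have other causes.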

Node Root Aggregate:

If more drives fail than the RAID type tolerates (more than 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the node will not boot because the node root volume is missing. Messages similar to the following may appear on the console:

[{nodename}:raid.assim.tree.noRootVol:error]: No usable root volume was found!

If the number of failed drives is within the RAID tolerance (up to 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the node stays online and RAID reconstruction begins to replace the missing drives.

Data Aggregate:

If more drives fail than the RAID type tolerates (more than 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the aggregate fails and all data volumes within it become unavailable.

The nodeshell command aggr status -r or sysconfig -r shows this failure in its output:

Aggregate {aggrname} (failed, raid_dp, partial) (block checksums)

Plex /{aggrname}/plex0 (offline, failed, inactive)
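
To spot such lines in captured command output, a short Python sketch follows. It assumes only the line layout shown in the sample above, "Aggregate <name> (<comma-separated flags>)", and treats an aggregate as failed when "failed" appears among those flags. AGGR_RE and failed_aggregates are hypothetical names, and real output may contain variations this pattern misses.

```python
import re

# ASSUMPTION: aggregate header lines look like the sample in this bulletin:
#   Aggregate <name> (<flag>, <flag>, ...) (block checksums)
AGGR_RE = re.compile(r"^\s*Aggregate\s+(?P<name>\S+)\s+\((?P<flags>[^)]*)\)")


def failed_aggregates(output):
    """Return names of aggregates whose first flag list contains 'failed'."""
    failed = []
    for line in output.splitlines():
        match = AGGR_RE.match(line)
        if match is None:
            continue  # Plex lines and everything else are ignored
        flags = [flag.strip() for flag in match.group("flags").split(",")]
        if "failed" in flags:
            failed.append(match.group("name"))
    return failed
```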

If the number of failed drives is within the RAID tolerance (up to 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the aggregate stays online and RAID reconstruction begins to replace the missing drives. All data volumes remain available.

Note: Whenever more drives have failed than the RAID tolerance allows, immediate engagement with technical support is strongly recommended.
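
The tolerance rules in this section reduce to a simple comparison, sketched here in Python. The keys raid4, raid_dp, and raid_tec are labels chosen for the example (raid_dp matches the sample output in this bulletin; the other two are assumed by analogy), and RAID_TOLERANCE and aggregate_outcome are hypothetical names. How failed drives are counted, for example per RAID group, is not stated in the bulletin and is left to the caller.

```python
# Drive failures each RAID type tolerates, as stated in this bulletin.
RAID_TOLERANCE = {
    "raid4": 1,     # RAID 4
    "raid_dp": 2,   # RAID-DP
    "raid_tec": 3,  # RAID-TEC
}


def aggregate_outcome(raid_type, failed_drives):
    """Classify the expected outcome for a given number of failed drives."""
    if failed_drives < 0:
        raise ValueError("failed_drives must not be negative")
    tolerance = RAID_TOLERANCE[raid_type]  # KeyError for unknown RAID types
    if failed_drives == 0:
        return "healthy"
    if failed_drives <= tolerance:
        # Within tolerance: stays online while RAID reconstruction runs.
        return "online, reconstructing"
    # Beyond tolerance: aggregate fails; engage technical support.
    return "failed"
```

For instance, two failed drives give "online, reconstructing" for raid_dp but "failed" for raid4.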

Solution

Update ONTAP and/or drive firmware per the above Summary and Issue Description.

Additional Information

See Bug #1338338

In accordance with the Support Services terms, always update NetApp products with the latest version of firmware and software to provide the best reliability, availability, and serviceability.

Hot spare drives: To keep hot spare drives continuously available in the system, maintain the minimum recommended number of hot spares and follow the standard drive replacement process if a drive fails.

Active IQ System Risk Detection:

For customers who have enabled AutoSupport on their storage systems, the Active IQ Portal provides detailed System Risk reports at the customer, site, and system levels. The reports show systems that have specific risks, along with severity levels and mitigation action plans. Drives that are not running the latest firmware are an example of such a risk. Not upgrading to the most current drive firmware could leave the storage appliance vulnerable to undesirable behavior.

Important: The purpose of this communication is for NetApp to notify its installed base end users about urgent and important product information that may affect product performance or reliability. The information contained herein and the distribution lists are NetApp confidential materials that are subject to restrictions on redistribution and that cannot be shared outside of this e-mail distribution list.

***************************************************
*** NETAPP CONFIDENTIAL – FOR LIMITED USE ONLY ***
***************************************************