SU486: [Impact: Critical] SSD (TPM4*) firmware to prevent data loss / unavailability

Last Updated:
4/27/2022, 2:58:56 PM

Summary

[Impact: Critical = Data loss or cluster data outage]

NetApp® has identified that the drive models listed in the table below fail at a higher rate than other drives shipped by NetApp. As a result, NetApp has released a drive firmware fix that can be installed non-disruptively to mitigate the issue. The updated firmware is available from the Drive Firmware Download page on the NetApp Support site.

Update to the minimum drive firmware for these affected drive part numbers and identification strings:

Part Number    Drive Identifier    Capacity   Minimum FW
X/SP-356A-R6   X356_TPM4V3T8AME    3.8TB      NA04
X/SP-357A      X357_TPM4V3T8AME    3.8TB      NA54
X/SP-358A      X358_TPM4V3T8AME    3.8TB      NA04
X/SP-363A-R6   X363_TPM4V3T8AME    3.8TB      NA04
X/SP-364A      X364_TPM4V3T8AME    3.8TB      NA54
X/SP-365A-R6   X365_TPM4V1T6AMD    3.8TB      NA04
X/SP-366A-R6   X366_TPM4V1T6AMD    1.6TB      NA04
X/SP-438A-R6   X438_TPM4V400AMD    400GB      NA04
X/SP-439A-R6   X439_TPM4V1T6AMD    1.6TB      NA04
X/SP-440A-R6   X440_TPM4V800AMD    800GB      NA04
X/SP-447A-R6   X447_TPM4V800AMD    800GB      NA04
X/SP-449A-R6   X449_TPM4V800AMD    800GB      NA04
X/SP-575A-R6   X575_TPM4V400AMD    400GB      NA04
X/SP-576A-R6   X576_TPM4V1T6AMD    1.6TB      NA04
X/SP-577A-R6   X577_TPM4V800AMD    800GB      NA04
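
As a rough illustration of how the table above might be applied, the following Python sketch checks a drive identifier and its current firmware revision against the listed minimum. The function name `needs_update` and the dictionary layout are hypothetical. The sketch also assumes that firmware revisions for one drive model (for example NA01 and NA04) can be ordered by plain string comparison, which this bulletin does not state.

```python
# Minimum firmware per drive identifier, copied from the table in this bulletin.
MIN_FIRMWARE = {
    "X356_TPM4V3T8AME": "NA04",
    "X357_TPM4V3T8AME": "NA54",
    "X358_TPM4V3T8AME": "NA04",
    "X363_TPM4V3T8AME": "NA04",
    "X364_TPM4V3T8AME": "NA54",
    "X365_TPM4V1T6AMD": "NA04",
    "X366_TPM4V1T6AMD": "NA04",
    "X438_TPM4V400AMD": "NA04",
    "X439_TPM4V1T6AMD": "NA04",
    "X440_TPM4V800AMD": "NA04",
    "X447_TPM4V800AMD": "NA04",
    "X449_TPM4V800AMD": "NA04",
    "X575_TPM4V400AMD": "NA04",
    "X576_TPM4V1T6AMD": "NA04",
    "X577_TPM4V800AMD": "NA04",
}


def needs_update(drive_id: str, current_fw: str) -> bool:
    """Return True if a listed drive runs firmware below the listed minimum.

    Assumption (not from the bulletin): revisions of one drive model
    sort correctly when compared as plain strings.
    """
    minimum = MIN_FIRMWARE.get(drive_id)
    if minimum is None:
        return False  # drive model is not covered by this bulletin
    return current_fw < minimum
```

Under that ordering assumption, a drive reporting `X358_TPM4V3T8AME` with `NA01` would be flagged, while one already at `NA04` would not.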

Issue Description

Impactful issues resolved with this firmware release:

  1. Excessive NAND programming / erasure can lead to media errors and drive failure.
  2. An unnecessary increase in the erase counts of some logical blocks during refresh operations can result in avoidable drive failures caused by medium errors.
  3. Unnecessary drive failure due to misreported 03/11/FF error.

Symptom

Messages similar to the following might be indicative of the issue:

[node1: disk_server_1: disk.ioMediumError:notice]: Medium error on disk 3d.12.13: op 0x28:00000008:0008 sector 8 SCSI:medium error - Unrecovered read error - If the disk is in a RAID group, the subsystem will attempt to reconstruct unreadable data (3 11 1 0) (2188) Disk 3d.12.13 Shelf 12 Bay 13 [NETAPP   X358_TPM4V3T8AME NA01]…

[node1: disk_server_0: disk.ioMediumError:notice]: Medium error on disk 0a.01.19: op 0x28:dddcc180:0010 sector 3722232192 SCSI:medium error - Unrecovered read error - If the disk is in a RAID group, the subsystem will attempt to reconstruct unreadable data (3 11 ff 0) (4496) Disk 0a.01.19 Shelf 1 Bay 19 [NETAPP   X357_TPM4V3T8AME NA51]…
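
To scan collected logs for messages of this shape, a small parser along the following lines may help. It is only a sketch based on the two sample messages above: the regular expression, the function name `parse_medium_error`, and the returned field names are hypothetical, and real messages might differ in spacing or wording.

```python
import re

# Pattern derived from the two sample messages in this bulletin only.
MEDIUM_ERROR = re.compile(
    r"disk\.ioMediumError.*?Medium error on disk (?P<disk>\S+?):"
    r".*?\((?P<sense>[0-9a-f]+ [0-9a-f]+ [0-9a-f]+ [0-9a-f]+)\)"
    r".*?\[NETAPP\s+(?P<model>\S+)\s+(?P<firmware>\S+?)\]"
)


def parse_medium_error(line: str):
    """Extract disk name, the first four-value group in parentheses,
    drive identifier and firmware revision from one message.

    Returns a dict of strings, or None if the line does not match.
    """
    match = MEDIUM_ERROR.search(line)
    if match is None:
        return None
    return match.groupdict()
```

The `model` and `firmware` fields can then be compared with the minimum firmware listed in the Summary.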

Node Root Aggregate:

If more drives fail than the RAID type tolerates (more than 1 for RAID 4, more than 2 for RAID-DP, or more than 3 for RAID-TEC), the node will not boot because the node root volume is missing. Messages similar to the following might appear on the console:

[{nodename}:raid.assim.tree.noRootVol:error]: No usable root volume was found!

If the number of failed drives is within the RAID tolerance (up to 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the node stays online and RAID reconstruction begins to replace the missing drives.

Data Aggregate:

If more drives fail than the RAID type tolerates (more than 1 for RAID 4, more than 2 for RAID-DP, or more than 3 for RAID-TEC), the aggregate fails and all data volumes within the aggregate become unavailable.

Nodeshell aggr status -r or sysconfig -r output shows this failure:

Aggregate {aggrname} (failed, raid_dp, partial) (block checksums)

Plex /{aggrname}/plex0 (offline, failed, inactive)

If the number of failed drives is within the RAID tolerance (up to 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the aggregate stays online and RAID reconstruction begins to replace the missing drives. All data volumes remain available.

Note: Whenever more drives have failed than the RAID type tolerates, immediate engagement with technical support is strongly recommended.
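
The tolerance rule described in this section can be summarized in a few lines of Python. This is a minimal sketch rather than anything ONTAP exposes: the function name and the dictionary keys are hypothetical labels (only `raid_dp` appears in the output above), and it assumes the failure count refers to drives within a single RAID group.

```python
# Failed drives each RAID type tolerates, as stated in the text above.
# The key names are illustrative labels, not values returned by any command.
RAID_TOLERANCE = {"raid4": 1, "raid_dp": 2, "raid_tec": 3}


def stays_online(raid_type: str, failed_drives: int) -> bool:
    """Return True if the failure count is within the RAID tolerance.

    Assumption: failed_drives counts failures in one RAID group.
    Raises KeyError for a RAID type not listed above.
    """
    return failed_drives <= RAID_TOLERANCE[raid_type]
```

For example, `stays_online("raid_dp", 2)` is true, while a third failure in the same group exceeds the tolerance and corresponds to the failed aggregate or missing root volume described above.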

Solution

Update the drive firmware as described in the Summary above.

Additional Information

See Bug #1411698

In accordance with the Support Services terms, always update NetApp products with the latest version of firmware and software to provide the best reliability, availability, and serviceability.

Hot spare drives: To best maintain the continuous presence of hot spare drives available in the system, adhere to Hot Spares Best Practices and follow the standard drive replacement process if a drive fails.

Active IQ System Risk Detection:

For customers who have enabled AutoSupport on their storage systems, the Active IQ Portal provides detailed System Risk reports at the customer, site, and system levels. The reports show systems that have specific risks, along with severity levels and mitigation action plans. Drives that are not running the latest firmware are an example of such a risk. Not upgrading to the most current drive firmware could leave the storage appliance vulnerable to undesirable behavior.

Important: The purpose of this communication is for NetApp to notify its installed base end users about urgent and important product information that may affect product performance or reliability. The information contained herein and the distribution lists are NetApp confidential materials that are subject to restrictions on redistribution and that cannot be shared outside of this e-mail distribution list.

***************************************************
*** NETAPP CONFIDENTIAL – FOR LIMITED USE ONLY ***
***************************************************