SU529: [Impact: Critical] SSD firmware for TPM5xxxATE/ATD to prevent potential data loss or disruption/unavailability

Last Updated: 7/21/2023, 8:43:58 AM

Summary

[Impact Critical: Potential for data loss or cluster data outage]

NetApp® has identified that the drive models listed in the tables below fail at a higher rate than other drives shipped by NetApp. As a result, NetApp has implemented a drive firmware fix that can be applied non-disruptively to mitigate the issue. The updated firmware is available from the Drive Firmware Download page on the NetApp Support site.

Update the affected drives, identified by the part numbers and drive identification strings below, to at least the minimum firmware listed:

Part Number    Drive Identifier    Capacity  Firmware
X/SP-357A      X357_TPM5V3T8ATE    3.8TB     NA55
X/SP-319A      X319_TPM5V7T6ATE    7.68TB    NA55
X/SP-371A      X371_TPM5V960ATE    960GB     NA55
X/SP-670A      X670_TPM5V15TATE    15.3TB    NA55
X/SP-379A      X379_TPM5V960ATE    960GB     NA55
X/SP-364A      X364_TPM5V3T8ATE    3.8TB     NA55
X/SP-374A      X374_TPM5V960ATE    960GB     NA55
X/SP-358A ¹    X358_TPM5V3T8ATE    3.84TB    NA55
X/SP-386A ¹    X386_TPM5V960ATE    960GB     NA55

¹ FIPS (Federal Information Processing Standards – Encrypted)

Part Number    Drive Identifier    Capacity  Firmware
X/SP-356A      X356_TPM5V3T8ATE    3.8TB     NA05
X/SP-447A      X447_TPM5V800ATD    800GB     NA05
X/SP-440A/B ¹  X440_TPM5V800ATD    800GB     NA05
X/SP-363A      X363_TPM5V3T8ATE    3.8TB     NA05
X/SP-449A      X449_TPM5V800ATD    800GB     NA05
X/SP-577A ¹    X577_TPM5V800ATD    800GB     NA05

¹ FIPS (Federal Information Processing Standards – Encrypted)
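
To identify whether a system contains any of the affected drives and which firmware they are currently running, output similar to the following can be gathered from the ONTAP clustershell. This is a minimal sketch; the field names are assumed to be available on your ONTAP release, so verify them with the CLI help before relying on them:

::> storage disk show -fields model,firmware-revision

Compare the firmware revision reported for any of the drive identifiers above against the minimum firmware (NA55 or NA05) listed in the tables.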

Issue Description

These drive models use an SSD garbage collection (GC) mechanism that can cause latency and drive timeouts, increasing failure rates. The new firmware changes how the mechanism works to help prevent the drive timeouts that can lead to failure.

Symptom

Messages similar to the following might be indicative of the issue:

Fri Jan 20 23:25:20 -0600 [NODE_0b2f14: scsi_cmdblk_strthr_admin: scsi.cmd.checkCondition:error]: Disk device 9d.31.1: Check Condition: CDB 0x9a:0000000163286600:0001:0200: Sense Data SCSI:aborted command - (0xb - 0x2f 0x14 0x0)(148365)

Sat Feb 11 08:51:01 +0000 [NODE_0314ff: disk_server_1: disk.outOfService:notice]: Drive 1a.00.2 (SERIAL0314FF): sense information: SCSI:medium error(0x03), ASC(0x14), ASCQ(0xff), FRU(0x00). Power-On Hours: 23719, GList Count: 3, Drive Info: Disk 1a.00.2 Shelf 0 Bay 2 [NETAPP X358_TPM5V3T8ATE NA52] S/N [SERIAL0314FF] UID [00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000].

Sat Feb 11 08:09:54 +0000 [NODE_0314ff: disk_server_0: disk.ioMediumError:notice]: Medium error on disk 1a.00.2: op 0x2a:6bdd0200:0200 sector 1809646080 SCSI:medium error - If the disk is in a RAID group, the subsystem will attempt to reconstruct unreadable data (3 14 ff 0) (4494) UID [00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]

Sat Feb 11 08:09:54 +0000 [NODE_0314ff: disk_server_0: disk.ioMediumError:notice]: Medium error on disk 1a.00.2: op 0x2a:6bdd0400:0200 sector 1809646592 SCSI:medium error - If the disk is in a RAID group, the subsystem will attempt to reconstruct unreadable data (3 14 ff 0) (4490) Disk 1a.00.2 Shelf 0 Bay 2 [NETAPP X358_TPM5V3T8ATE NA52] S/N [SERIAL0314FF] UID[00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]

Feb 11 16:42:06 +0000 [NODE_0b2f10: scsi_cmdblk_strthr_admin: scsi.cmd.checkCondition:error]: Disk device 1a.00.6: Check Condition: CDB 0x88:000000019a5ef168:00000008: Sense Data SCSI:aborted command - (0xb - 0x2f 0x10 0x0)(4503).
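
When this issue is suspected, the event log can be queried for the message names that appear in the excerpts above. The following is a sketch using the clustershell; the message names are taken directly from the sample messages, and any additional filters (node, time range) are left to the operator:

::> event log show -message-name scsi.cmd.checkCondition
::> event log show -message-name disk.outOfService
::> event log show -message-name disk.ioMediumError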

Node Root Aggregate:

If more drives have failed than the RAID tolerance allows (more than 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the node will not boot because its root volume is missing. Messages similar to the following may be seen on the console:

[{nodename}:raid.assim.tree.noRootVol:error]: No usable root volume was found!

If the number of failed drives is within the RAID tolerance (up to 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the node will stay online and RAID reconstruction will begin to replace the failed drives.

Data Aggregate:

If more drives have failed than the RAID tolerance allows (more than 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the aggregate will fail and all data volumes within the aggregate will be unavailable.

Nodeshell aggr status -r or sysconfig -r output shows this failure:

Aggregate {aggrname} (failed, raid_dp, partial) (block checksums)

    Plex /{aggrname}/plex0 (offline, failed, inactive)

If the number of failed drives is within the RAID tolerance (up to 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the aggregate will stay online and RAID reconstruction will begin to replace the failed drives. All data volumes will remain available.

Note: In any case where more drives have failed than the RAID tolerance allows, immediate engagement with technical support is strongly recommended.
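
To gauge how many drives have failed relative to the RAID tolerance, and whether an aggregate is degraded or failed, checks similar to the following can be run from the clustershell (a sketch; the nodeshell aggr status -r output shown above gives the equivalent per-RAID-group view):

::> storage disk show -broken
::> storage aggregate show -fields state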

Solution

Download the updated firmware for your drives from the Drive Firmware Download page on the NetApp Support site. Automatic drive firmware downloads and updates are also available in the current recommended release, ONTAP 9.10.1.
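
As a sketch of how the automatic path can be confirmed: ONTAP applies drive firmware non-disruptively in the background when background disk firmware updates are enabled. The option name below is assumed from the nodeshell documentation, so verify it against your release before use:

node> options raid.background_disk_fw_update.enable

After the update completes, storage disk show -fields model,firmware-revision should report at least the minimum firmware from the tables above (NA55 or NA05, depending on the model) for the affected drives.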

Additional Information

See Bug 1529178

In accordance with the Support Services terms, always update NetApp products with the latest version of firmware and software to provide the best reliability, availability, and serviceability:

Hot spare drives: To maintain the continuous availability of hot spare drives in the system, keep at least the minimum recommended number of hot spares and follow the standard drive replacement process when a drive fails.
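
A quick way to confirm that spares remain available is to list the spare disks per node from the clustershell, for example (a minimal sketch):

::> storage aggregate show-spare-disks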

Active IQ System Risk Detection:

For customers who have enabled AutoSupport on their storage systems, the Active IQ Portal provides detailed System Risk reports at the customer, site, and system levels. The reports show systems that have specific risks, along with severity levels and mitigation action plans. Drives that are not running the latest firmware are one example of such a risk. Not upgrading to the most current drive firmware could leave the storage appliance vulnerable to undesirable behavior.