SU448: [Impact: Critical] SSD (PHM2*) firmware to prevent data loss / unavailability

Last Updated:
2024/5/16 22:44:10


Summary

[Impact: Critical = Data loss or cluster data outage]

NetApp® has identified that the drive models listed in the table below will fail if they are power-cycled after accumulating 70,000 power-on hours (approximately 8 years of use).

To mitigate the issue, NetApp has released a drive firmware fix that can be applied non-disruptively. The updated firmware is available from the Drive Firmware Download page on the NetApp Support site.

Update the affected drives, identified by the part numbers and drive identifiers below, to drive firmware NA05 or later:

Part Number         Drive Identifier    Capacity
X/SP-438A-R6        X438_PHM2400MCTO    400GB
X/SP-439A-R6        X439_PHM21T6MCTO    1.6TB
X/SP-440A-R6 [1]    X440_PHM2800MCTO    800GB
X/SP-446A-R6        X446_PHM2200MCTO    200GB
X/SP-446B-R6        X446_PHM2200MCTO    200GB
X/SP-447A-R6        X447_PHM2800MCTO    800GB
X/SP-448A-R6        X448_PHM2200MCTO    200GB
X/SP-449A-R6        X449_PHM2800MCTO    800GB
X/SP-575A-R6        X575_PHM2400MCTO    400GB
X/SP-576A-R6        X576_PHM21T6MCTO    1.6TB
X/SP-577A-R6 [1]    X577_PHM2800MCTO    800GB

[1] FIPS (Federal Information Processing Standards - Encrypted)
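The check implied by the table can be sketched as a small script. This is an illustrative sketch only, not a NetApp tool: the function names are hypothetical, the model and firmware strings are assumed to be copied by hand from your own drive inventory, and the comparison assumes firmware revisions have the form "NA" followed by digits and increase numerically.

```python
# Drive identifiers copied from the table in this bulletin.
AFFECTED_MODELS = {
    "X438_PHM2400MCTO", "X439_PHM21T6MCTO", "X440_PHM2800MCTO",
    "X446_PHM2200MCTO", "X447_PHM2800MCTO", "X448_PHM2200MCTO",
    "X449_PHM2800MCTO", "X575_PHM2400MCTO", "X576_PHM21T6MCTO",
    "X577_PHM2800MCTO",
}

# Minimum fixed firmware revision stated in this bulletin.
MINIMUM_FIRMWARE = "NA05"


def revision_number(firmware: str) -> int:
    """Return the numeric part of a revision such as 'NA05'.

    Assumption: revisions look like 'NA' followed by digits only.
    """
    fw = firmware.strip().upper()
    if not (fw.startswith("NA") and fw[2:].isdigit()):
        raise ValueError(f"unrecognized firmware revision: {firmware!r}")
    return int(fw[2:])


def needs_update(model: str, firmware: str) -> bool:
    """True if the drive is an affected model running firmware below NA05."""
    if model.strip().upper() not in AFFECTED_MODELS:
        return False
    return revision_number(firmware) < revision_number(MINIMUM_FIRMWARE)


if __name__ == "__main__":
    # Hypothetical inventory rows: (drive identifier, firmware revision).
    inventory = [("X438_PHM2400MCTO", "NA04"), ("X447_PHM2800MCTO", "NA05")]
    for model, firmware in inventory:
        status = "UPDATE REQUIRED" if needs_update(model, firmware) else "ok"
        print(f"{model} {firmware}: {status}")
```

Drives whose identifier is not in the table are reported as not needing this particular update, regardless of their firmware string.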

Issue Description

SSD internal logs are periodically recorded and have an upper limit of 70,000 entries (about eight years), after which logging stops. The SSD continues to operate after reaching the 70,000 limit until power is turned OFF/ON, after which subsequent Read/Write commands return an error and user data is inaccessible.
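The "about eight years" figure follows from simple arithmetic, assuming one log entry is written per power-on hour (an assumption consistent with the 70,000 power-on-hour threshold stated in the Summary). The sketch below also estimates the remaining margin for a drive; the function names and the sample power-on-hour value are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours, ignoring leap days
LOG_ENTRY_LIMIT = 70_000    # upper limit stated in this bulletin


def years_to_limit(limit: int = LOG_ENTRY_LIMIT) -> float:
    """Years of continuous operation needed to reach the log limit.

    Assumption: exactly one log entry per power-on hour.
    """
    return limit / HOURS_PER_YEAR


def hours_remaining(power_on_hours: int) -> int:
    """Power-on hours left before the limit is reached (never negative)."""
    return max(0, LOG_ENTRY_LIMIT - power_on_hours)


if __name__ == "__main__":
    print(f"Limit reached after about {years_to_limit():.2f} years")
    # Hypothetical drive that has been powered on for 61,240 hours.
    print(f"Hours remaining: {hours_remaining(61_240)}")
```

A drive that has already passed the limit shows zero hours remaining; per the description above, it keeps working only until its next power cycle.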

Symptom

Messages similar to the following might be indicative of the issue:

Node Root Aggregate:

If the number of impacted drives exceeds the RAID tolerance (more than 1 for RAID 4, more than 2 for RAID-DP, or more than 3 for RAID-TEC), the node will not boot because the node root volume is missing.

Messages similar to this may be seen on the console: [{nodename}:raid.assim.tree.noRootVol:error]: No usable root volume was found!

If the number of impacted drives is within the RAID tolerance (up to 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the node stays online and RAID reconstruction begins to replace the missing drives.

Data Aggregate:

If the number of impacted drives exceeds the RAID tolerance (more than 1 for RAID 4, more than 2 for RAID-DP, or more than 3 for RAID-TEC), the aggregate is marked failed and partial, and all data volumes within the aggregate become unavailable.

The output of aggr status -r or sysconfig -r resembles the following:

Aggregate {aggrname} (failed, raid_dp, partial) (block checksums)

Plex /{aggrname}/plex0 (offline, failed, inactive)

If the number of impacted drives is within the RAID tolerance (up to 1 for RAID 4, 2 for RAID-DP, or 3 for RAID-TEC), the aggregate stays online and RAID reconstruction begins to replace the missing drives. All data volumes remain available.

Note: In any event where more drives are impacted than the RAID tolerance allows, immediate engagement with technical support is strongly recommended.
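The outcomes described in this section can be summarized as a small decision function. This is a hedged sketch for reasoning about exposure, not a diagnostic tool: the function and key names are hypothetical, and it assumes the failed-drive count refers to drives within a single RAID group, since parity protection applies per RAID group.

```python
# Number of simultaneous drive failures each RAID type tolerates,
# as listed in this bulletin.
RAID_TOLERANCE = {"raid4": 1, "raid_dp": 2, "raid_tec": 3}


def expected_outcome(raid_type: str, failed_drives: int, is_root: bool = False) -> str:
    """Map a failed-drive count to the outcome described in this bulletin.

    Assumption: failed_drives counts drives in one RAID group.
    """
    if failed_drives < 0:
        raise ValueError("failed_drives must not be negative")
    tolerance = RAID_TOLERANCE[raid_type]
    if failed_drives == 0:
        return "unaffected"
    if failed_drives <= tolerance:
        return "online, RAID reconstruction begins"
    if is_root:
        return "node will not boot (no usable root volume)"
    return "aggregate failed, data volumes unavailable"


if __name__ == "__main__":
    print(expected_outcome("raid_dp", 2))                  # within tolerance
    print(expected_outcome("raid_dp", 3))                  # data aggregate lost
    print(expected_outcome("raid_tec", 4, is_root=True))   # root aggregate lost
```

Because drives installed together tend to reach the power-on-hour limit together, a single power cycle can push several drives in one RAID group past the tolerance at once, which is why the firmware update should not wait for a first failure.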

Solution

Upgrade drive firmware as soon as possible.

Additional Information

See Bug # 1335350

In accordance with the Support Services terms, always update NetApp products with the latest version of firmware and software to provide the best reliability, availability, and serviceability.

Hot spare drives: To best maintain the continuous presence of hot spare drives available in the system, adhere to Hot Spares Best Practices and follow the standard drive replacement process if a drive fails.

Active IQ System Risk Detection:

For customers who have enabled AutoSupport on their storage systems, the Active IQ Portal provides detailed System Risk reports at the customer, site, and system levels. The reports show systems that have specific risks, along with severity levels and mitigation action plans. Drives that are not running the latest firmware are an example of such a risk. Not upgrading to the most current drive firmware could leave the storage appliance vulnerable to undesirable behavior.

Important: The purpose of this communication is for NetApp to notify its installed base end users about urgent and important product information that may affect product performance or reliability. The information contained herein and the distribution lists are NetApp confidential materials that are subject to restrictions on redistribution and that cannot be shared outside of this communication.

***************************************************
*** NETAPP CONFIDENTIAL – FOR LIMITED USE ONLY ***
***************************************************