SU490: [Impact: Critical] SSD Best Practices: Avoid risk of drive failure and data loss if powered off

Last Updated:
2024/5/23 02:07:18

Summary

This bulletin describes best practices for removing power from enterprise-class solid-state drives (SSDs) for extended periods, to avoid data loss and minimize the risk of hardware failure.

This bulletin applies to any SSD, including those from NetApp. For a complete list of SSD part numbers, refer to the NetApp Hardware Universe (HWU) for drives:

  • Select "Search by OS and Drive Type".
  • Choose your OS.
  • Select these Drive Types:
    • Capacity Flash NVMe SSD
    • NVMe SSD
    • SSD
  • Click "Show Results".

Issue Description

NetApp provides its guidance on this issue in the Solution section of this article. The rest of this Issue Description is provided only as a courtesy; it summarizes industry-standard SSD characteristics published by the technology association cited below. All of the testing described was done on non-QLC drives.

SSDs based on NAND flash memory slowly leak charge when left without power for long periods. The JEDEC Solid State Technology Association standard for SSDs, JESD218B.02, shows powered-off data retention of 3 months at 40°C (104°F) for the "Enterprise" application class of SSD, in testing that writes at least the drive's full capacity to the drive (for example, 15TB of writes for a 15TB drive).

Additional technical details are available in this presentation from JEDEC (no login required).

Based on this standard, should Enterprise SSDs remain unpowered for more than 3 months, the data might not be recoverable when power is reapplied. The amount of time until data is at risk varies greatly depending on the amount of wear on the SSDs and the temperature of the environment in which they are stored. Note that QLC-class media is expected to have survivability below previous media types such as TLC.

See the "Enterprise" entries in the following example from the same standard, which show how much retention varies with temperature alone.

Content below directly excerpted from: JESD218B.02

SU490_JEDEC_Temperatures_2023-04-11.png

Source of table above: JESD218B.02

Data retention of powered-down SSD media depends on many factors, including the type of flash memory (for example, TLC, MLC, or QLC), how much of the drive's useful life has been consumed by wear, and the storage temperature. NetApp makes no guarantees, either implied or expressed, that data retention of powered-down SSD media will meet any specific extended duration.

NetApp recommends best practices that can protect customers who deploy Enterprise SSDs and plan to remove power for extended periods of time. NetApp's recommendations are conservative and favor pre-emptive risk mitigation before the 3-month mark.

Enterprise SSDs are considered unpowered when all power has been removed from the hosting shelf or enclosure. If the system has no other method of complete power removal, this includes disconnecting the power cables.

Symptom

SSDs are reported as failed after being powered off for an extended period of time, as described in the Issue Description. The entries below are general failure examples and do not, by themselves, indicate that a failure was caused by this problem.

FAS/AFF/ONTAP

00.18: NETAPP X358_TPM4V3T8AME NA01 0.0GB 0B/sect (Failed-Unsupported)

00.19: NETAPP X358_TPM4V3T8AME NA01 3662.5GB 520B/sect (Z7D0A0Z7T0PE) (Failed)

10.1: NETAPP X357_TPM5V3T8ATE NA54 3662.5GB 520B/sect (29D0A00XTRXF) (Failed)

Storage system might fail to fully boot or RAID groups might be degraded.

Contact Technical Support to determine if your system is impacted by this condition.
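As a rough aid for reviewing drive listings like the FAS/AFF/ONTAP examples above, the following Python sketch extracts the entries whose status begins with "Failed". The line layout it expects is an assumption inferred only from the three example lines in this bulletin, and the names `DRIVE_LINE` and `find_failed_drives` are hypothetical. It is not a NetApp tool, and a match does not show that a failure was caused by extended power removal.

```python
import re

# Assumed layout, inferred only from the three example lines in this bulletin:
#   <drive id>: <vendor> <model> <firmware> <size>GB <bytes>B/sect [(<serial>)] (<status>)
DRIVE_LINE = re.compile(
    r"^(?P<drive>[\d.]+):\s+"
    r"(?P<vendor>\S+)\s+(?P<model>\S+)\s+(?P<firmware>\S+)\s+"
    r"(?P<size_gb>[\d.]+)GB\s+(?P<sector_bytes>\d+)B/sect"
    r"(?:\s+\((?P<serial>[^)]*)\))?"   # the serial number is absent in one example
    r"\s+\((?P<status>[^)]*)\)\s*$"
)


def find_failed_drives(lines):
    """Return one dict per line that matches the assumed layout and whose
    status starts with 'Failed' (covers 'Failed' and 'Failed-Unsupported')."""
    failed = []
    for line in lines:
        match = DRIVE_LINE.match(line.strip())
        if match and match.group("status").startswith("Failed"):
            failed.append(match.groupdict())
    return failed


if __name__ == "__main__":
    sample = [
        "00.18: NETAPP X358_TPM4V3T8AME NA01 0.0GB 0B/sect (Failed-Unsupported)",
        "00.19: NETAPP X358_TPM4V3T8AME NA01 3662.5GB 520B/sect (Z7D0A0Z7T0PE) (Failed)",
        "10.1: NETAPP X357_TPM5V3T8ATE NA54 3662.5GB 520B/sect (29D0A00XTRXF) (Failed)",
    ]
    for entry in find_failed_drives(sample):
        print(entry["drive"], entry["model"], entry["serial"], entry["status"])
```

Lines that do not fit the assumed layout are skipped rather than reported, so an empty result should not be read as a clean bill of health.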

Workaround

There is no workaround for this issue; it is common to all Enterprise SSDs. See the Solution section for NetApp's best practices.

Solution

NetApp strongly recommends the following three best practices for removing power from Enterprise SSDs. These practices are a conservative recommendation as environmental conditions vary:

  1. Always consider the criticality of ongoing and regular data protection against environmental events and natural disasters that can force a power off period of this length. See NetApp's guidance on Data Protection and Disaster Recovery for these options.
  2. If removing power from Enterprise SSDs for more than 14 days, have a recent full backup of the data.
  3. If removing power from Enterprise SSDs for more than one month, remove all data from the drives to avoid impact to SSD usability:
     FAS/AFF/ONTAP
     E-Series
       • Unmap all SSDs in the system prior to storage.
       • For secure erasure of encrypted (FDE) and non-encrypted drives, the Secure Erase tool can be used.
       • For SANtricity 8.71 and higher, you can use the procedure above. Remember to leave one drive not erased to re-run the procedure as noted.
     SolidFire
       • For non-critical data erasure, using the cluster UI, remove the drives from the node one at a time, each time allowing the slice to sync before removing the block drives.
       • If RTFI is an option, using RTFI will securely erase all drives in the system. See this guide for more details.
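The duration thresholds in the three best practices above can be sketched as a small planning helper. This is an illustration only, not a NetApp tool: the function name `power_off_actions` is hypothetical, and treating "one month" as 30 days and "3 months" as 90 days is an assumption made here for simplicity.

```python
from datetime import timedelta
from typing import List

BACKUP_THRESHOLD = timedelta(days=14)  # from best practice 2
ERASE_THRESHOLD = timedelta(days=30)   # "one month", approximated as 30 days (assumption)
JEDEC_RETENTION = timedelta(days=90)   # "3 months" at 40°C, approximated as 90 days (assumption)


def power_off_actions(planned_off: timedelta) -> List[str]:
    """Map a planned unpowered duration to the recommendations in this bulletin."""
    # Best practice 1 applies regardless of the planned duration.
    actions = ["Maintain regular data protection and disaster recovery coverage."]
    if planned_off > BACKUP_THRESHOLD:
        actions.append("Have a recent full backup of the data before removing power.")
    if planned_off > ERASE_THRESHOLD:
        actions.append("Remove all data from the drives before removing power.")
    if planned_off > JEDEC_RETENTION:
        actions.append(
            "Warning: the planned duration exceeds the 3-month retention figure "
            "cited from JESD218B.02; data might not be recoverable."
        )
    return actions


if __name__ == "__main__":
    for days in (7, 21, 45, 120):
        print(f"{days} days unpowered:")
        for action in power_off_actions(timedelta(days=days)):
            print("  -", action)
```

The comparisons are strict ("more than"), matching the wording of the best practices. Because retention also depends on drive wear and storage temperature, the thresholds should be read as planning prompts rather than guarantees.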

If you plan to reintroduce power to drives that have been powered off for an extended period, do so well before any planned production operations, to allow enough time to restore data from backup (if necessary) or to replace any failed or inoperable parts.

If the number of failures found when power is reintroduced exceeds expected failure rates, delivery of replacement parts may be delayed. Additionally, normal drive failures on production systems are prioritized. This scenario can be mitigated by following the above best practices.

Additional Information

In accordance with the Support Services terms, always update NetApp products with the latest version of firmware and software to provide the best reliability, availability, and serviceability:

Hot spare drives: To best maintain the continuous presence of hot spare drives available in the system, maintain the minimum recommended number of hot spares, and follow the standard drive replacement process if a drive fails.

Active IQ System Risk Detection:

For customers who have enabled AutoSupport on their storage systems, the Active IQ Portal provides detailed System Risk reports at the customer, site, and system levels. The reports show systems that have specific risks, as well as severity levels and mitigation action plans. Drives that are not running the latest firmware are an example of such a risk. Not upgrading to the most current drive firmware could leave the storage appliance vulnerable to undesirable behavior.

Important: The purpose of this communication is for NetApp to notify its installed base end users about urgent and important product information that may affect product performance or reliability. The information contained herein and the distribution lists are NetApp confidential materials that are subject to restrictions on redistribution and that cannot be shared outside of this e-mail distribution list.

***************************************************
*** NETAPP CONFIDENTIAL – FOR LIMITED USE ONLY ***
***************************************************