SU490: [Impact: Critical] SSD Best Practices: Avoid risk of drive failure and data loss if powered off
- Last Updated:
- 2024/5/23 02:07:18
Summary
This bulletin discusses the best practices for long-term power removal from enterprise-class solid-state drives (SSD) to avoid impact from data loss and minimize risk of hardware failures.
This bulletin applies to any SSD, including those from NetApp. For a complete list of SSD part numbers (PNs), refer to the NetApp Hardware Universe (HWU) for drives:
- Select "Search by OS and Drive Type".
- Choose your OS.
- Select these Drive Types:
  - Capacity Flash NVMe SSD
  - NVMe SSD
  - SSD
- Click "Show Results"
Issue Description
NetApp is providing guidance on this issue in the Solution section of this article. The text below in our issue description is provided only as a courtesy on industry-standard SSD characteristics from the technology association listed. All of this testing was done on non-QLC drives.
SSDs based on NAND flash memory slowly leak charge if left without power for long periods. The JEDEC Solid State Technology Association standard for SSDs, JESD218B.02, shows powered-off data retention of 3 months at 40°C (104°F) for the "Enterprise" application class of SSD, measured in testing that writes at least as much data to the drive as the drive's capacity (for example, 15 TB of writes for a 15 TB drive).
Additional technical details are available in this presentation from JEDEC (no login required).
Based on this standard, should Enterprise SSDs remain unpowered for more than 3 months, the data might not be recoverable when power is reapplied. The amount of time until data is at risk varies greatly depending on the amount of wear on the SSDs and the temperature of the environment in which they are stored. Note that QLC-class media is expected to have survivability below previous media types such as TLC.
See the example under "Enterprise" from the same source, which shows how much retention varies with temperature alone.
[Table excerpted directly from JESD218B.02; not reproduced here.]
Data retention of powered-down SSD media varies by many factors, including the type of flash memory (e.g. TLC, MLC, QLC), amount of wear of the drive's useful life, and storage temperature of the SSDs. NetApp makes no guarantees, either implied or expressed, that data retention of powered-down SSD media will meet any specific extended duration.
NetApp recommends best practices that can protect customers who deploy Enterprise SSDs and plan to remove power for extended periods of time. NetApp's recommendations are conservative and favor pre-emptive risk mitigation before the 3-month mark.
Enterprise SSDs are considered unpowered when all power has been removed from the hosting shelf or enclosure. If the system has no other method of complete power removal, this includes disconnecting the power cables from their source.
Symptom
SSDs are reported as failed after being powered off for an extended period of time, as discussed in the Issue Description. These are general failure examples and are not universally indicative of a specific failure for this problem.
FAS/AFF/ONTAP |
00.18: NETAPP X358_TPM4V3T8AME NA01 0.0GB 0B/sect (Failed-Unsupported)
00.19: NETAPP X358_TPM4V3T8AME NA01 3662.5GB 520B/sect (Z7D0A0Z7T0PE) (Failed)
10.1: NETAPP X357_TPM5V3T8ATE NA54 3662.5GB 520B/sect (29D0A00XTRXF) (Failed) |
Storage system might fail to fully boot or RAID groups might be degraded.
Contact Technical Support to determine if your system is impacted by this condition.
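The example failure lines in this Symptom section can be scanned for automatically after power is restored. The following is a minimal sketch, not a NetApp tool: the pattern is inferred only from the three example lines above, the field names `location` and `model` are labels chosen here for illustration, and real output may differ.

```python
import re

# Pattern inferred from the bulletin's three example lines only (assumption):
# "<location>: <vendor> <model> ... (<status>)", with the status in the
# last parenthesized group on the line.
DRIVE_LINE = re.compile(
    r"^(?P<location>\d+\.\d+):\s+\S+\s+(?P<model>\S+)\s.*\((?P<status>[^()]+)\)\s*$"
)


def failed_drives(text: str) -> list[dict[str, str]]:
    """Return one dict per line whose final status starts with 'Failed'."""
    failed = []
    for line in text.splitlines():
        match = DRIVE_LINE.match(line.strip())
        if match and match.group("status").startswith("Failed"):
            failed.append(match.groupdict())
    return failed


# Sample input copied from the bulletin's example output.
SAMPLE = """\
00.18: NETAPP X358_TPM4V3T8AME NA01 0.0GB 0B/sect (Failed-Unsupported)
00.19: NETAPP X358_TPM4V3T8AME NA01 3662.5GB 520B/sect (Z7D0A0Z7T0PE) (Failed)
10.1: NETAPP X357_TPM5V3T8ATE NA54 3662.5GB 520B/sect (29D0A00XTRXF) (Failed)
"""

if __name__ == "__main__":
    for drive in failed_drives(SAMPLE):
        print(drive["location"], drive["model"], drive["status"])
```

A match from a script like this is only a prompt to investigate; as stated above, contact Technical Support to determine whether the system is affected by this condition.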
Workaround
There is no workaround for this issue - it is common to all Enterprise SSDs. See the Solution section for NetApp's best practices.
Solution
NetApp strongly recommends the following three best practices for removing power from Enterprise SSDs. These practices are a conservative recommendation as environmental conditions vary:
- Always consider the criticality of ongoing and regular data protection against environmental events and natural disasters that can force a power-off period of this length. See NetApp's guidance on Data Protection and Disaster Recovery for these options.
- If removing power from Enterprise SSDs for greater than 14 days, have a recent full backup of data.
- If removing power from Enterprise SSDs for greater than one month, remove all data from the drives to avoid impact to SSD usability:
FAS/AFF/ONTAP |
|
E-Series |
|
SolidFire |
|
When planning to restore power to drives that have been powered off for an extended period, do so well before any planned production operations, so that there is enough time to restore data from backup (if necessary) or replace any failed or inoperable parts.
Excessive failures found on power reintroduction may result in delays in part delivery if they exceed expected failure rates. Additionally, normal drive failures on production systems are prioritized. This scenario can be mitigated by following the above best practices.
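The duration thresholds in the best practices above can be expressed as a small planning helper. This is an illustrative sketch only: the function names are hypothetical, and treating "one month" as 30 days is an assumption made here, not a figure from this bulletin.

```python
from datetime import date, timedelta

# Thresholds from the Solution section. Treating "one month" as 30 days
# is an assumption for illustration.
BACKUP_AFTER_DAYS = 14
REMOVE_DATA_AFTER_DAYS = 30


def power_off_actions(planned_days: int) -> list[str]:
    """List the bulletin's recommended actions for a planned power-off."""
    actions = ["Maintain regular data protection and disaster recovery"]
    if planned_days > BACKUP_AFTER_DAYS:
        actions.append("Have a recent full backup of data")
    if planned_days > REMOVE_DATA_AFTER_DAYS:
        actions.append("Remove all data from the drives")
    return actions


def latest_power_on(power_off: date,
                    limit_days: int = REMOVE_DATA_AFTER_DAYS) -> date:
    """Last date to restore power while staying within limit_days."""
    return power_off + timedelta(days=limit_days)


if __name__ == "__main__":
    for days in (7, 21, 60):
        print(days, power_off_actions(days))
    print(latest_power_on(date(2024, 1, 1)))
```

The comparisons are strict (`>`) to mirror the bulletin's "greater than" wording. Because retention varies with drive wear and storage temperature, a cautious operator may prefer lower limits than the ones encoded here.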
Additional Information
- How to perform graceful shutdown and power up of all ONTAP nodes in a cluster
- What is the procedure for graceful shutdown and power up of a storage system during scheduled power outage
In accordance with the Support Services terms, always update NetApp products with the latest version of firmware and software to provide the best reliability, availability, and serviceability:
- ONTAP Recommended Releases.
- Drive and Shelf firmware, by default, is typically updated automatically in the background with no disruption to client access:
  - Find shelf and drive FW versions.
  - Drive firmware download page.
  - Shelf firmware download page.
- DQP (Disk Qualification Package): In order for your systems to recognize and utilize newly qualified drives, ensure the latest DQP is installed.
Hot spare drives: To best maintain the continuous presence of hot spare drives available in the system, maintain the minimum recommended number of hot spares, and follow the standard drive replacement process if a drive fails.
Active IQ System Risk Detection:
For customers who have enabled AutoSupport™ on their storage systems, the Active IQ Portal provides detailed System Risk reports at the customer, site, and system levels. The reports show systems that have specific risks, along with severity levels and mitigation action plans. Drives that are not running the latest firmware are an example of such a risk. Not upgrading to the most current drive firmware could leave the storage appliance vulnerable to undesirable behavior.
Important: The purpose of this communication is for NetApp to notify its installed base end users about urgent and important product information that may affect product performance or reliability. The information contained herein and the distribution lists are NetApp confidential materials that are subject to restrictions on redistribution and that cannot be shared outside of this e-mail distribution list.
***************************************************
*** NETAPP CONFIDENTIAL – FOR LIMITED USE ONLY ***
***************************************************
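联想凌拓科技有限公司 ("Lenovo NetApp") makes no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided on this page, or regarding any results that may be obtained by using this information or following the recommendations provided herein. The information on this page is distributed as is. Using this information or implementing any of the recommendations or techniques herein is the customer's responsibility and depends on the customer's ability to evaluate the information and integrate it into the customer's operational environment. This page and the information it contains may be used only in connection with the NetApp products discussed on this page. In no event shall Lenovo NetApp be liable for any special, indirect, or consequential damages relating to or resulting from the use or performance of the information provided on this page, or for any damages resulting from loss of use, data, or profits (whether or not in the performance of a contract), negligence, or other tortious action.
For the latest information, refer to the support bulletins on the official NetApp website.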