SU464: System has drives with over 6 years of operating life - possible impact to system resilience

Last Updated:
7/28/2023, 2:14:12 PM

Applies To:

All NetApp storage systems that store data on performance or capacity hard disk drives

Value of reviewing this information:

NetApp® on-premises storage solutions with external shelves support a level of flexibility that allows customers to repurpose their storage shelves and associated drives when new technology becomes available. This flexibility is a benefit to your business and IT operations, but it also requires an understanding of storage resilience and the operational age of drives, along with potential implications for long-term use. Your storage system contains performance or capacity hard disk drives (HDDs) with at least 6 years of power-on hours (POH), which exceeds the manufacturer’s drive warranty and expected enterprise HDD reliable lifespan by 1 or more years.

Although these aging drives have not failed yet and therefore are not eligible for replacement under hardware entitlements, NetApp highly recommends retiring drives that are older than 5 years. This is especially true for drives that are hosting a data tier where a higher risk of impact to data availability or data integrity is unacceptable. If these aged drives have not reached end of support (EOS), they could still be used for applications where the storage does not host business-critical data.

Enterprise HDD manufacturers provide a warranty to OEM suppliers of 5 years (43,800 hours). This 5-year period represents the expected reliable lifespan of the drive, as well as the period during which manufacturers will test and assess failure rates and release firmware updates to address known field issues. The likelihood that a drive will fail increases significantly the longer the drive stays in use beyond its warranty period. As noted on the NetApp Support Site, from a storage system perspective, NetApp provides support for at least 5 years for most of its products. Therefore, customers have the ability (up to the shelf or drive EOS) to move storage as platforms are refreshed, but there is exposure to increased risk of impacts when the drives are used for extended periods.

Don't ONTAP® and other operating systems such as SANtricity® have RAID and other features that prevent the impact of drive failures?

NetApp RAID-DP® and RAID-TEC® offer fault tolerance of two or three failed data drives, respectively. However, configurations in which RAID groups are highly populated with drives that have exceeded their warranty period are at higher risk of encountering multiple drive failures than those with data and parity drives within the warranty period. When attempting to maximize storage subsystem component reliability for business-critical data and applications, it is important to replace or retire hardware components that have exceeded their warranty period.
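
The effect of per-drive failure likelihood on a RAID group can be illustrated with a rough sketch. It assumes that drive failures are independent and that every drive in the group has the same probability of failing during some exposure window; both are simplifications. The group size and the probabilities below are hypothetical placeholders chosen for illustration, not NetApp data, and the function name is invented for this example.

```python
from math import comb


def prob_exceeds_tolerance(n_drives, p_fail, tolerated):
    """Probability that more than `tolerated` of `n_drives` fail in one window.

    Simplified binomial model: failures are independent and each drive
    fails with the same probability `p_fail` during the window.
    """
    p_within_tolerance = sum(
        comb(n_drives, i) * p_fail**i * (1 - p_fail) ** (n_drives - i)
        for i in range(tolerated + 1)
    )
    return 1 - p_within_tolerance


# Hypothetical values for illustration only.
GROUP_SIZE = 20          # drives in one RAID group
P_IN_WARRANTY = 0.001    # assumed per-drive failure probability in the window
P_AGED = 0.002           # assumed doubled probability for aged drives

for label, p in (("in warranty", P_IN_WARRANTY), ("aged", P_AGED)):
    dp = prob_exceeds_tolerance(GROUP_SIZE, p, tolerated=2)   # RAID-DP
    tec = prob_exceeds_tolerance(GROUP_SIZE, p, tolerated=3)  # RAID-TEC
    print(f"{label}: RAID-DP {dp:.3e}, RAID-TEC {tec:.3e}")
```

Under this model, doubling the per-drive probability raises the chance of exceeding the RAID protection level by much more than a factor of two, because several drives must fail together. Real drive failures are often correlated, so the sketch should be read as an illustration of the trend only.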

NetApp will not proactively replace drives that are at higher risk from extended use beyond the expected enterprise HDD reliable lifespan. This is because the drives have not failed. They have simply been in operation long enough to no longer be as reliable as drives that are still within the enterprise HDD reliable lifetime. As with any other NetApp hardware, component replacement occurs only when hardware has failed. To learn more, see NetApp Support Services Terms and Conditions.

Why does maintaining data on an enterprise HDD until the end of support milestone create increased risk?

The EOS milestone is what determines whether existing entitlements for a storage system extend to the storage subsystem. EOS milestones for specific drives can be found in the EOA Storage page (End of Hardware Support). Once the EOS milestone has passed for a given hardware part, it will not be covered by any hardware replacement entitlements. For example, an EOS HDD or SSD will not be replaced if a drive fails, even if the system has hardware replacement entitlements. NetApp drive part numbers are available to buy for multiple years, leveraging multiple source vendors—potentially longer than the warranty period of any specific drive. When the part has reached the end of availability (EOA) milestone, it is no longer sold; NetApp support entitlements are available for 5 additional years. Because the post-EOA support period is 5 years, it is possible to continue to renew support entitlements on HDD parts that might be running for up to 11 years (5+ years older than the reliable lifespan of a drive). This is especially true if the drive has been migrated from an older to a newer storage system.

A NetApp support entitlement can be used to ensure hardware support if a failure occurs. However, continued replacement part availability beyond the EOA milestone should not be interpreted as a guarantee of storage subsystem resilience equivalent to that of drives that are within their current operating lifespan (the first 43,800 power-on hours), nor is it a guarantee of comparable drive failure rates. Older drives might encounter issues that were never seen in the model during its initial 5-year run time and can begin to experience more frequent failures.

Data stored on older drives should be assessed to determine whether the value of avoiding new capital expense outweighs the increased risk to availability or the increased likelihood of having to restore from backup to meet established RTOs and RPOs. For example, nonproduction data or possibly non-business-critical data might meet these criteria.

To learn more about storage subsystem resilience, refer to Technical Report TR-3437 - Storage Subsystem Resiliency Guide.

How is the Active IQ wellness check validated?

The power-on hours (POH) of disk drives are checked by Active IQ®. There are 8,760 hours in a year, so 6 years is 52,560 hours. Values greater than 52,560 trigger a warning in NetApp Active IQ. Active IQ determines drive power-on hours from the STORAGE-DISK section collected via AutoSupport. Power-on hours are tracked for each drive.

To show disk power-on hours, use the following ONTAP command:

cluster::> disk show -power-on-hours >52560 -fields power-on-hours
disk    power-on-hours
------- --------------
1.11.22 55802
1.12.8  70206
1.12.14 58766
1.12.15 58018
1.12.23 57499
2.21.5  56714
2.21.6  57501
2.21.12 56210
2.21.14 56050
2.21.18 58301
2.22.1  58023
11 entries were displayed.
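
The threshold arithmetic and the filtering shown above can be reproduced offline, for example when reviewing saved command output. The following is a minimal sketch, assuming the two-column text layout shown in the sample output; the function names are hypothetical and are not part of ONTAP or Active IQ.

```python
HOURS_PER_YEAR = 8_760                        # 365 days x 24 hours
WARN_YEARS = 6
WARN_THRESHOLD = HOURS_PER_YEAR * WARN_YEARS  # 52,560 hours


def parse_power_on_hours(cli_output):
    """Return {disk_name: hours} from two-column 'disk power-on-hours' text."""
    drives = {}
    for line in cli_output.splitlines():
        parts = line.split()
        # Data rows have exactly two fields: a disk name and an integer.
        # The prompt, header, separator, and summary lines do not match.
        if len(parts) == 2 and parts[1].isdigit():
            drives[parts[0]] = int(parts[1])
    return drives


def years_powered_on(hours):
    """Convert power-on hours to years."""
    return hours / HOURS_PER_YEAR


# Two rows copied from the sample output, plus one hypothetical younger drive.
sample = """\
disk    power-on-hours
------- --------------
1.11.22 55802
1.12.8  70206
9.99.99 43000
3 entries were displayed.
"""

drives = parse_power_on_hours(sample)
flagged = {disk: h for disk, h in drives.items() if h > WARN_THRESHOLD}

for disk, hours in sorted(flagged.items(), key=lambda kv: -kv[1]):
    print(f"{disk}: {hours} hours, about {years_powered_on(hours):.1f} years")
```

The comparison uses a strict "greater than", matching the statement that values greater than 52,560 trigger the warning.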

In some versions of ONTAP, BSAS/SATA drives roll their power-on hours when the drives exceed 65,535 power-on hours. This means that it might be possible to see 4 years of power-on hours on HDDs that have been spinning for more than 11 years. The wellness check has additional part-related logic to compensate for this specific reporting issue.
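
The rollover can be illustrated with a short sketch. It assumes the counter wraps like a 16-bit unsigned value, so that the reported figure equals the true figure modulo 65,536; that is an inference from the 65,535 limit stated above. The part-related logic used by the wellness check is not described here, so the functions below (hypothetical names) only show why a reported value alone is ambiguous.

```python
ROLLOVER = 65_536        # assumption: counter holds values 0..65,535, then wraps
HOURS_PER_YEAR = 8_760


def reported_hours(actual_hours):
    """Value a wrapping counter would report for the true power-on hours."""
    return actual_hours % ROLLOVER


def candidate_actual_hours(reported, max_wraps=1):
    """All true values consistent with a reported value, up to max_wraps wraps."""
    return [reported + n * ROLLOVER for n in range(max_wraps + 1)]


# A drive that has been powered on for 11.5 years (100,740 hours).
actual = 100_740
shown = reported_hours(actual)

print(f"reported: {shown} hours, about {shown / HOURS_PER_YEAR:.1f} years")
for hours in candidate_actual_hours(shown):
    print(f"possible true value: {hours} hours, about {hours / HOURS_PER_YEAR:.1f} years")
```

In this example the drive reports roughly 4 years of power-on hours even though it has been running for more than 11, which is why additional information about the part is needed to tell the two cases apart.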

What should I do about the information provided by this Active IQ wellness check?

Make sure that the increased risk to storage subsystem resilience is acceptable for the data that is stored on the HDDs. If it is not acceptable, consider refreshing the storage to ensure that the data is hosted on HDDs or SSDs that are within their warranty period.

If a refresh is not possible:

  1. Ensure that a backup and potential DR solution such as SnapMirror or SnapVault is in place.
  2. Consider converting RAID groups to RAID-TEC for additional resiliency.
  3. In general, minimize “power off” events when HDD life greatly exceeds 5 years.
  4. Consider running in DR if expecting prolonged operation in excessively high temperatures.
  5. Ensure sufficient spares.
  6. Promptly replace failed drives.
  7. If using NAS protocols, consider protecting your SVM root volumes by configuring root volume protection.
  8. Ensure that shelf firmware, drive firmware, and storage OS revision align to the latest recommendation.

Drive firmware updates associated with a NetApp support bulletin should always be treated as high priority and should be applied as soon as possible, even in the “keep the lights on” scenarios that are sometimes applied to aging systems. Delayed application of firmware that provides early detection mechanisms can increase the risk specific to issues the firmware update addresses.

To enable easier, low-effort drive firmware updates, ONTAP 9.10.1 has been enhanced with a feature to automatically download and apply drive firmware.

Note: The Active IQ wellness risk for power-on hours cannot be resolved by calling NetApp and requesting that the drives be replaced as part of your support contract. If a drive is covered by the appropriate support entitlement and has not reached end of support, it will be replaced only upon failure.

If you have any technical questions about this risk, contact NetApp Support. For questions specific to storage refresh, engage your sales representative.

FAQ

Does this risk mean that my HDD is not good or can only be used for 5 years?

No. NetApp customers are welcome to use and entitle hardware for as long as the hardware is supported. This risk simply means that the HDD has exceeded the normally advisable operational lifetime. Benefits of continued use are additional value and postponement of new capital expense. However, there is a tradeoff of increased risk associated with the operational age of the HDD. Most customers do not keep their storage for this many years. There are many benefits to a refresh, including power and space efficiency, performance improvements, and other technology improvements, which make the TCO of purchasing new storage more compelling.

Can NetApp Technical Support review my drives and determine if they will fail in the future?

No. If the drives are being flagged for this risk, they are 6 years of age or older. They are simply at higher risk than newer drives, although any specific HDD may continue to run for many years. The flag is to help you make sure that you have the right resilience for your data.

What sort of issues might be seen on these older drives?

Issues that have never been seen in the HDD family during its normal lifetime might be seen, including a higher-than-expected failure rate.

How can I tell the age of a drive?

See “How is the Active IQ wellness check validated?” earlier in this document.

Will NetApp Support replace my older drives based on this risk?

No. These drives have not failed. Drives that do fail during normal operation will be replaced in accordance with existing support entitlements. If older drives are not desirable, contact your NetApp reseller or account team about purchasing newer storage.

Does NetApp have any data specific to increasing failure rates on HDDs in operation for time periods in excess of 5 years?

Yes. NetApp field reliability data has demonstrated higher failure rates for older drives. On average, HDDs that have exceeded the 5-year warranty period show an ARR that is more than twice as high as that of HDDs within the 5-year warranty period.

Does the guidance in this bulletin apply to SSDs?

SSD reliability is influenced more by factors such as workload, rated life, flash type, and prompt, disciplined firmware adoption, according to the findings of this study: https://www.usenix.org/system/files/fast20-maneas.pdf. The study found a minimal increase in failure rate near the end of reliable lifetime (5 years). However, the study period did not exceed 5 years, and it is not advisable to run hardware well beyond the supplier’s warranty period.