SU511: [Impact: Critical] HDD firmware (NE01) for WUH721818AL5204 causing potential data loss or disruption/unavailability
Last Updated: 2024/5/8 21:56:46
Summary
[Impact: Critical = Data loss and/or cluster data outage]
An issue has been identified with the NE01 firmware for E-X4146A and E-X4147A drives. As a precautionary measure, the NE01 firmware was removed from the NetApp Support Site on October 25th, 2022. If you downloaded the E-Series All Disk Firmware Bundle from the link below before that date and you have this drive, please download the bundle again. NE00 firmware is not impacted.
- NetApp has implemented a fix in drive firmware NE03; upgrading to it mitigates the issue.
- The updated firmware is available from the E/EF-Series Drive Firmware Download page on the NetApp Support site.
- Due to the nature of this issue, NetApp strongly recommends performing the upgrade from NE01 to NE03 in offline mode.
- No I/O or workload should occur on the system, volume, or DDP during the upgrade. This is a disruptive activity because I/O is stopped.
- Please see the Solution section for more details on this before upgrading.
Update the affected drives, listed below by part number and identification string, to drive firmware NE03 or later:
Part Number | Drive Identifier | Capacity | Minimum Firmware |
E-X4146A | WUH721818AL5204 | 18TB | NE03 |
E-X4147A | WUH721818AL5204 | 18TB | NE03 |
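As a minimal sketch, the check implied by the table can be applied to a drive inventory. The inventory format (identifier and firmware pairs) and the function name `needs_upgrade` are assumptions made for illustration only; they are not part of any NetApp tool. Only NE01 is flagged, because this bulletin names NE01 as the affected version and states that NE00 is not impacted.

```python
# Values taken from this bulletin; everything else here is hypothetical.
AFFECTED_DRIVE_ID = "WUH721818AL5204"
AFFECTED_FIRMWARE = "NE01"
FIXED_FIRMWARE = "NE03"


def needs_upgrade(drive_id: str, firmware: str) -> bool:
    """Return True if the drive matches the affected identifier and runs NE01."""
    return (
        drive_id.strip().upper() == AFFECTED_DRIVE_ID
        and firmware.strip().upper() == AFFECTED_FIRMWARE
    )


if __name__ == "__main__":
    # Hypothetical inventory, e.g. transcribed by hand from a storage array profile.
    inventory = [
        ("WUH721818AL5204", "NE01"),
        ("WUH721818AL5204", "NE00"),
        ("WUH721818AL5204", "NE03"),
    ]
    for drive_id, firmware in inventory:
        if needs_upgrade(drive_id, firmware):
            print(f"{drive_id} on {firmware}: upgrade to {FIXED_FIRMWARE} offline")
        else:
            print(f"{drive_id} on {firmware}: not flagged by this bulletin")
```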
Issue Description
Drives on firmware NE01 are at risk of failure due to a high number of illegal requests being reported after a disk firmware upgrade. Although this issue is very uncommon, it can lead to data loss, disruption, or unavailability if multiple drives fail simultaneously.
Drives running NE00 are not susceptible to this issue.
Symptom
The most frequently observed failure is a Major Event Log (MEL) event of type 100A with an Event Specific Code of 05/20/00, as shown below:
After Drive Firmware Upgrade to NE01:
- Event code: 06/29/04 - Power-On, Reset, or Bus Device Reset Occurred
- Event code: 05/20/00 - Illegal Request (Invalid Command Operation Code)
Date/Time: 8/26/22, 12:23:31 AM
Sequence number: 1168083
Event type: 100A
Event category: Error
Priority: Informational
Event needs attention: false
Event send alert: false
Event visibility: true
Description: Drive returned CHECK CONDITION
Event specific codes: 5/20/0
Component type: Drive
Component location: Shelf 1, Drawer 4, Bay 1
Logged by: Controller in bay B
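For anyone reviewing exported MEL text by hand, the following is a rough sketch of how an event in the "Field: value" layout shown above could be checked for this symptom. It assumes the event is available as plain text in exactly that layout and that the event-specific codes are hexadecimal sense key / ASC / ASCQ values, so that 5/20/0 and 05/20/00 compare equal. The function names are hypothetical.

```python
def parse_mel_event(text: str) -> dict:
    """Split 'Field: value' lines into a dict, using only the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def sense_triplet(codes: str) -> tuple:
    """Convert a string such as '5/20/0' into a tuple of ints (assumed hex)."""
    return tuple(int(part, 16) for part in codes.split("/"))


def is_illegal_request_event(fields: dict) -> bool:
    """True for a 100A event whose event-specific codes are 05/20/00."""
    if fields.get("Event type") != "100A":
        return False
    try:
        return sense_triplet(fields.get("Event specific codes", "")) == (0x05, 0x20, 0x00)
    except ValueError:
        # Missing or non-hex codes: treat as no match.
        return False


# Excerpt of the event shown in this bulletin.
SAMPLE_EVENT = """Date/Time: 8/26/22, 12:23:31 AM
Event type: 100A
Description: Drive returned CHECK CONDITION
Event specific codes: 5/20/0
Component location: Shelf 1, Drawer 4, Bay 1
"""

if __name__ == "__main__":
    event = parse_mel_event(SAMPLE_EVENT)
    if is_illegal_request_event(event):
        print("Matches NE01 symptom at", event.get("Component location"))
```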
If reviewing the drive error logs (controller-drive-error-event-log), this is the current signature of the problem:
- Target Reported "AbtCmd 2904"
- Followed by
- Target Reported "IllReq 2000"
- After retries are exhausted, this is expected to result in drive failure.
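The signature above can be searched for with a small state machine. This is only a sketch: the exact layout of the drive error log is not described in this bulletin, so the code assumes one text record per line and simply looks for the two quoted strings in order. It does not group records by drive, which a real review would probably need to do.

```python
ABORT_MARKER = "AbtCmd 2904"
ILLEGAL_REQUEST_MARKER = "IllReq 2000"


def count_signature(lines) -> int:
    """Count occurrences of 'AbtCmd 2904' later followed by 'IllReq 2000'."""
    armed = False  # True once an AbtCmd 2904 record has been seen
    hits = 0
    for line in lines:
        if ABORT_MARKER in line:
            armed = True
        elif armed and ILLEGAL_REQUEST_MARKER in line:
            hits += 1
            armed = False  # wait for the next AbtCmd 2904 before counting again
    return hits


if __name__ == "__main__":
    # Hypothetical log excerpt; real records contain additional fields.
    sample = [
        'Target Reported "AbtCmd 2904"',
        'Target Reported "IllReq 2000"',
    ]
    print("signature occurrences:", count_signature(sample))
```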
Solution
As noted in the Summary, the recommended approach is:
1. Use only the offline drive firmware upgrade tool.
2. Ensure that the entire system containing the NE01 drives to be upgraded to NE03 is in an Optimal state, with no media scan running, no reconstructions in progress, and all other system components functioning as expected.
3. Establish a maintenance window, or otherwise ensure that there is no active I/O or workload on any volume, DDP, or drive in the system.
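The preconditions in the steps above can be written down as a simple pre-upgrade checklist. The sketch below is illustrative only: the `SystemState` fields are hypothetical and would have to be filled in by the administrator from the storage management interface, since this bulletin does not describe any API for reading them.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    # All fields are hypothetical and entered manually by the operator.
    status: str                   # e.g. "Optimal"
    media_scan_running: bool
    reconstruction_running: bool
    active_io: bool


def upgrade_blockers(state: SystemState) -> list:
    """Return the reasons the offline NE01 to NE03 upgrade should not start yet."""
    blockers = []
    if state.status != "Optimal":
        blockers.append(f"system status is {state.status!r}, not 'Optimal'")
    if state.media_scan_running:
        blockers.append("a media scan is running")
    if state.reconstruction_running:
        blockers.append("a reconstruction is in progress")
    if state.active_io:
        blockers.append("I/O or workload is still active")
    return blockers


if __name__ == "__main__":
    state = SystemState("Optimal", False, False, True)
    for reason in upgrade_blockers(state):
        print("Do not upgrade yet:", reason)
```

An empty list means none of the listed preconditions is violated; it does not replace the checks performed by the upgrade tool itself.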
StorageGRID appliances (SG6060):
Follow the same steps as the "Upgrading SANtricity OS Software on the storage controllers using maintenance mode" procedure documented in the 11.6 SG6000 maintenance guide. The only difference is that a drive firmware upgrade, rather than a SANtricity upgrade, is performed while in maintenance mode.
Additional Information
See BUG 1499485
In accordance with the Support Services terms, always update NetApp products with the latest version of firmware and software to provide the best reliability, availability, and serviceability:
- Download drive firmware from the E/EF-Series Drive and Firmware Matrix.
- Upgrade instructions: Upgrading drive firmware.
- Upgrade instructions for StorageGRID appliances using maintenance mode: Upgrading SANtricity OS Software on the storage controllers using maintenance mode.
- The only difference is that a drive firmware upgrade, rather than a SANtricity upgrade, is performed while in maintenance mode.
Hot spare drives: To best maintain the continuous presence of hot spare drives available in the system, adhere to Hot Spares Best Practices and follow the standard drive replacement process if a drive fails.
Active IQ System Risk Detection: For customers who have enabled AutoSupport™ on their storage systems, the Active IQ Portal provides detailed System Risk reports at the customer, site, and system levels. The reports show systems that have specific risks, along with severity levels and mitigation action plans. Drives that are not running the latest firmware are an example of such a risk. Not upgrading to the most current drive firmware could leave the storage appliance vulnerable to undesirable behavior.
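Lenovo NetApp Technology Co., Ltd. ("Lenovo NetApp") makes no representations or warranties as to the accuracy, reliability, or serviceability of any information or recommendations provided on this page, or as to any results that may be obtained by using this information or following the recommendations provided herein. The information on this page is distributed as is. Using this information or implementing any of the recommendations or techniques herein is the customer's responsibility and depends on the customer's ability to evaluate the information and integrate it into the customer's operational environment. This page and the information it contains may be used only in connection with the NetApp products discussed on this page. In no event shall Lenovo NetApp be liable for any special, indirect, or consequential damages, or for any damages resulting from loss of use, data, or profits, whether in an action of contract, negligence, or other tort, arising out of or in connection with the use or performance of the information provided on this page.
For the latest information, refer to the support bulletins on the official NetApp website.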