SU536: [Impact Critical]: System disruption occurs on FAS systems with HDDs due to medium errors or recovered errors when running 9.12.1 versions prior to 9.12.1P4

Last Updated: 7/7/2023, 12:38:24 PM

Summary

[Impact Critical: Possible cluster data outage]

  • A software defect in ONTAP 9.12.1 can cause a system disruption on FAS systems with HDDs. The disruption occurs when data is read from a fragmented file system and one or more disks in the aggregate being read return medium errors or recovered errors during the read.
  • The issue is fixed in ONTAP 9.12.1P4.
  • Customers running FAS systems with HDDs on any ONTAP 9.12.1 version earlier than 9.12.1P4 are strongly advised to upgrade those systems to ONTAP 9.12.1P4 to avoid a possible system disruption caused by disk medium errors or recovered errors.
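
The exposure rule in the summary above can be expressed as a small check. This is only a sketch: it assumes version strings of the form `9.12.1` or `9.12.1P<n>` (other forms, such as release candidates, are rejected rather than guessed at), and the function name `is_affected` and the `has_hdds` flag are hypothetical names introduced for illustration, not part of any NetApp tool.

```python
import re

# Assumed format: "<major>.<minor>.<micro>" with an optional "P<n>" patch suffix.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:P(\d+))?$")


def is_affected(version: str, has_hdds: bool = True) -> bool:
    """Return True if a FAS system with HDDs runs 9.12.1 earlier than 9.12.1P4.

    Hypothetical helper: only the rule stated in this bulletin is encoded.
    """
    match = _VERSION_RE.match(version.strip())
    if match is None:
        # Unknown format: refuse to guess instead of reporting "not affected".
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(match.group(i)) for i in (1, 2, 3))
    patch = int(match.group(4) or 0)  # a release with no P suffix counts as patch 0
    return has_hdds and release == (9, 12, 1) and patch < 4


if __name__ == "__main__":
    for v in ("9.12.1", "9.12.1P3", "9.12.1P4", "9.11.1P8"):
        print(v, is_affected(v))
```

The check deliberately treats only the 9.12.1 family as affected, because that is the only release this bulletin names.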

Issue Description

In a fragmented file system, ONTAP pads discontiguous read I/O blocks with dummy blocks so that a read can be issued as a single I/O rather than split into multiple I/Os, which improves performance. Because of a defect in ONTAP 9.12.1, if a disk involved in an aggregate data read from a fragmented file system returns medium errors or recovered errors on the dummy blocks, the ONTAP RAID layer does not repair those errors, and the same read I/Os are retried in a loop. As a result, a system disruption occurs.
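
To make the padding idea concrete, here is a toy illustration. It is not ONTAP code and makes no claim about how WAFL or the RAID layer is implemented; the function name `plan_padded_read` and the block numbers are hypothetical. It only shows how a set of discontiguous blocks can be covered by one contiguous read, with the gaps filled by dummy blocks.

```python
def plan_padded_read(wanted_blocks):
    """Cover discontiguous blocks with one contiguous read.

    Returns (start_block, block_count, dummy_blocks), where dummy_blocks are
    the gap blocks that are read only to keep the I/O contiguous.
    """
    if not wanted_blocks:
        raise ValueError("at least one block is required")
    wanted = set(wanted_blocks)
    start, end = min(wanted), max(wanted)
    # Every block in the covered range that was not requested is padding.
    dummy = [b for b in range(start, end + 1) if b not in wanted]
    return start, end - start + 1, dummy


if __name__ == "__main__":
    start, count, dummy = plan_padded_read([100, 101, 104, 105])
    print(start, count, dummy)  # 100 6 [102, 103]
```

In this picture, blocks 102 and 103 carry no requested data, yet a disk error reported on them still belongs to the single read. That is the situation the defect description above refers to.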

Note: As of the time of writing, this issue has only been reported on FAS systems with HDDs.

Symptom

The affected node panics with a panic string similar to the following:
WAFL hung for <aggregate name>. in SK process wafl_exempt<nn> on release 9.12.1

Before the panic, the storage appliance reports media errors for one or more drives associated with the aggregate.
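
When reviewing collected log text for this symptom, a pattern match on the panic string can help. The sketch below assumes the string follows the template shown above closely; real panic strings are only described as "similar", so the pattern may need loosening. The sample line, the aggregate name `aggr1`, and the function name `find_wafl_hung_panic` are made up for illustration.

```python
import re

# Pattern derived from the template in this bulletin (an assumption, not a spec).
PANIC_RE = re.compile(
    r"WAFL hung for (?P<aggregate>\S+?)\. in SK process "
    r"wafl_exempt(?P<thread>\d+) on release (?P<release>\S+)"
)


def find_wafl_hung_panic(log_text: str):
    """Return a dict with aggregate, thread, and release, or None if no match."""
    match = PANIC_RE.search(log_text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Hypothetical sample line, not taken from a real system.
    sample = "PANIC: WAFL hung for aggr1. in SK process wafl_exempt03 on release 9.12.1P2"
    print(find_wafl_hung_panic(sample))
```

A match on its own does not confirm this defect; it should be read together with the drive media errors reported before the panic.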

Workaround

None. However, replacing the drive(s) reporting errors will prevent further issues caused by continued errors on those drives.

Solution

Upgrade to ONTAP 9.12.1P4 (or later as available).

Additional Information

BUG ID 1524092