SU539: [Impact Critical] Deswizzler scan might cause system controller disruption when issuing cloud reads to a FabricPool capacity tier.

Last Updated:
8/29/2023, 12:57:38 AM

Summary

[Impact Critical: Possible cluster data outage]

A bug in specific versions of ONTAP can cause a controller disruption and a possible cluster data outage. The issue occurs when the cluster is configured to use FabricPool and certain workflows trigger a file system scanner called the deswizzler while the S3 data store hosting the FabricPool capacity tier is unavailable.

Issue Description

This issue can occur on an ONTAP cluster under the following conditions:

  • Cluster is running one of the following ONTAP versions:
    • ONTAP 9.8x: 9.8P17, 9.8P18, 9.8P19
    • ONTAP 9.9.1x: 9.9.1P14, 9.9.1P15, 9.9.1P16
    • ONTAP 9.10.1x: 9.10.1P11, 9.10.1P12, 9.10.1P13
  • Cluster is configured to utilize NetApp's FabricPool technology.
  • A WAFL background scanner (known as the deswizzler scanner) runs against the FabricPool volume because one of the following workflows is used with that volume:
    • SnapMirror DP (BRE)
    • Vol Move
    • Clone Split
  • The FabricPool volume has data tiered to a capacity tier that is located on a remotely hosted S3 data store.
  • The S3 data store hosting the capacity tier is unavailable while the deswizzler scanner is attempting to run against the FabricPool volume.
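As a quick triage aid, the release condition above can be checked with a short script. This is a minimal sketch, not a NetApp tool: the function name is hypothetical, and it assumes the version is supplied as a bare string such as "9.8P17". It covers only the release condition; the FabricPool, workflow, and S3 availability conditions still have to be verified separately.

```python
# Affected patch releases, copied from the list in this bulletin.
AFFECTED_RELEASES = {
    "9.8P17", "9.8P18", "9.8P19",
    "9.9.1P14", "9.9.1P15", "9.9.1P16",
    "9.10.1P11", "9.10.1P12", "9.10.1P13",
}


def is_affected_release(version: str) -> bool:
    """Return True if `version` names one of the affected ONTAP releases.

    Assumption: `version` is a bare release string such as "9.8P17".
    Surrounding whitespace and letter case are ignored.
    """
    return version.strip().upper() in AFFECTED_RELEASES


if __name__ == "__main__":
    for candidate in ("9.8P17", "9.8P20", "9.10.1P12"):
        print(candidate, is_affected_release(candidate))
```

A True result means only that the cluster is running an affected release, not that the issue will occur.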

Symptom

If encountered, this issue will result in a controller disruption with a panic string similar to the following being reported:

PANIC: Cannot run exception without flag set! message type=WAFL_SCAN_DESWIZZLE flags=1 in SK process wafl_exempt<nn> on release 9.x.yPx (C)
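When searching collected logs for this signature, a loose pattern match is usually safer than an exact string comparison, because the process suffix and release fields vary. The following is a minimal sketch under that assumption; the function name and the sample log line are hypothetical, and the pattern keys only on the two fixed parts of the panic string shown above.

```python
import re

# Match only the fixed parts of the panic string from this bulletin.
# The wafl_exempt<nn> suffix and the release string vary, so they are
# deliberately not part of the pattern.
PANIC_PATTERN = re.compile(
    r"PANIC: Cannot run exception without flag set!"
    r".*message type=WAFL_SCAN_DESWIZZLE"
)


def looks_like_deswizzler_panic(log_line: str) -> bool:
    """Return True if the line resembles the panic string in this bulletin."""
    return PANIC_PATTERN.search(log_line) is not None


if __name__ == "__main__":
    # Hypothetical sample line, built from the template in the bulletin.
    sample = (
        "PANIC: Cannot run exception without flag set! "
        "message type=WAFL_SCAN_DESWIZZLE flags=1 in SK process "
        "wafl_exempt03 on release 9.8P17 (C)"
    )
    print(looks_like_deswizzler_panic(sample))
```

A match indicates a similar panic string only; it does not by itself confirm that the disruption was caused by Bug ID 1314405.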

Workaround

Disable the WAFL deswizzler scanner

  • Cluster::> node run -node [node name] "options wafl.deswizzle.enable off"

Note that disabling the WAFL deswizzler scanner may result in some performance impact.

After upgrading to a release with the fix for Bug ID 1314405, the WAFL deswizzler scanner should be re-enabled.

  • Cluster::> node run -node [node name] "options wafl.deswizzle.enable on"
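On clusters with several nodes, the command lines above can be generated per node to avoid typing mistakes. This is a minimal sketch that only builds the text of the commands shown in this bulletin; it does not connect to a cluster or run anything. The function name and the node names in the example are hypothetical.

```python
def deswizzle_option_command(node_name: str, enable: bool) -> str:
    """Build the command line from this bulletin for a single node.

    `node_name` replaces the [node name] placeholder. The cluster prompt
    is not included in the returned string.
    """
    state = "on" if enable else "off"
    return f'node run -node {node_name} "options wafl.deswizzle.enable {state}"'


if __name__ == "__main__":
    # Hypothetical node names; substitute the names used in your cluster.
    for node in ("node-01", "node-02"):
        print(deswizzle_option_command(node, enable=False))
```

Calling the function with enable=True produces the re-enable command for use after upgrading to a release that contains the fix.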

Solution

Upgrade to a release of ONTAP where Bug ID 1314405 is fixed:

  • For 9.8x, 9.8P20 will be the first release with the fix for Bug ID 1314405
  • For 9.9.1x, 9.9.1P17 will be the first release with the fix for Bug ID 1314405
  • For 9.10.1x, 9.10.1P14 is the first release with the fix for Bug ID 1314405
  • In addition, the fix for Bug ID 1314405 is included in ONTAP 9.11.1x and all subsequent releases.
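The fix levels listed above can be expressed as a small lookup for upgrade planning. This is a minimal sketch with a hypothetical function name. It assumes release strings of the form "9.8P20" or "9.12.1" (no RC or other suffixes), and it returns None for release families this bulletin does not mention rather than guessing.

```python
import re
from typing import Optional

# First patch level containing the fix for Bug ID 1314405, per this bulletin.
FIRST_FIXED_PATCH = {"9.8": 20, "9.9.1": 17, "9.10.1": 14}


def includes_fix(version: str) -> Optional[bool]:
    """Return True/False if the bulletin determines the answer, else None.

    Assumption: `version` looks like "9.8P20" or "9.12.1". Other formats
    raise ValueError.
    """
    match = re.fullmatch(r"(\d+\.\d+(?:\.\d+)?)(?:P(\d+))?", version.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized release string: {version!r}")
    family = match.group(1)
    patch = int(match.group(2) or 0)
    if family in FIRST_FIXED_PATCH:
        return patch >= FIRST_FIXED_PATCH[family]
    # The bulletin states that 9.11.1x and all subsequent releases include the fix.
    if tuple(int(part) for part in family.split(".")) >= (9, 11, 1):
        return True
    return None  # release family not covered by this bulletin


if __name__ == "__main__":
    for candidate in ("9.8P19", "9.8P20", "9.12.1", "9.7P5"):
        print(candidate, includes_fix(candidate))
```

A False result means the release lacks the fix, not that it is affected: patch levels earlier than those in the affected list return False even though the bulletin does not list them as exposed.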

Additional Information

Public Report

The following document contains more information:

FabricPool Best Practices Technical Report

Active IQ System Risk Detection:

For customers who have enabled AutoSupport on their storage systems, the Active IQ Portal provides detailed System Risk reports at the customer, site, and system levels. The reports show which systems have specific risks, along with severity levels and mitigation action plans.

Important: The purpose of this communication is for NetApp to notify its installed base end users about urgent and important product information that may affect product performance or reliability. The information contained herein and the distribution lists are NetApp confidential materials that are subject to restrictions on redistribution and that cannot be shared outside of this e-mail distribution list.

***************************************************
*** NETAPP CONFIDENTIAL – FOR LIMITED USE ONLY ***
***************************************************