SU399: [Impact: Critical] Controller disruption and data loss might occur on volumes with systems running ONTAP 9.5 or later

Last Updated:
4/7/2022, 3:55:38 AM


Summary

[Impact: Critical – Potential Data Loss/Corruption]

User data loss or corruption is possible on volumes hosted on systems running ONTAP 9.5 or later if FabricPool and copy-offload are both in use. In some cases, this data corruption can also lead to a controller disruption.

Issue Description

User data loss or corruption is possible in volumes hosted on systems running ONTAP 9.5 or later when all of the following are true: compression and deduplication are enabled on the volume (on by default for AFF), FabricPool is in use, and clients access the data using sub-file clone operations (Microsoft Offloaded Data Transfer (ODX) or VMware vStorage APIs for Array Integration (VAAI) XCOPY). In some cases, this data corruption can also lead to a controller disruption.

This issue causes corruption only when the data being accessed via Microsoft ODX or VMware VAAI XCOPY has been tiered out to the capacity tier by FabricPool.
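The exposure conditions above can be sketched as a simple predicate. This is an illustrative checklist only: `VolumeState` and `is_exposed` are hypothetical names, not an ONTAP API, and the check assumes the system is not yet running a release that contains the fix.

```python
from dataclasses import dataclass


@dataclass
class VolumeState:
    """Hypothetical summary of one volume; the fields mirror the conditions in this bulletin."""
    ontap_version: tuple              # (major, minor), for example (9, 5)
    compression_and_dedup_enabled: bool
    fabricpool_in_use: bool
    copy_offload_in_use: bool         # Microsoft ODX or VMware VAAI XCOPY


def is_exposed(volume: VolumeState) -> bool:
    """Return True only when every exposure condition in the bulletin holds."""
    return (
        volume.ontap_version >= (9, 5)
        and volume.compression_and_dedup_enabled
        and volume.fabricpool_in_use
        and volume.copy_offload_in_use
    )
```

A volume that fails any one of these conditions is not exposed according to the description above.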

The sub-file clone feature is used for copy-offload in the case of an intra-volume copy or VM clone, and it is initiated by the client application. There are two callers of sub-file clone copy-offload in ONTAP.

  1. VAAI XCOPY: Invoked by VMware VAAI VM clone or copy operations. If both the source and destination VM of the clone or copy are in the same volume, ONTAP uses sub-file clones for copy-offload. Only applicable to VMware SAN environments where VAAI is enabled.
  2. Microsoft ODX: Used by Windows copy operations. If the source and destination of the copy are in the same volume, ONTAP uses sub-file clones for copy-offload. Applicable to both SAN and CIFS Windows environments where ODX is enabled.

The following sequence of operations can result in this problem:

  1. Blocks of a file are compressed in the extended-compression format in ONTAP 9.5 or later.
  2. These blocks are tiered out to the capacity tier by FabricPool.
  3. A copy-offload operation is performed, using this file as the source, to another destination file in the same volume. Internally, ONTAP performs a sub-file clone operation from the source file to the destination file to service the copy-offload request.
  4. On the destination file, the extended-compression format information is lost for the compressed blocks that are tiered out.
  5. A subsequent read of these blocks on the destination file returns the data without decompressing it. In some cases where the compressed block is deduplicated with a block in another volume, this can result in “inconsistent user data” messages and a controller disruption.
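The sequence above can be illustrated with a toy model. This is a sketch for intuition only and does not reflect ONTAP internals: the `Block` class, the `compressed` flag (standing in for the extended-compression format information), and the use of `zlib` are all assumptions made for the example.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Block:
    """Toy block: a payload plus the metadata needed to read it back correctly."""
    payload: bytes
    compressed: bool   # stands in for the extended-compression format information
    tiered_out: bool   # True once the block has moved to the capacity tier


def read_block(block: Block) -> bytes:
    """Decompress only if the metadata says the payload is compressed."""
    return zlib.decompress(block.payload) if block.compressed else block.payload


def buggy_subfile_clone(src: Block) -> Block:
    """Model of step 4: the compression flag is dropped for tiered-out blocks."""
    keeps_flag = src.compressed and not src.tiered_out
    return Block(src.payload, compressed=keeps_flag, tiered_out=src.tiered_out)


data = b"user data " * 100
source = Block(zlib.compress(data), compressed=True, tiered_out=True)   # steps 1 and 2
destination = buggy_subfile_clone(source)                               # steps 3 and 4

# Step 5: the source still reads correctly, but the destination returns raw compressed bytes.
source_ok = read_block(source) == data
destination_ok = read_block(destination) == data
```

In this model `source_ok` is true while `destination_ok` is false, which corresponds to the destination read returning data that was never decompressed.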

Symptom

If this issue causes data corruption, customers may see the following:

  • AutoSupport: an alert with the header “HA Group Notification (WAFL INCONSISTENT USER DATA BLOCK) ALERT”

  • Event Management System (EMS) log updates with messages like:
Mon Jan 1 00:00:00 GMT [FILER-01: wafl_exempt12: wafl.raid.incons.userdata:error]: WAFL inconsistent: inconsistent user data block at VBN 12345 (vvbn:12345 fbn:12345 level:0) in public inode (fileid:12345 snapid:0 file_type:1 disk_flags:0x9002 error:120 raid_set:1) in volume Volumename@vserver:.
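
When scanning collected EMS logs for this symptom, a pattern match on the event name can help. The following is a minimal sketch that assumes messages follow the layout of the sample above; the function name `parse_inconsistent_block` is hypothetical, and real messages may differ in detail.

```python
import re

# Matches the node name, VBN, file ID, and volume name in a
# wafl.raid.incons.userdata message shaped like the sample above.
EMS_PATTERN = re.compile(
    r"\[(?P<node>[^:\]]+):\s*[^:\]]+:\s*wafl\.raid\.incons\.userdata:error\]"
    r".*?VBN (?P<vbn>\d+)"
    r".*?fileid:(?P<fileid>\d+)"
    r".*?in volume (?P<volume>[^@\s]+)@"
)


def parse_inconsistent_block(line: str):
    """Return the key fields of a matching EMS line, or None if it does not match."""
    match = EMS_PATTERN.search(line)
    if match is None:
        return None
    return {
        "node": match.group("node"),
        "vbn": int(match.group("vbn")),
        "fileid": int(match.group("fileid")),
        "volume": match.group("volume"),
    }


SAMPLE = (
    "Mon Jan 1 00:00:00 GMT [FILER-01: wafl_exempt12: wafl.raid.incons.userdata:error]: "
    "WAFL inconsistent: inconsistent user data block at VBN 12345 "
    "(vvbn:12345 fbn:12345 level:0) in public inode "
    "(fileid:12345 snapid:0 file_type:1 disk_flags:0x9002 error:120 raid_set:1) "
    "in volume Volumename@vserver:."
)
parsed = parse_inconsistent_block(SAMPLE)
```

Lines that do not contain the `wafl.raid.incons.userdata` event return `None`, so the function can be applied to every line of a log file.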

Workaround

As a workaround, disable the copy-offload feature in ONTAP. This prevents new corruption from occurring if the sequence of operations described above happens again.

To disable copy offload, use the following commands:

  • VAAI XCOPY: vserver copy-offload modify -subfile-sisclone disabled (diag mode command - contact the NetApp Support team for assistance)
  • Microsoft ODX: vserver cifs options modify -vserver vserver_name -copy-offload-enabled false
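
Where ODX must be disabled on several SVMs, the command text can be generated rather than typed by hand. This is a small sketch that only builds the string shown in the list above; the helper name `odx_copy_offload_command` is hypothetical, and the `true` value used for re-enabling is an assumption inferred from the `false` form.

```python
def odx_copy_offload_command(vserver_name: str, enabled: bool = False) -> str:
    """Build the ODX copy-offload command from this bulletin for one SVM.

    enabled=False yields the workaround command as documented above;
    enabled=True (an assumed counterpart) is for removing the workaround later.
    """
    value = "true" if enabled else "false"
    return (
        f"vserver cifs options modify -vserver {vserver_name} "
        f"-copy-offload-enabled {value}"
    )


# Hypothetical SVM names, used purely for illustration.
commands = [odx_copy_offload_command(name) for name in ("svm1", "svm2")]
```

The generated strings are meant to be reviewed and then run in the ONTAP CLI; the sketch does not connect to a cluster.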

Disabling copy-offload in ONTAP will cause all copy and VM clone operations to automatically fall back to host-based copy/cloning mechanisms at the application level.

Be aware that in SAN environments, disabling copy-offload requires taking LUNs offline first. SAN customers may prefer to disable VAAI XCOPY or Microsoft ODX on the host system(s) instead, which avoids any service interruption.

Solution

This issue is fixed in the following ONTAP releases: ONTAP 9.5P10 (expected 12/20/2019), ONTAP 9.6P5 (available as of 12/18/2019), and ONTAP 9.7 (ETA: second half of January 2020). Exposed customers are strongly advised to upgrade to a release of ONTAP where this issue is fixed as soon as is operationally feasible.
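
The fixed releases listed above can be turned into a quick check for a given release string. This is a minimal sketch with stated assumptions: the helper `needs_upgrade` is hypothetical, it accepts only strings of the form `9.5`, `9.5P10`, or `9.7`, and it assumes that release families after 9.7 also carry the fix.

```python
import re

# First patch level containing the fix, per release family, as listed in this bulletin.
FIXED_PATCH = {(9, 5): 10, (9, 6): 5}


def needs_upgrade(release: str) -> bool:
    """Return True if the release is in an affected family and below the fixed patch."""
    match = re.fullmatch(r"(\d+)\.(\d+)(?:P(\d+))?", release.strip())
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    family = (int(match.group(1)), int(match.group(2)))
    patch = int(match.group(3) or 0)
    if family < (9, 5):
        return False                      # earlier than the affected releases
    if family in FIXED_PATCH:
        return patch < FIXED_PATCH[family]
    return False                          # 9.7 and (by assumption) later families
```

Release strings in other formats, such as release candidates, raise `ValueError` rather than being guessed at.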

After upgrading to a fixed release, the workaround above (if applied) can be removed by re-enabling copy-offload.

Please reach out to the NetApp Support team if you need any assistance.

Additional Information

See Bug ID 1280392