SU399: [Impact: Critical] Controller disruption and data loss might occur on volumes with systems running ONTAP 9.5 or later
- Last Updated:
- 4/7/2022, 3:55:38 AM
Summary
[Impact: Critical – Potential Data Loss/Corruption]
User data loss or corruption is possible on a volume on systems running ONTAP 9.5 or later if FabricPool and copy-offload are in use. In some cases, this data corruption can also lead to a controller disruption.
Issue Description
User data loss or corruption can occur in volumes hosted on systems running ONTAP 9.5 or later when all of the following are true: compression and deduplication are enabled on the volume (on by default for AFF), FabricPool is in use, and clients access data using sub-file clone operations (Microsoft Offloaded Data Transfer (ODX) or VMware vStorage APIs for Array Integration (VAAI) XCOPY). In some cases, this data corruption can also lead to a controller disruption.
This issue causes corruption only when the data being accessed via Microsoft ODX or VMware VAAI XCOPY has been tiered out to the capacity tier by FabricPool.
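The exposure conditions above can be summarized as a single predicate. The following is a minimal sketch, not a NetApp tool: `VolumeState`, its field names, and `is_exposed` are hypothetical, and the values would need to be gathered manually from your own environment. It also ignores patch level, so a volume on a fixed release would still be flagged.

```python
from dataclasses import dataclass


@dataclass
class VolumeState:
    """Hypothetical summary of one volume; field names are illustrative only."""
    ontap_version: tuple            # (major, minor), for example (9, 6)
    compression_enabled: bool
    deduplication_enabled: bool
    fabricpool_in_use: bool
    odx_or_vaai_xcopy_in_use: bool  # clients use ODX or VAAI XCOPY on this volume


def is_exposed(vol: VolumeState) -> bool:
    """Return True only when every condition named in the bulletin holds."""
    return (
        vol.ontap_version >= (9, 5)
        and vol.compression_enabled
        and vol.deduplication_enabled
        and vol.fabricpool_in_use
        and vol.odx_or_vaai_xcopy_in_use
    )
```

Because the conditions are combined with `and`, removing any one of them (for example, disabling copy-offload) makes the predicate false, which mirrors the reasoning behind the workaround.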
The sub-file clone feature is used for copy-offload during an intra-volume copy or VM clone, and it is initiated by the client application. ONTAP has two callers of sub-file clone copy-offload:
- VAAI XCOPY: Invoked by VMware VAAI VM clone or copy operations. If both the source and destination VM of the clone or copy are in the same volume, ONTAP uses sub-file clones for copy-offload. Only applicable to VMware SAN environments where VAAI is enabled.
- Microsoft ODX: Used by Windows copy operations. If the source and destination of the copy are in the same volume, ONTAP uses sub-file clones for copy-offload. Applicable to both SAN and CIFS Windows environments where ODX is enabled.
The following sequence of operations can result in this problem:
- Blocks in a file are compressed in the extended-compression format in ONTAP 9.5 or later.
- These blocks are tiered out to the capacity tier by FabricPool.
- A copy-offload operation is performed with this file as the source and another file in the same volume as the destination. Internally, ONTAP performs a sub-file clone operation from the source file to the destination file to serve the copy-offload request.
- On the destination file, the extended-compression format information is lost for the compressed blocks that are tiered out.
- A subsequent read of these blocks on the destination file returns the data without decompressing it. In some cases, where the compressed block is deduplicated with a block in another volume, this can result in “inconsistent user data” messages and a controller disruption.
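The sequence above can be illustrated with a toy model. This is a sketch for intuition only and does not reflect ONTAP internals: the `Block` class, the `buggy_clone` function, and the use of `zlib` as a stand-in for the extended-compression format are all assumptions made for the example.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Block:
    payload: bytes     # bytes as stored on media
    compressed: bool   # stand-in for the compression-format information
    tiered_out: bool   # True once the block lives in the capacity tier


def write_block(data: bytes) -> Block:
    """Store data compressed and record that fact in the block metadata."""
    return Block(zlib.compress(data), compressed=True, tiered_out=False)


def read_block(block: Block) -> bytes:
    """Decompress only if the metadata says the payload is compressed."""
    return zlib.decompress(block.payload) if block.compressed else block.payload


def buggy_clone(src: Block) -> Block:
    """Model of the defect: the format flag is dropped for tiered-out blocks."""
    keep_flag = src.compressed and not src.tiered_out
    return Block(src.payload, compressed=keep_flag, tiered_out=src.tiered_out)


original = b"user data " * 100
src = write_block(original)         # step 1: block is written compressed
src.tiered_out = True               # step 2: block is tiered out
dst = buggy_clone(src)              # steps 3-4: clone loses the format flag

print(read_block(src) == original)  # True: the source still reads correctly
print(read_block(dst) == original)  # False: the destination returns compressed bytes
```

In this model the source file is never damaged; only reads through the cloned destination return raw compressed bytes, which matches the last step of the sequence.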
Symptom
If this issue causes data corruption, customers may see the following:
- AutoSupport
An alert with the header “HA Group Notification (WAFL INCONSISTENT USER DATA BLOCK) ALERT”
- Event Management System (EMS) log entries such as:
Mon Jan 1 00:00:00 GMT [FILER-01: wafl_exempt12: wafl.raid.incons.userdata:error]: WAFL inconsistent: inconsistent user data block at VBN 12345 (vvbn:12345 fbn:12345 level:0) in public inode (fileid:12345 snapid:0 file_type:1 disk_flags:0x9002 error:120 raid_set:1) in volume Volumename@vserver:.
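For log scanning, a message of this shape can be matched with a regular expression. The sketch below assumes the layout of the sample message above; the pattern `EMS_PATTERN` and the function `parse_inconsistent_userdata` are hypothetical names, and real EMS output may differ in ways this pattern does not cover.

```python
import re

# Pattern derived from the sample message only; treat it as a starting point.
EMS_PATTERN = re.compile(
    r"\[(?P<node>[^:\]]+):[^\]]*wafl\.raid\.incons\.userdata:error\]"
    r".*?VBN (?P<vbn>\d+)"
    r".*?fileid:(?P<fileid>\d+)"
    r".*?in volume (?P<volume>[^\s@:]+)"
)


def parse_inconsistent_userdata(line: str):
    """Return the node, VBN, file ID, and volume name, or None if no match."""
    m = EMS_PATTERN.search(line)
    if m is None:
        return None
    return {
        "node": m["node"],
        "vbn": int(m["vbn"]),
        "fileid": int(m["fileid"]),
        "volume": m["volume"],
    }


sample = (
    "Mon Jan 1 00:00:00 GMT [FILER-01: wafl_exempt12: "
    "wafl.raid.incons.userdata:error]: WAFL inconsistent: inconsistent user "
    "data block at VBN 12345 (vvbn:12345 fbn:12345 level:0) in public inode "
    "(fileid:12345 snapid:0 file_type:1 disk_flags:0x9002 error:120 "
    "raid_set:1) in volume Volumename@vserver:."
)
print(parse_inconsistent_userdata(sample))
```

Lines that do not contain the `wafl.raid.incons.userdata` event return `None`, so the function can be applied to every line of a log export.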
Workaround
As a workaround, disable the copy-offload feature in ONTAP. This prevents new corruption if the sequence of operations described above occurs again.
To disable copy offload, use the following commands:
- VAAI XCOPY:
vserver copy-offload modify -subfile-sisclone disabled
(diag mode command - contact the NetApp Support team for assistance)
- Microsoft ODX:
vserver cifs options modify -vserver vserver_name -copy-offload-enabled false
Disabling copy-offload in ONTAP will cause all copy and VM clone operations to automatically fall back to host-based copy/cloning mechanisms at the application level.
Be aware that for SAN environments, disabling copy offload will require taking LUNs offline first. It may be more palatable for SAN customers to disable VAAI XCOPY or Microsoft ODX on the host system(s) to avoid any service interruption.
Solution
This issue is fixed in the following ONTAP releases: ONTAP 9.5P10 (expected 12/20/2019), ONTAP 9.6P5 (available as of 12/18/2019), and ONTAP 9.7 (ETA: second half of January 2020). Exposed customers are strongly advised to upgrade to a release of ONTAP where this issue is fixed as soon as operationally feasible.
After upgrading to a fixed release, the above workaround (if applied) can be removed by re-enabling copy-offload.
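The fixed releases listed above can be turned into a small version check. This is a minimal sketch under stated assumptions: `classify` and `FIXED_PATCH` are hypothetical names, only version strings of the form `9.6` or `9.6P5` are accepted, and release families the bulletin does not mention are reported as unknown rather than guessed.

```python
import re

# Minimum fixed patch level per release family, taken from the bulletin.
# A value of 0 means the base release of that family contains the fix.
FIXED_PATCH = {(9, 5): 10, (9, 6): 5, (9, 7): 0}

# Accepts "9.6" or "9.6P5"; other formats are rejected (assumption of this sketch).
VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:P(\d+))?$")


def classify(version: str) -> str:
    """Return 'not affected', 'affected', 'fixed', or 'unknown'."""
    m = VERSION_RE.match(version.strip())
    if m is None:
        raise ValueError(f"unsupported version format: {version!r}")
    family = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    if family < (9, 5):
        return "not affected"   # the issue applies to ONTAP 9.5 or later
    if family not in FIXED_PATCH:
        return "unknown"        # family not named in the bulletin
    return "fixed" if patch >= FIXED_PATCH[family] else "affected"


for v in ("9.4P3", "9.5P9", "9.5P10", "9.6", "9.6P5", "9.7"):
    print(v, classify(v))
```

A result of "affected" only means the release predates the fix; whether a given volume is actually exposed still depends on the conditions in the Issue Description.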
Please reach out to the NetApp Support team if you need any assistance.
Additional Information
See Bug ID 1280392
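Lenovo NetApp Technology Co., Ltd. (“Lenovo NetApp”) makes no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided on this page, or regarding any results that may be obtained by using this information or following the recommendations provided herein. The information on this page is distributed as is, and the use of this information or the implementation of any recommendations or techniques herein is the customer's responsibility and depends on the customer's ability to evaluate this information and integrate it into the customer's operational environment. This page and the information it contains may be used only in connection with the NetApp products discussed on this page. In no event shall Lenovo NetApp be liable for any special, indirect, or consequential damages arising out of or in connection with the use or performance of the information provided on this page, or for any damages resulting from loss of use, data, or profits, whether in an action of contract, negligence, or other tortious action.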
For the latest information, refer to the support bulletins on the official NetApp website.