SU507: [Impact: Critical] Controller disruption and potential data inconsistency with FabricPool tiering within the same FAS HA Pair
Last Updated: 10/31/2022, 9:09:08 PM
Summary
[Impact Critical: Potential data loss]
A controller disruption and data inconsistency might be experienced in configurations where an ONTAP S3 bucket is configured as the target for FabricPool tiering within the same controller or the same high-availability (HA) pair. For HA configurations, the issue might be experienced with FabricPool activity when a single controller hosts both ONTAP S3 and FabricPool while it has taken over its HA partner. Note that such a configuration is not recommended for performance reasons.
This issue exists in the following releases (inclusive):
- ONTAP 9.8RC1 through 9.8P12
- ONTAP 9.9.1RC1 through 9.9.1P9
This issue does not impact AFF controllers, nor does it impact cloud-based ONTAP deployments (Cloud Volumes ONTAP, AWS FSx).
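The affected ranges above can be checked mechanically. The following is a minimal sketch, not a NetApp tool: the function names are hypothetical, and it assumes that version strings look like 9.8RC1, 9.8, or 9.9.1P9 and that releases within a family order as RC, then GA, then P releases.

```python
import re

# Affected ranges (inclusive) as listed in this bulletin.
AFFECTED_RANGES = [("9.8RC1", "9.8P12"), ("9.9.1RC1", "9.9.1P9")]

# Assumed string layout: dotted family, then an optional RCn or Pn suffix.
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)+)(?:(RC|P)(\d+))?$")


def parse_version(text):
    """Split '9.9.1P9' into ((9, 9, 1), rank), ranking RCn < GA < Pn (assumption)."""
    match = _VERSION_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized ONTAP version string: {text!r}")
    family = tuple(int(part) for part in match.group(1).split("."))
    kind, number = match.group(2), match.group(3)
    if kind == "RC":
        rank = (0, int(number))
    elif kind == "P":
        rank = (2, int(number))
    else:
        rank = (1, 0)  # GA release without a suffix
    return family, rank


def is_affected(text):
    """Return True if the version falls inside one of the listed ranges."""
    family, rank = parse_version(text)
    for low_text, high_text in AFFECTED_RANGES:
        low_family, low_rank = parse_version(low_text)
        high_family, high_rank = parse_version(high_text)
        if family == low_family == high_family and low_rank <= rank <= high_rank:
            return True
    return False
```

For example, is_affected("9.8P12") is true while is_affected("9.8P13") is false. A version inside the range is only exposed if the configuration conditions described in this bulletin also apply.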
Issue Description
If the local FabricPool tier and the local S3 bucket are hosted on the same FAS controller (either in a single-controller configuration or in an HA pair where one controller has taken over its partner) and a FabricPool policy is active, that controller might experience a disruption and data inconsistency in user data blocks.
This issue only impacts FabricPool configurations where the possibility exists of having the same controller host both the local tier and the local S3 bucket, which is not a recommended configuration. In addition, this issue only impacts controllers running ONTAP 9.8 or 9.9.1 without the fix for Bug 1357686. Systems running ONTAP 9.10.1 or later are not exposed to this problem, even if a non-optimal FabricPool configuration is in place.
Note that the functionality that enables this configuration is only available as of ONTAP 9.8. Controllers running earlier releases of ONTAP are not exposed to this issue.
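The exposure conditions described above all have to hold at once. As an illustration only, the sketch below encodes them as a checklist; the TieringSetup record and its field names are hypothetical and would need to be filled in from an administrator's own inventory.

```python
from dataclasses import dataclass


@dataclass
class TieringSetup:
    """Hypothetical summary of one FabricPool setup (field names are illustrative)."""
    platform: str                      # for example "FAS" or "AFF"
    runs_affected_release: bool        # ONTAP 9.8 or 9.9.1 without the fix for Bug 1357686
    bucket_can_share_controller: bool  # local tier and local S3 bucket can land on one controller
    tiering_policy_active: bool        # an active FabricPool policy is in place


def exposure_gaps(setup):
    """Return the bulletin conditions that this setup does NOT meet."""
    gaps = []
    if setup.platform.upper() != "FAS":
        gaps.append("not a FAS controller")
    if not setup.runs_affected_release:
        gaps.append("release not affected")
    if not setup.bucket_can_share_controller:
        gaps.append("local tier and S3 bucket cannot share a controller")
    if not setup.tiering_policy_active:
        gaps.append("no active FabricPool policy")
    return gaps


def is_exposed(setup):
    """A setup is exposed only when every condition in the bulletin is met."""
    return not exposure_gaps(setup)
```

Returning the unmet conditions rather than a bare boolean makes it easier to see why a given system is considered safe, for example because it is an AFF controller.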
For more information, see Bug 1357686.
Symptom
When both the local tier and the local S3 bucket are hosted on the same controller and an active FabricPool policy is in place, any activity that results in data being tiered to the local S3 bucket can result in the following:
A controller disruption with the following panic message:
Host memory checksum mismatch on WRITE VERIFY: Disk <disk_ID>, Disk Block #XXXX: Volume <Volume_name>, FileId XXX,File Block #XXX: Expected 0xYYYYYYYY, Recomputed as 0xZZZZZZZZ in SK process disk_server_0 on release 9.X (C)
In addition, RAID scrubbing by the controller after the disruption might report parity errors that reference the aggregate hosting the local S3 bucket:
[node-02: raidio_thread: raid_rg_readerr_repair_cksum_stored_1:notice]: params: {'disk_rpm': '10000', 'vendor': 'NETAPP ', 'firmware_revision': 'NA01', 'shelf': '23', 'disk_info': 'Disk /<cloud_tier_aggregate>/plex0/rg1/0c.23.8 Shelf 23 Bay 8 [NETAPP X343_TA15E1T8A10 NA01] S/N [XXX] UID [5000039B:3840A21C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]', 'site': 'Local', 'bay': '8', 'carrier': '', 'serialno': 'XXX', 'owner': '', 'model': 'X343_TA15E1T8A10', 'disk_type': '4', 'blockNum': '17612'}
It is also possible that inconsistent user data blocks that reference a volume name on the aggregate hosting the local tier might be detected:
[node-01: wafl_exempt12: wafl.raid.incons.userdata:error]: WAFL inconsistent: inconsistent user data block at VBN XXX (vvbn:XXX fbn:XXX level:0) in public inode (fileid:XXX snapid:0 file_type:15 disk_flags:0x8402 error:120 raid_set:1) in volume <volume_name>@vserver:<Vserver_UUID>.
If this issue (with the above symptoms) is encountered, open a case with NetApp Technical Support for further assistance.
If the issue has not been encountered, follow the guidance in the Workaround section of this bulletin, or preferably upgrade to a version of ONTAP where Bug 1357686 is fixed.
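The three symptom messages quoted in this section can be searched for in collected log text. The following is a rough sketch under stated assumptions: the patterns are taken from the example messages in this bulletin, the signature names and the scan_log function are hypothetical, and a match is only a prompt to contact NetApp Technical Support, not a diagnosis.

```python
import re

# Substrings copied from the example messages in this bulletin.
SIGNATURES = {
    "panic_write_verify": re.compile(r"Host memory checksum mismatch on WRITE VERIFY"),
    "raid_cksum_repair": re.compile(r"raid_rg_readerr_repair_cksum_stored_1"),
    "wafl_incons_userdata": re.compile(r"wafl\.raid\.incons\.userdata"),
}


def scan_log(lines):
    """Return {signature_name: [line_number, ...]} for every signature that matched."""
    hits = {name: [] for name in SIGNATURES}
    for number, line in enumerate(lines, start=1):
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(number)
    # Drop signatures that never matched so an empty dict means "nothing found".
    return {name: numbers for name, numbers in hits.items() if numbers}
```

The function accepts any iterable of lines, so it can be called with an open file object or with a list of strings gathered from another tool.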
Workaround
To avoid this issue, do not use FabricPool configurations where the local tier and the local S3 bucket could be hosted on the same controller. Where this is not possible, two workaround options exist.
- Change a low-level system configuration option. Contact NetApp Technical Support and reference Bug 1357686 for details.
- Configure the FabricPool ONTAP S3 object store to use HTTPS rather than HTTP (details can be found in this knowledgebase article).
The preferred solution is to upgrade to a version of ONTAP where Bug 1357686 is fixed.
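To audit the HTTPS workaround across several object stores, a short filter such as the following could be used. This is a sketch over hypothetical records: the dictionary keys name, provider_type, and ssl_enabled are assumptions for illustration, not ONTAP output, and would have to be populated from the actual object store configuration.

```python
def stores_needing_https(object_stores):
    """Return the names of ONTAP S3 object stores that are not configured for HTTPS."""
    return [
        store["name"]
        for store in object_stores
        # Only ONTAP S3 targets are relevant to this bulletin (assumed provider label).
        if store["provider_type"] == "ONTAP_S3" and not store["ssl_enabled"]
    ]


# Usage with made-up records:
inventory = [
    {"name": "local_bucket", "provider_type": "ONTAP_S3", "ssl_enabled": False},
    {"name": "secure_bucket", "provider_type": "ONTAP_S3", "ssl_enabled": True},
]
print(stores_needing_https(inventory))
```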
Solution
Upgrade to a version of ONTAP that has the fix for Bug 1357686.
Bug 1357686 is (or will be) fixed in the following ONTAP releases:
- ONTAP 9.8P13 (available now)
- ONTAP 9.9.1P10 (available now)
- ONTAP 9.10.1 (available now; 9.10.1P3 or later is preferred) and all subsequent releases
Note: All ETAs are subject to change. This bulletin will be updated as dates become final or change.
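The list of fixed releases can be expressed as a lookup from a release family to the first release that contains the fix. The sketch below is illustrative only: the upgrade_target function is hypothetical, it takes a family string such as 9.8 without an RC or P suffix, and it returns None both for families earlier than 9.8 (not exposed) and for 9.10.1 and later (already fixed).

```python
# First fixed release per affected family, as listed in this bulletin.
FIXED_IN = {"9.8": "9.8P13", "9.9.1": "9.9.1P10"}


def upgrade_target(family):
    """Return the minimum fixed release for a family, or None if no upgrade is needed."""
    key = tuple(int(part) for part in family.split("."))
    if key < (9, 8):
        return None  # earlier releases are not exposed to this issue
    if key >= (9, 10, 1):
        return None  # 9.10.1 and all subsequent releases contain the fix
    # Any other family in between falls back to 9.10.1 (9.10.1P3 or later preferred).
    return FIXED_IN.get(family, "9.10.1")
```

For example, upgrade_target("9.9.1") returns "9.9.1P10", and upgrade_target("9.10.1") returns None.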
Additional Information
- For more information, see Bug 1357686
- S3 in ONTAP Best Practices Guide
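Lenovo NetApp Technology Co., Ltd. ("Lenovo NetApp") makes no representations or warranties as to the accuracy, reliability, or serviceability of any information or recommendations provided on this page, or as to any results that may be obtained by using this information or following the recommendations provided herein. The information on this page is distributed as is, and the use of this information or the implementation of any recommendations or techniques herein is the customer's responsibility and depends on the customer's ability to evaluate it and integrate it into the customer's operational environment. This page and the information it contains may be used only in connection with the NetApp products discussed on this page. In no event shall Lenovo NetApp be liable for any special, indirect, or consequential damages arising out of or in connection with the use or performance of the information provided on this page, or for any damages resulting from loss of use, data, or profits, whether in an action of contract, negligence, or other tortious action.
For the latest information, refer to the support bulletins on the official NetApp website.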