SU507: [Impact: Critical] Controller disruption and potential data inconsistency with FabricPool tiering within the same FAS HA Pair

Last Updated:
10/31/2022, 9:09:08 PM

Summary

[Impact Critical: Potential data loss]

A controller disruption and data inconsistency might be experienced in configurations where an ONTAP S3 bucket is configured as the target for FabricPool tiering within the same controller or the same high-availability (HA) pair. For HA configurations, the issue might be experienced with FabricPool activity when a single controller hosts both ONTAP S3 and FabricPool while it has taken over its HA partner. Note that such a configuration is not recommended for performance reasons.

This issue exists in the following releases (inclusive):

  • ONTAP 9.8RC1 through 9.8P12
  • ONTAP 9.9.1RC1 through 9.9.1P9

This issue does not impact AFF controllers, nor does it impact cloud-based ONTAP deployments (Cloud Volumes ONTAP, AWS FSx).
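The release ranges above can be checked mechanically. The following is a minimal Python sketch, not a NetApp tool: the function names, the platform labels "FAS" and "AFF", and the choice to treat GA and RC builds as patch level 0 are assumptions made for illustration. It only tests release and platform; whether a system is actually exposed also depends on the FabricPool and ONTAP S3 configuration described in this bulletin.

```python
import re

# Highest affected patch level per release train, from the list above (inclusive).
AFFECTED_MAX_PATCH = {"9.8": 12, "9.9.1": 9}

# Matches strings such as "9.8", "9.8RC1", "9.9.1P9", "9.10.1P3".
_VERSION_RE = re.compile(
    r"^(?P<train>\d+\.\d+(?:\.\d+)?)(?:(?P<kind>RC|P)(?P<num>\d+))?$"
)


def parse_version(version: str):
    """Split a version string into (train, patch_level)."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized ONTAP version string: {version!r}")
    # Assumption: GA and RC builds precede P1, so both count as patch level 0.
    patch = int(match["num"]) if match["kind"] == "P" else 0
    return match["train"], patch


def is_exposed(version: str, platform: str) -> bool:
    """Return True if the release and platform fall inside the affected set."""
    if platform.upper() != "FAS":
        return False  # AFF and cloud-based deployments are not impacted
    train, patch = parse_version(version)
    max_patch = AFFECTED_MAX_PATCH.get(train)
    return max_patch is not None and patch <= max_patch
```

For example, `is_exposed("9.8P12", "FAS")` is true, while `is_exposed("9.8P13", "FAS")` and `is_exposed("9.10.1", "FAS")` are false.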

Issue Description

When the local FabricPool tier and the local S3 bucket are hosted on the same FAS controller (either in a single-controller configuration, or in an HA pair where one controller has taken over its partner) and a FabricPool tiering policy is active, that controller might experience a disruption and data inconsistency in user data blocks.

This issue only impacts FabricPool configurations where the possibility exists of having the same controller host both the local tier and the local S3 bucket, which is not a recommended configuration. In addition, this issue only impacts controllers running ONTAP 9.8 or 9.9.1 without the fix for Bug 1357686. Systems running ONTAP 9.10.1 or later are not exposed to this problem, even if a non-optimal FabricPool configuration is in place.

Note that the functionality that enables this configuration is only available as of ONTAP 9.8. Controllers running earlier releases of ONTAP are not exposed to this issue.

For more information, see Bug 1357686.

Symptom

When both the local tier and the local S3 bucket are hosted on the same controller and an active FabricPool policy is in place, any activity that results in data being tiered to the local S3 bucket can result in the following:

Controller disruption with the following panic message:

Host memory checksum mismatch on WRITE VERIFY: Disk <disk_ID>, Disk Block #XXXX: Volume <Volume_name>, FileId XXX,File Block #XXX: Expected 0xYYYYYYYY, Recomputed as 0xZZZZZZZZ in SK process disk_server_0 on release 9.X (C)

In addition, RAID scrubbing by the controller after the disruption might report parity errors that reference the aggregate hosting the local S3 bucket:

[node-02: raidio_thread: raid_rg_readerr_repair_cksum_stored_1:notice]: params: {'disk_rpm': '10000', 'vendor': 'NETAPP  ', 'firmware_revision': 'NA01', 'shelf': '23', 'disk_info': 'Disk /<cloud_tier_aggregate>/plex0/rg1/0c.23.8 Shelf 23 Bay 8 [NETAPP   X343_TA15E1T8A10 NA01] S/N [XXX] UID [5000039B:3840A21C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000]', 'site': 'Local', 'bay': '8', 'carrier': '', 'serialno': 'XXX', 'owner': '', 'model': 'X343_TA15E1T8A10', 'disk_type': '4', 'blockNum': '17612'}

It is also possible that inconsistent user data blocks that reference a volume name on the aggregate hosting the local tier might be detected:

[node-01: wafl_exempt12: wafl.raid.incons.userdata:error]: WAFL inconsistent: inconsistent user data block at VBN XXX (vvbn:XXX fbn:XXX level:0) in public inode (fileid:XXX snapid:0 file_type:15 disk_flags:0x8402 error:120 raid_set:1) in volume <volume_name>@vserver:<Vserver_UUID>.

If this issue (with the above symptoms) is encountered, open a case with NetApp Technical Support for further assistance.

If the issue has not been encountered, follow the guidance in the Workaround section of this bulletin, or preferably upgrade to a version of ONTAP where Bug 1357686 is fixed.
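The three message signatures quoted in this section can be searched for in collected log text. The sketch below is illustrative only: it assumes the logs are already available as lines of text, and the function name `scan_log` and the signature labels are hypothetical. A match indicates that a line resembles one of the messages above and is not by itself a diagnosis.

```python
import re

# Signatures copied from the messages quoted in this bulletin.
SIGNATURES = {
    "panic": re.compile(r"Host memory checksum mismatch on WRITE VERIFY"),
    "raid_repair": re.compile(r"raid_rg_readerr_repair_cksum_stored_1"),
    "wafl_inconsistent": re.compile(r"wafl\.raid\.incons\.userdata"),
}


def scan_log(lines):
    """Return {signature_name: [matching lines]} for every signature found."""
    hits = {}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(line.rstrip("\n"))
    return hits
```

Passing an open text file (or any iterable of strings) to `scan_log` returns an empty dictionary when none of the signatures appear.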

Workaround

To avoid this issue, do not use FabricPool configurations where the local tier and the local S3 bucket could be hosted on the same controller. Where this is not possible, two workaround options exist.

  1. Change a low-level system configuration option. Contact NetApp Technical Support, referencing Bug 1357686, for details.
  2. Configure the FabricPool ONTAP S3 object store to use HTTPS, rather than HTTP (details can be found in this knowledgebase article).

The preferred solution is to upgrade to a version of ONTAP where Bug 1357686 is fixed.

Solution

Upgrade to a version of ONTAP that has the fix for Bug 1357686.

Bug 1357686 is (or will be) fixed in the following ONTAP releases:

  • ONTAP 9.8P13 (available now)
  • ONTAP 9.9.1P10 (available now)
  • ONTAP 9.10.1 (available now; 9.10.1P3 or later is preferred) and all subsequent releases

Note: All ETAs are subject to change. This bulletin will be updated as dates become final or change.

Additional Information