SU531: [Impact Critical] Upgrading to ONTAP 9.12.1 might result in Replicated Database (RDB) inconsistency for the Volume Location Database (VLDB)

Last Updated: 6/23/2023, 4:40:46 AM

Summary

[Impact Critical: Cluster data outage]

Systems that have been upgraded to ONTAP 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1 might have mismatched data in ONTAP's Replicated Database (RDB), which is used to ensure consistency of volume operations across an ONTAP cluster. This inconsistency can result in FlexGroup volumes being taken offline and, under some circumstances, can also result in a system disruption. Customers who have upgraded from a pre-ONTAP 9.12.1 release to ONTAP 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1 are advised to upgrade to ONTAP 9.12.1P4 (or later, as available), where this issue is addressed.

Note that this issue occurs only when a cluster is upgraded to an unfixed ONTAP 9.12.1 release from an earlier (pre-9.12.1) ONTAP release. Clusters initially deployed on an ONTAP 9.12.1 release are not impacted by this issue; however, because of other issues seen on ONTAP 9.12.1, it is strongly recommended to upgrade to ONTAP 9.12.1P4 (or later, as available).
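
To confirm which ONTAP release each node is currently running, and therefore whether the cluster falls within the affected range, the cluster image show command lists the installed version per node. The cluster name in the prompt below is illustrative:

cluster01::> cluster image show

Nodes reporting 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1 that reached that release through an upgrade are within the scope of this bulletin.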

Issue Description

Upon upgrading to ONTAP 9.12.1 from a pre-ONTAP 9.12.1 release, the Replicated Database (RDB) replica for the Volume Location Database (VLDB) might become inconsistent on one node in the cluster. If a write transaction occurs on specific VLDB tables, this inconsistency results in database divergence on that node compared to the rest of the cluster. Normal management operations or internal ONTAP workflows might fail when executed on nodes where the data has diverged. This issue is tracked under BUG ID 1539225.

One confirmed impact involves FlexGroup volumes. If new FlexGroups are created after the database divergence occurs, some of those FlexGroup constituents might become inconsistent and eventually be taken offline, or might cause their hosting aggregates to go offline, resulting in a disruption.
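
NetApp Technical Support normally verifies RDB health directly. As a general, illustrative check only, the advanced-privilege cluster ring show command reports the state of each RDB replication ring per node, including the VLDB unit; a persistent mismatch in the DB epoch or DB transaction values reported for the vldb unit on one node compared to the rest of the cluster can be one indicator of the divergence described above. Any such output should be reviewed with NetApp Technical Support before taking action:

cluster01::> set -privilege advanced
cluster01::*> cluster ring show -unit-name vldb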

Symptom

If FlexGroup operations are impacted by this issue, the following symptoms might be observed:

FlexGroup member volumes might go offline at random, and EMS messages similar to the following are emitted:

Mon Feb 27 12:13:23 -0500 [cluster01-1: vv_config_worker17: wafl.vvol.offline:info]: Volume 'data__0001@vserver:12565bea-c927-11eb-831a-d039ea2a7cd3' has been set temporarily offline
Mon Feb 27 12:13:25 -0500 [cluster01-2: vv_config_worker06: wafl.vvol.offline:info]: Volume 'data__0002@vserver:12565bea-c927-11eb-831a-d039ea2a7cd3' has been set temporarily offline
Mon Feb 27 12:13:27 -0500 [cluster01-3: vv_config_worker16: wafl.vvol.offline:info]: Volume 'data__0003@vserver:12565bea-c927-11eb-831a-d039ea2a7cd3' has been set temporarily offline
Mon Feb 27 12:13:30 -0500 [cluster01-4: vv_config_worker03: wafl.vvol.offline:info]: Volume 'data__0004@vserver:12565bea-c927-11eb-831a-d039ea2a7cd3' has been set temporarily offline
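
These events can also be searched for across the cluster in the event log; the message name below matches the EMS messages shown above:

cluster01::*> event log show -message-name wafl.vvol.offline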

In addition, volume-level operations on those volumes fail as shown below:

cluster01::*> volume online -vserver vs1 -volume data
vol online: Error onlining volume "vs1:data".
Modification not permitted. Unable to determine the state of the volume. This can be because of network connectivity problems or the volume is busy.

cluster01::*> vol offline -vserver vs1 -volume data
vol offline: Error offlining volume "vs1:data".
Modification not permitted. Unable to determine the state of the volume. This can be because of network connectivity problems or the volume is busy.
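
To identify which FlexGroup constituents are currently offline, the constituent volumes can be listed with a filtered volume show; the vserver name below follows the examples above and is illustrative:

cluster01::*> volume show -vserver vs1 -is-constituent true -state offline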

FlexVol rehost operations might also be impacted by this issue and can become stuck in the Initialized state. The following is an example of a stuck volume rehost job:

cluster01::*> job show -id 24572 -instance

                      Job ID: 24572
              Owning Vserver: svm1
                        Name: volume rehost
                 Description: Volume rehost operation on volume "nfs_data" on Vserver "svm1" to destination Vserver "svm2" by administrator "mycompany\myadmin"
                    Priority: High
                        Node: cluster01-02
                    Affinity: Cluster
                    Schedule: @now
                  Queue Time: 03/23 15:03:29
                  Start Time: 03/23 15:03:32
                    End Time: -
              Drop-dead Time: -
                  Restarted?: false
                       State: Running
                 Status Code: 0
           Completion String:
                    Job Type: VOLUME_REHOST
                Job Category: VOPL
                        UUID: 8b070642-c9c6-11ed-9419-d039ea3b296e
          Execution Progress: Initialized...
                   User Name: mycompany\myadmin
Restart Is Delayed by Module: -
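
To check whether volume rehost jobs elsewhere on the cluster are in a similar state, the jobs can be listed by name; this query is illustrative and assumes the job name matches the Name field shown above:

cluster01::*> job show -name "volume rehost"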

Customers running ONTAP 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1 who experience symptoms similar to those documented above should contact NetApp Technical Support for further assistance.

Workaround

If impacted by this issue while running ONTAP 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1, and unable to upgrade to ONTAP 9.12.1P4 or above, contact NetApp Technical Support for further guidance. Otherwise, upgrading to ONTAP 9.12.1P4 avoids the issues documented in this bulletin. Note that even though this issue is fixed in ONTAP 9.12.1P2 and above, because of other issues seen on ONTAP 9.12.1, it is strongly recommended to upgrade to ONTAP 9.12.1P4 (or later, as available).

Solution

As soon as is operationally feasible, upgrade ONTAP systems running ONTAP 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1 to ONTAP 9.12.1P4 (or later, as available) to avoid the issues documented in this bulletin.

  • Upgrading from an earlier (pre-9.12.1) ONTAP release directly to ONTAP 9.12.1P4 or above avoids this problem, as the upgrade code included with that release has been updated to eliminate the cause of this issue.
  • As long as symptoms (FlexGroup constituents becoming inconsistent or being taken offline) have not already been experienced, upgrading from one of the impacted ONTAP 9.12.1 releases to ONTAP 9.12.1P4 invokes an internal process that repairs any RDB inconsistency identified during the upgrade and prevents subsequent problems.
  • FlexGroups that have already been impacted by this issue will need manual recovery with the assistance of NetApp Technical Support before upgrading to ONTAP 9.12.1P4.
  • Note that although this issue is resolved in ONTAP 9.12.1P2 and above, because of other issues seen on ONTAP 9.12.1, it is strongly recommended to upgrade to ONTAP 9.12.1P4 (or later, as available).
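
For reference, an automated non-disruptive upgrade to a fixed patch release generally follows the pattern below. This is an abbreviated illustration only; the web server URL and image file name are placeholders, and the full procedure in the ONTAP upgrade documentation should be followed:

cluster01::> cluster image package get -url http://<web-server>/<ontap-9.12.1P4-image>.tgz
cluster01::> cluster image validate -version 9.12.1P4
cluster01::> cluster image update -version 9.12.1P4
cluster01::> cluster image show-update-progress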

Additional Information

Public Report: BUG ID 1539225