SU531: [Impact Critical] Upgrading to ONTAP 9.12.1 might result in Replicated Database (RDB) inconsistency for the Volume Location Database (VLDB)

Last Updated: 6/23/2023, 4:40:46 AM

Summary

[Impact Critical: Cluster data outage]

Systems that have been upgraded to ONTAP 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1 might have mismatched data in ONTAP's Replicated Database (RDB), which is used to ensure consistency of volume operations across an ONTAP cluster. This inconsistency can result in FlexGroup volumes being taken offline and, under some circumstances, can also result in a system disruption. Customers who have upgraded from a pre-ONTAP 9.12.1 release to ONTAP 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1 are advised to upgrade to ONTAP 9.12.1P4 (or later, as available), where this issue is addressed.

Note that this issue occurs only when a cluster is upgraded to an unfixed ONTAP 9.12.1 release from an earlier (pre-9.12.1) ONTAP release. Clusters initially deployed on an ONTAP 9.12.1 release are not impacted by this issue; however, because of other issues seen on ONTAP 9.12.1, it is strongly recommended to upgrade to ONTAP 9.12.1P4 (or later, as available).
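
To confirm which ONTAP release each node is currently running, and therefore whether the cluster falls within the affected range, the cluster image show command lists the installed version per node. The cluster name in the prompt below is illustrative:

cluster01::> cluster image show

Nodes reporting 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1 that reached that release through an upgrade are within the scope of this bulletin.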

Issue Description

Upon upgrading to ONTAP 9.12.1 from a pre-ONTAP 9.12.1 release, the Replicated Database (RDB) replica for the Volume Location Database (VLDB) might become inconsistent on one node in the cluster. If a write transaction occurs on specific VLDB tables, this inconsistency results in database divergence on that node compared to the rest of the cluster. Normal management operations or internal ONTAP workflows might fail when executed on nodes where the data has diverged. This issue is tracked under BUG ID 1539225.

One confirmed impact involves FlexGroup volumes. If new FlexGroups are created after the database divergence occurs, some of those FlexGroup constituents might become inconsistent and eventually be taken offline, or might cause their hosting aggregates to go offline, resulting in a disruption.
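
NetApp Technical Support normally verifies RDB health directly. As a general, illustrative check only, the advanced-privilege cluster ring show command reports the state of each RDB replication ring per node, including the VLDB unit; a persistent mismatch in the DB epoch or DB transaction values reported for the vldb unit on one node compared to the rest of the cluster can be one indicator of the divergence described above. Any such output should be reviewed with NetApp Technical Support before taking action:

cluster01::> set -privilege advanced
cluster01::*> cluster ring show -unit-name vldb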

Symptom

If FlexGroup operations are impacted by this issue, the following symptoms might be observed:

FlexGroup member volumes might go offline at random, and EMS messages similar to the following are emitted:

Mon Feb 27 12:13:23 -0500 [cluster01-1: vv_config_worker17: wafl.vvol.offline:info]: Volume 'data__0001@vserver:12565bea-c927-11eb-831a-d039ea2a7cd3' has been set temporarily offline
Mon Feb 27 12:13:25 -0500 [cluster01-2: vv_config_worker06: wafl.vvol.offline:info]: Volume 'data__0002@vserver:12565bea-c927-11eb-831a-d039ea2a7cd3' has been set temporarily offline
Mon Feb 27 12:13:27 -0500 [cluster01-3: vv_config_worker16: wafl.vvol.offline:info]: Volume 'data__0003@vserver:12565bea-c927-11eb-831a-d039ea2a7cd3' has been set temporarily offline
Mon Feb 27 12:13:30 -0500 [cluster01-4: vv_config_worker03: wafl.vvol.offline:info]: Volume 'data__0004@vserver:12565bea-c927-11eb-831a-d039ea2a7cd3' has been set temporarily offline
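
These events can also be searched for across the cluster in the event log; the message name below matches the EMS messages shown above:

cluster01::*> event log show -message-name wafl.vvol.offline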

In addition, volume-level operations on those volumes fail as shown below:

cluster01::*> volume online -vserver vs1 -volume data
vol online: Error onlining volume "vs1:data".
Modification not permitted. Unable to determine the state of the volume. This can be because of network connectivity problems or the volume is busy.

cluster01::*> vol offline -vserver vs1 -volume data
vol offline: Error offlining volume "vs1:data".
Modification not permitted. Unable to determine the state of the volume. This can be because of network connectivity problems or the volume is busy.
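
To identify which FlexGroup constituents are currently offline, the constituent volumes can be listed with a filtered volume show; the vserver name below follows the examples above and is illustrative:

cluster01::*> volume show -vserver vs1 -is-constituent true -state offline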

FlexVol rehost operations might also be impacted by this issue and can become stuck in the Initialized state. The following is an example of a stuck volume rehost job:

cluster01::*> job show -id 24572 -instance

                      Job ID: 24572
              Owning Vserver: svm1
                        Name: volume rehost
                 Description: Volume rehost operation on volume "nfs_data" on Vserver "svm1" to destination Vserver "svm2" by administrator "mycompany\myadmin"
                    Priority: High
                        Node: cluster01-02
                    Affinity: Cluster
                    Schedule: @now
                  Queue Time: 03/23 15:03:29
                  Start Time: 03/23 15:03:32
                    End Time: -
              Drop-dead Time: -
                  Restarted?: false
                       State: Running
                 Status Code: 0
           Completion String:
                    Job Type: VOLUME_REHOST
                Job Category: VOPL
                        UUID: 8b070642-c9c6-11ed-9419-d039ea3b296e
          Execution Progress: Initialized...
                   User Name: mycompany\myadmin
Restart Is Delayed by Module: -
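
To check whether volume rehost jobs elsewhere on the cluster are in a similar state, the jobs can be listed by name; this query is illustrative and assumes the job name matches the Name field shown above:

cluster01::*> job show -name "volume rehost"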

Customers running ONTAP 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1 who experience symptoms similar to those documented above should contact NetApp Technical Support for further assistance.

Workaround

If impacted by this issue while running ONTAP 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1, and unable to upgrade to ONTAP 9.12.1P4 or above, contact NetApp Technical Support for further guidance. Otherwise, upgrading to ONTAP 9.12.1P4 avoids the issues documented in this bulletin. Note that even though this issue is fixed in ONTAP 9.12.1P2 and above, because of other issues seen on ONTAP 9.12.1, it is strongly recommended to upgrade to ONTAP 9.12.1P4 (or later, as available).

Solution

As soon as is operationally feasible, upgrade ONTAP systems running ONTAP 9.12.1RC1, 9.12.1RC1P1, 9.12.1, or 9.12.1P1 to ONTAP 9.12.1P4 (or later, as available) to avoid the issues documented in this bulletin.

  • Upgrading from an earlier (pre-9.12.1) ONTAP release directly to ONTAP 9.12.1P4 or above avoids this problem, as the upgrade code included with that release has been updated to eliminate the cause of this issue.
  • As long as symptoms (FlexGroup constituents becoming inconsistent or being taken offline) have not already been experienced, upgrading from one of the impacted ONTAP 9.12.1 releases to ONTAP 9.12.1P4 invokes an internal process that repairs any RDB inconsistency identified during the upgrade and prevents subsequent problems.
  • FlexGroups that have already been impacted by this issue will need manual recovery with the assistance of NetApp Technical Support before upgrading to ONTAP 9.12.1P4.
  • Note that although this issue is resolved in ONTAP 9.12.1P2 and above, because of other issues seen on ONTAP 9.12.1, it is strongly recommended to upgrade to ONTAP 9.12.1P4 (or later, as available).
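
For reference, an automated non-disruptive upgrade to a fixed patch release generally follows the pattern below. This is an abbreviated illustration only; the web server URL and image file name are placeholders, and the full procedure in the ONTAP upgrade documentation should be followed:

cluster01::> cluster image package get -url http://<web-server>/<ontap-9.12.1P4-image>.tgz
cluster01::> cluster image validate -version 9.12.1P4
cluster01::> cluster image update -version 9.12.1P4
cluster01::> cluster image show-update-progress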

Additional Information

Public Report: BUG ID 1539225