SU538: [Impact: High] System disruption from excessive internal mount points after upgrade

Last Updated:
8/25/2023, 3:01:40 AM

Summary

[Impact High: Node disruption]

A defect in ONTAP Automated Non-Disruptive Upgrade (ANDU) logic might result in the continuous creation of duplicate internal mounts to the upgrade image store location when upgrade status checks are exercised via API. The affected node will eventually panic unless a workaround is performed to stop the creation of these mounts.

Issue Description

A defect in ONTAP Automated Non-Disruptive Upgrade (ANDU) logic might result in the creation of duplicate internal mounts to the upgrade image store location when upgrade status checks are exercised via API.

The affected node will eventually panic after the creation of thousands of duplicate mounts unless a workaround is performed to stop the creation of these mounts.

This issue might be seen on configurations where all of the following apply:

  1. Clustered configurations with 2 or more nodes
  2. Upgraded to the following releases via ANDU:
    • 9.11.1P8 - 9.11.1P10
    • 9.12.1P2 - 9.12.1P6
    • 9.13.1RC1 - 9.13.1P1
    • 9.14.0
  3. A monitoring application such as BlueXP frequently sends one of the following commands to the affected cluster:
    • system-image-get-iter (ONTAPI)
    • GET /api/cluster/software (REST API)
    • cluster image show-update-progress (CLI)

This issue is not guaranteed to occur with the above configuration. However, if the number of mounts is increasing, the workaround must be performed to stop the accumulation, because systems with a high number of mount points have experienced panics.
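As a rough aid for condition 2 above, the affected release ranges can be encoded and checked programmatically. The following is a minimal sketch, not a NetApp tool: the function names are hypothetical, and it assumes that releases within a train order as RC, then GA, then P1, P2, and so on. It does not check conditions 1 and 3, which must also apply.

```python
import re

# Matches strings such as "9.12.1P2", "9.13.1RC1" or "9.14.0".
_VERSION_RE = re.compile(r"^(\d+\.\d+\.\d+)(?:(RC|P)(\d+))?$")


def release_key(version):
    """Split a version into (train, rank), where rank sorts RC < GA < P."""
    m = _VERSION_RE.match(version.strip())
    if m is None:
        raise ValueError(f"unrecognized ONTAP version: {version!r}")
    train, kind, num = m.groups()
    if kind == "RC":
        rank = (0, int(num))
    elif kind == "P":
        rank = (2, int(num))
    else:
        rank = (1, 0)  # GA release, no suffix
    return train, rank


# Inclusive (low, high) bounds per release train, copied from the list above.
AFFECTED = {
    "9.11.1": ((2, 8), (2, 10)),  # 9.11.1P8 - 9.11.1P10
    "9.12.1": ((2, 2), (2, 6)),   # 9.12.1P2 - 9.12.1P6
    "9.13.1": ((0, 1), (2, 1)),   # 9.13.1RC1 - 9.13.1P1
    "9.14.0": ((1, 0), (1, 0)),   # 9.14.0 only
}


def is_affected_release(version):
    """Return True if the version falls inside one of the listed ranges."""
    train, rank = release_key(version)
    bounds = AFFECTED.get(train)
    if bounds is None:
        return False
    low, high = bounds
    return low <= rank <= high
```

For example, `is_affected_release("9.12.1P4")` evaluates to `True`, while `is_affected_release("9.12.1P7")` evaluates to `False`.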

This issue is addressed by two fixes:

1574557 - Prevents the build up of internal mounts due to incoming upgrade status APIs.

1566466 - Addresses the panic due to a high volume of internal mounts. This panic is extremely unlikely to occur on systems with the fix for 1574557.

Symptom

The initial symptom can be seen by running the following command (requires an unlocked "Diag" user account):

clustername::> set -priv diag

Warning: These diagnostic commands are for use by NetApp personnel only.
Do you want to continue? {y|n}: y

clustername::*> system node systemshell -node nodename -command "df -ik"

The bottom of the command output will show many instances of this duplicate mount:

/mroot/etc/NDU/mnt/store

This is also viewable in the Digital Advisor \ Storage Health \ AutoSupport by viewing the BSD-DF-I-K section of this message type: HA Group Notification (WEEKLY_LOG) NOTICE
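Counting the duplicate mounts by eye becomes impractical as the output grows. The sketch below assumes the "df -ik" output (or the BSD-DF-I-K section of an AutoSupport) has been saved as text, and that the mount point is the last column of each row, as in standard BSD df output. The function name is hypothetical. The threshold of 200 is the count above which Digital Advisor reports this risk.

```python
DUPLICATE_MOUNT = "/mroot/etc/NDU/mnt/store"
RISK_THRESHOLD = 200  # Digital Advisor reports the risk when the count is >200


def count_store_mounts(df_output):
    """Count df rows whose last column (the mount point) is the NDU store."""
    count = 0
    for line in df_output.splitlines():
        fields = line.split()
        # Skip blank lines; compare the mount point column exactly.
        if fields and fields[-1] == DUPLICATE_MOUNT:
            count += 1
    return count


def exceeds_risk_threshold(df_output):
    """Return True when the duplicate mount count is above the threshold."""
    return count_store_mounts(df_output) > RISK_THRESHOLD
```

Running the count on two captures taken some hours apart shows whether the number of mounts is still increasing.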

If the count of duplicate mounts is increasing, other symptoms and behavior may be noted as the command output becomes increasingly large.

Another symptom will be a very large /mroot/etc/log/mlog/mgwd.log with many recurring instances of "INFO: NDU::mgwd: MountPackageByVersion" days or weeks after the actual upgrade.
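If a copy of mgwd.log is available for offline review, the recurring message can be counted with a few lines of Python. This is a minimal sketch with a hypothetical function name. It only matches the message text quoted above and makes no assumption about the rest of the log line format.

```python
MARKER = "INFO: NDU::mgwd: MountPackageByVersion"


def count_mount_messages(lines):
    """Count log lines that contain the MountPackageByVersion message."""
    return sum(1 for line in lines if MARKER in line)
```

The function accepts any iterable of strings, so an open file object can be passed directly, for example `count_mount_messages(open("mgwd.log", errors="replace"))` for a local copy of the log.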

Eventually the system may panic. Panic messages might include "Process <process name> unresponsive" or "thread (sysctl) on cpu <##> hung". After the panic, the mounts will be cleaned up, although they may continue to accumulate if the workaround is not performed.

Workaround

  • Delete the staged cluster image package after the upgrade is complete. This will stop additional mounts from accumulating due to the APIs.

cluster::> cluster image package show

Package Version  Package Build Time
---------------- ------------------
9.12.1P2         4/5/2023 14:58:46

cluster::> cluster image package delete -version 9.12.1P2

  • Optionally: After deleting the staged cluster image package, if the number of mounts is affecting AutoSupport generation (which includes two command outputs that list all of the mounts and require increasing amounts of memory to collect), the excessive mounts can be removed via a takeover and giveback of the affected node.
    • This is strongly advised for configurations with thousands of duplicate mounts.
  • NetApp Digital Advisor reports a High Impact risk based on the BSD-DF-I-K section of AutoSupport when the count of /mroot/etc/NDU/mnt/store is >200. This means the risk will continue to be reported by Digital Advisor even after the staged cluster image package is deleted. If you do not plan to clear the duplicate mounts, the risk can be resolved by selecting the "Acknowledge" and "Workaround Applied" options.

Solution

The change that caused the excessive internal duplicate mounts has been removed from ONTAP by the fix for 1574557.

Upgrade to the following releases to avoid impact from this issue:

  • 9.11.1P11 or later
  • 9.12.1P7 (pending availability)
  • 9.13.1P2 (pending availability)
  • 9.14.0P1 (pending availability)

Upgrading to these releases will:

  1. Prevent occurrence/recurrence of the buildup of internal duplicate mounts.
  2. Clear existing internal duplicate mounts, if upgrading from an impacted release.

While these releases do not address the panic due to thousands of internal mounts, described by 1566466, they prevent the cause of the panic and effectively protect against this issue.
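The minimum fixed releases listed above can be captured in a small lookup for use in upgrade planning. This is a sketch under stated assumptions: the function name is hypothetical, it assumes that later patch releases in each listed train also carry the fix (the list states "or later" only for 9.11.1P11), and it returns `None` for release trains that this bulletin does not mention rather than guessing.

```python
import re

# Minimum patch level with the fix for 1574557, per release train listed above.
MIN_FIXED_PATCH = {"9.11.1": 11, "9.12.1": 7, "9.13.1": 2, "9.14.0": 1}

_VERSION_RE = re.compile(r"^(\d+\.\d+\.\d+)(?:(RC|P)(\d+))?$")


def has_mount_fix(version):
    """Return True/False for listed trains, None for trains not listed."""
    m = _VERSION_RE.match(version.strip())
    if m is None:
        raise ValueError(f"unrecognized ONTAP version: {version!r}")
    train, kind, num = m.groups()
    minimum = MIN_FIXED_PATCH.get(train)
    if minimum is None:
        return None  # train not covered by this bulletin
    # RC and GA releases have no patch level, so treat them as patch 0.
    patch = int(num) if kind == "P" else 0
    return patch >= minimum
```

For example, `has_mount_fix("9.12.1P6")` evaluates to `False` and `has_mount_fix("9.12.1P7")` evaluates to `True`.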

Additional Information