SU538: [Impact: High] System disruption from excessive internal mount points after upgrade

Last Updated: 8/25/2023, 3:01:40 AM

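Disclaimer:
Lenovo NetApp Technology Co., Ltd. ("Lenovo NetApp") provides no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided on this page, or with respect to any results that may be obtained by the use of the information or observance of the recommendations provided herein. The information on this page is distributed AS IS, and the use of this information or the implementation of any recommendations or techniques herein is the customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This page and the information contained herein may be used solely in connection with the NetApp products discussed on this page. In no event shall Lenovo NetApp be liable for any special, indirect, or consequential damages, or any damages whatsoever resulting from loss of use, data, or profits, whether in an action of contract, negligence, or other tortious action, arising out of or in connection with the use or performance of the information provided on this page.

For the latest information, refer to the support bulletins on the NetApp official website.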

Summary

[Impact High: Node disruption]

A defect in ONTAP Automated Non-Disruptive Upgrade (ANDU) logic might result in the continuous creation of duplicate internal mounts to the upgrade image store location when upgrade status checks are exercised via API. The affected node will eventually panic unless a workaround is performed to stop the creation of these mounts.

Issue Description

A defect in ONTAP Automated Non-Disruptive Upgrade (ANDU) logic might result in the creation of duplicate internal mounts to the upgrade image store location when upgrade status checks are exercised via API.

The affected node will eventually panic after the creation of thousands of duplicate mounts unless a workaround is performed to stop the creation of these mounts.

This issue might be seen on configurations where all of the following apply:

- Clustered configurations with two or more nodes
- Upgraded to one of the following releases via ANDU:
  - 9.11.1P8 through 9.11.1P10
  - 9.12.1P2 through 9.12.1P6
  - 9.13.1RC1 through 9.13.1P1
  - 9.14.0
- A monitoring application such as BlueXP frequently sends one of the following commands to the affected cluster:
  - system-image-get-iter (ONTAPI)
  - GET /api/cluster/software (REST API)
  - cluster image show-update-progress (CLI)

This issue is not guaranteed to occur with the above configuration. However, if the number of mounts is increasing, the workaround must be performed to stop the accumulation, because systems with a high number of mount points have experienced panics.
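
The release ranges listed above can be checked programmatically. The following Python sketch is illustrative only and is not a NetApp tool: the function name `is_affected`, the version-string pattern, and the ordering RC < GA < P within a release train are assumptions made for this example. In particular, it assumes that the range from 9.13.1RC1 to 9.13.1P1 includes the 9.13.1 GA release.

```python
import re

# Affected ranges as listed in this bulletin:
# (release train, first affected suffix, last affected suffix).
# An empty suffix stands for the GA release of that train.
AFFECTED_RANGES = [
    ("9.11.1", "P8", "P10"),
    ("9.12.1", "P2", "P6"),
    ("9.13.1", "RC1", "P1"),
    ("9.14.0", "", ""),
]

# Assumed version format: <major>.<minor>.<maintenance>[RCn|Pn]
_VERSION_RE = re.compile(r"^(\d+\.\d+\.\d+)(RC\d+|P\d+)?$")


def _suffix_rank(suffix):
    """Sort key within a train. Assumed ordering: RCn < GA (no suffix) < Pn."""
    if not suffix:
        return (1, 0)
    if suffix.startswith("RC"):
        return (0, int(suffix[2:]))
    return (2, int(suffix[1:]))


def is_affected(version):
    """Return True if `version` falls inside one of the listed ranges."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    train, suffix = match.group(1), match.group(2) or ""
    rank = _suffix_rank(suffix)
    for affected_train, first, last in AFFECTED_RANGES:
        if train == affected_train and _suffix_rank(first) <= rank <= _suffix_rank(last):
            return True
    return False
```

A release inside a listed range only satisfies one of the three conditions. The cluster size and the presence of frequent upgrade status queries still need to be confirmed separately.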

This issue is addressed by two fixes:

1574557 - Prevents the buildup of internal mounts caused by incoming upgrade status API calls.

1566466 - Addresses the panic caused by a high volume of internal mounts. This panic is extremely unlikely to occur on systems with the fix for 1574557.

Symptom

The initial symptom can be seen by typing the following command (requires unlocked "Diag" user account):

clustername::> set -priv diag

Warning: These diagnostic commands are for use by NetApp personnel only. Do you want to continue? {y|n}: y

clustername::*> system node systemshell -node nodename -command "df -ik"

The bottom of the command output will show many instances of this duplicate mount:

/mroot/etc/NDU/mnt/store

The duplicate mounts are also visible in Digital Advisor under Storage Health > AutoSupport, in the BSD-DF-I-K section of the message type HA Group Notification (WEEKLY_LOG) NOTICE.

If the count of duplicate mounts is increasing, other symptoms may appear as the command output grows larger.
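
To judge whether the count is increasing, the "df -ik" output can be saved at two points in time and compared. The following Python sketch is a minimal example, not a NetApp utility: the function names are hypothetical, and it assumes that the mount point is the last whitespace-separated column of each output line.

```python
# Mount point named in this bulletin.
STORE_MOUNT = "/mroot/etc/NDU/mnt/store"


def count_store_mounts(df_output):
    """Count output lines whose last column is the NDU store mount point.

    Assumes the mount point is the final whitespace-separated field,
    as in typical df output.
    """
    count = 0
    for line in df_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == STORE_MOUNT:
            count += 1
    return count


def is_accumulating(earlier_output, later_output):
    """Return True if the later sample has more store mounts than the earlier one."""
    return count_store_mounts(later_output) > count_store_mounts(earlier_output)
```

Both functions take the captured command output as a string, so they work equally on text copied from the CLI session or from the BSD-DF-I-K section of an AutoSupport message.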

Another symptom is a very large /mroot/etc/log/mlog/mgwd.log file containing many recurring instances of "INFO: NDU::mgwd: MountPackageByVersion" that are logged days or weeks after the actual upgrade.
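
If a copy of the log is available for offline review, the recurring entries can be counted with a simple substring match. The Python sketch below is illustrative: the function names and the idea of working on a local copy of mgwd.log are assumptions, and the code makes no assumption about the rest of the log line format.

```python
# Message text quoted in this bulletin.
MARKER = "INFO: NDU::mgwd: MountPackageByVersion"


def count_mount_log_entries(lines):
    """Count lines that contain the MountPackageByVersion message."""
    return sum(1 for line in lines if MARKER in line)


def count_in_file(path):
    """Count matching entries in a local copy of the log (hypothetical path)."""
    # errors="replace" avoids failing on any undecodable bytes in the log.
    with open(path, "r", errors="replace") as handle:
        return count_mount_log_entries(handle)
```

A count that keeps growing between two copies of the log taken well after the upgrade is consistent with the symptom described above.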

Eventually the system may panic. Panic messages might include "Process <process name> unresponsive" or "thread (sysctl) on cpu <##> hung". After the panic, the existing mounts are cleaned up, but new mounts may continue to accumulate if the workaround is not performed.

Workaround

Delete the staged cluster image package after the upgrade is complete. This stops additional mounts from accumulating as a result of the upgrade status API calls.

cluster::> cluster image package show

Package Version  Package Build Time
---------------  ------------------
9.12.1P2         4/5/2023 14:58:46

cluster::> cluster image package delete -version 9.12.1P2

Optionally: After deleting the staged cluster image package, if the number of mounts is impacting AutoSupport generation (which includes two command outputs that list all of the mounts and require increasing amounts of memory to collect), the excessive mounts can be removed by performing a takeover and giveback of the affected node.
- This is strongly advised for configurations with thousands of duplicate mounts.

NetApp Digital Advisor reports a High Impact risk based on the BSD-DF-I-K section of AutoSupport when the count of /mroot/etc/NDU/mnt/store is greater than 200. This means the risk will continue to be reported by Digital Advisor even after the staged cluster image package is deleted. If you do not plan to clear the duplicate mounts, the risk can be resolved by selecting the "Acknowledge" and "Workaround Applied" options.

Solution

The change that caused the excessive internal duplicate mounts has been removed from ONTAP by the fix for 1574557.

Upgrade to the following releases to avoid impact from this issue:

9.11.1P11 or later
9.12.1P7 (pending availability)
9.13.1P2 (pending availability)
9.14.0P1 (pending availability)

Upgrading to these releases will:

Prevent occurrence or recurrence of the buildup of internal duplicate mounts.
Clear existing internal duplicate mounts, if upgrading from an impacted release.

While these releases do not address the panic due to thousands of internal mounts, described by 1566466, they prevent the cause of the panic and effectively protect against this issue.

Additional Information

Bug 1566466
Bug 1574557

NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.