SU519: [Impact Critical] Repeated ktlsd failures impacting object store, FabricPool, and CBS after upgrade to ONTAP 9.11.1
Last Updated: 10/22/2022, 2:17:30 AM
Summary
[Impact Critical: Potential Data Unavailability]
NetApp has identified an issue with ONTAP version 9.11.1 (9.11.1RC1, 9.11.1, 9.11.1P1, 9.11.1P2). The issue affects access to and availability of object storage used by Cloud Volumes ONTAP and can interfere with the tiering and backup mechanisms.
As a result, NetApp decided to temporarily prevent additional installations of this version in cloud-based environments until a solution was made available. The intent of this action was to minimize the potential impact to customers and the need to apply corrective actions later should additional customers be impacted.
After extensive development and QA activity, a solution for this issue has been made available in the ONTAP 9.11.1P3 release, which was published to the NetApp Support Site on October 14, 2022, and made available in the Azure, AWS, and GCP clouds as of October 20, 2022.
Customers running a version of ONTAP 9.11.1 prior to the 9.11.1P3 release are strongly advised to upgrade to the 9.11.1P3 release at their earliest convenience.
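As a rough illustration of the version guidance above, the following Python sketch classifies a version string as affected or not. It assumes the version is supplied as a bare string such as "9.11.1P2" (the function name `is_affected` and that input format are assumptions made for this example, not part of any NetApp tool), and it treats any 9.11.1 release candidate as prior to 9.11.1P3.

```python
import re

# Matches 9.11.1, 9.11.1RC<n>, and 9.11.1P<n>. Other ONTAP versions are
# outside the scope of this bulletin and are reported as not affected.
_VERSION_RE = re.compile(r"^9\.11\.1(?:RC(?P<rc>\d+)|P(?P<patch>\d+))?$")


def is_affected(version: str) -> bool:
    """Return True if the version is a 9.11.1 build prior to 9.11.1P3."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        return False  # not a 9.11.1 build
    if match.group("rc") is not None:
        return True  # release candidates predate the P3 fix (assumption)
    patch = int(match.group("patch") or 0)  # plain 9.11.1 counts as patch 0
    return patch < 3


if __name__ == "__main__":
    for v in ("9.11.1RC1", "9.11.1", "9.11.1P2", "9.11.1P3"):
        print(v, "affected" if is_affected(v) else "not affected")
```

A version that this sketch reports as affected should be upgraded to 9.11.1P3 as advised above.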
Issue Description
A kernel Transport Layer Security (kTLS) issue can manifest itself in several different ways. It can impact FabricPool tiering, Cloud Backup Services, or SnapMirror, and can result in:
- FabricPool object store unavailability (OSC)
- Failure to create tiering-enabled aggregates in Cloud Manager
- Failures with SnapMirror CVO to CVO relationships
- Failures with Cloud Backup Service (CBS, from on-prem or from CVO) when backing up to cloud-based object stores
- Multi-Disk Panic (MDP) disruptions in Azure HA deployments due to connection problems with the Root Storage Account
- "Cloud tier is not available" reported in the Tiering dashboard in Cloud Manager
- Failure of cluster peer relationships that use encryption (may flip intermittently from Partial to Available, or may fail completely, and may impact intercluster SnapMirror)
Impact is more likely to be seen in Cloud Volumes ONTAP deployments, but it may also be observed on physical storage appliances, especially when they communicate with cloud-based object storage.
Symptom
In many of the failure scenarios, the following may be seen repeatedly (every 10 minutes) in the EMS log.
Thu Sep 15 18:07:32 +0000 [Cluster-01: ktlsd: ktls.failed:notice]: "The TLS connections have failed several times with remote host '##.##.##.##' in IPspace '###', for which the latest reason given is: OpenSSL: error:7E000003:lib(252)::reason(3)."
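To check an exported EMS log for this signature, a small parser along the following lines may help. This is a minimal sketch that assumes the log has been saved as plain text with one event per line in the format shown above; the names `parse_ktls_failure` and `count_failures_by_host` are hypothetical helpers written for this example, not ONTAP commands.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# Pattern derived from the sample ktls.failed line in this bulletin.
KTLS_FAILED_RE = re.compile(
    r"\[(?P<node>[^:\]]+): ktlsd: ktls\.failed:\w+\]: "
    r".*?remote host '(?P<host>[^']*)' in IPspace '(?P<ipspace>[^']*)'"
    r".*?latest reason given is: (?P<reason>.+?)\.?$"
)


class KtlsFailure(NamedTuple):
    node: str
    host: str
    ipspace: str
    reason: str


def parse_ktls_failure(line: str) -> Optional[KtlsFailure]:
    """Parse one EMS line; return None if it is not a ktls.failed event."""
    text = line.strip().rstrip('"')  # drop the closing quote of the message
    match = KTLS_FAILED_RE.search(text)
    if match is None:
        return None
    return KtlsFailure(match["node"], match["host"],
                       match["ipspace"], match["reason"])


def count_failures_by_host(lines: Iterable[str]) -> Counter:
    """Count ktls.failed events per remote host."""
    counts: Counter = Counter()
    for line in lines:
        event = parse_ktls_failure(line)
        if event is not None:
            counts[event.host] += 1
    return counts
```

A remote host that accumulates events at roughly ten-minute intervals matches the repeating pattern described above.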
- Object Store is marked as unavailable
::*> storage aggregate object-store show
Aggregate      Object Store Name Availability  Mirror Type
-------------- ----------------- ------------- -----------
aggr01         StorageAccount    unavailable   primary
aggr02         StorageAccount    unavailable   primary
2 entries were displayed.
- In case of CVO, the creation of a tiering-enabled aggregate via Cloud Manager may fail with one of these errors:
Error:Cannot verify availability of the object store from node <Nodename>. Reason: OpenSSL: in function func(0): reason(3).
Error:Cannot verify availability of the object store from node <Nodename>. Reason: Wrong port or server is not reachable.
- CVO HA systems deployed in Azure are much more likely to experience MDP due to connection problems to the Root Storage Account on 9.11.1.
PANIC: DIAGNOSTIC PANIC Disk deleted or missing on cloud shared HA in SK process config_thread on release 9.11.1 (C)
- The Tiering dashboard in Cloud Manager may display "Cloud tier is not available".
For other symptoms, please refer to the NetApp knowledgebase articles linked in the "Additional Information" section of this bulletin.
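When captured output of `storage aggregate object-store show` needs to be checked across many systems, the unavailable entries can be picked out with a short script. The sketch below assumes the four-column layout shown in this section (aggregate, object store name, availability, mirror type) and that the output has already been captured as text; `unavailable_object_stores` is a hypothetical helper name.

```python
from typing import List, Tuple


def unavailable_object_stores(cli_output: str) -> List[Tuple[str, str]]:
    """Return (aggregate, object store name) pairs marked unavailable."""
    results: List[Tuple[str, str]] = []
    for raw_line in cli_output.splitlines():
        fields = raw_line.split()
        # Data rows carry at least: aggregate, store name, availability, mirror type.
        if len(fields) < 4:
            continue
        # Availability is the second-to-last column; header, separator,
        # prompt, and summary lines never have "unavailable" there.
        if fields[-2] != "unavailable":
            continue
        aggregate = fields[0]
        store_name = " ".join(fields[1:-2])
        results.append((aggregate, store_name))
    return results
```

Applied to the sample output in this section, the function would return the pairs for aggr01 and aggr02.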
Workaround
There is no workaround that prevents the issue.
Solution
A fix for the issues documented in this bulletin has been developed, tested, and released.
The fix is delivered in ONTAP 9.11.1P3. The release was published to the NetApp Support Site for on-premises deployment on October 14, 2022.
In addition, the 9.11.1P3 release was made available in the Azure, AWS, and GCP clouds as of October 20, 2022.
Customers running a version of ONTAP 9.11.1 prior to the 9.11.1P3 release are strongly advised to upgrade to the 9.11.1P3 release at their earliest convenience.
Note: If a system is currently experiencing symptoms of this issue (such as object store unavailability), it is recommended to perform a takeover/giveback operation before beginning the upgrade to a release where this issue is fixed.
Additional Information
For more information, see BUG 1494466.
In addition, the following KB articles might be helpful when encountering these issues:
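Lenovo NetApp Technology Co., Ltd. ("Lenovo NetApp") makes no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided on this page, or regarding any results that may be obtained by using this information or following the recommendations provided herein. The information on this page is distributed as is, and the use of this information or the implementation of any recommendations or techniques herein is the customer's responsibility and depends on the customer's ability to evaluate the information and integrate it into the customer's operational environment. This page and the information it contains may be used only in connection with the NetApp products discussed on this page. In no event shall Lenovo NetApp be liable for any special, indirect, or consequential damages, or for any damages resulting from loss of use, data, or profits, whether in an action of contract, negligence, or other tortious action, arising out of or in connection with the use or performance of the information provided on this page.
For the latest information, refer to the support bulletins on the NetApp official website.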