SU519: [Impact Critical] Repeated ktlsd failures impacting object store, FabricPool, and CBS after upgrade to ONTAP 9.11.1

Last Updated:
10/22/2022, 2:17:30 AM

Summary

[Impact Critical: Potential Data Unavailability]

NetApp has recently identified an issue with ONTAP version 9.11.1 (9.11.1RC1, 9.11.1, 9.11.1P1, 9.11.1P2). The issue affects access to, and availability of, object storage used by Cloud Volumes ONTAP and can interfere with the tiering and backup mechanisms.

As a result, NetApp temporarily blocked additional installations of this version in cloud-based environments until a solution was made available. The intent was to minimize the potential impact to customers and to avoid the need for corrective actions later, should additional customers be affected.

After extensive development and QA activity, a solution for this issue has been made available in the ONTAP 9.11.1P3 release which was published to the NetApp Support Site on October 14, 2022, and which was made available in the Azure, AWS and GCP clouds as of October 20, 2022.

Customers running a version of ONTAP 9.11.1 prior to the 9.11.1P3 release are strongly advised to upgrade to the 9.11.1P3 release at their earliest convenience.

Issue Description

A kernel Transport Layer Security (kTLS) issue can manifest itself in several different ways. It can impact FabricPool tiering, Cloud Backup Services, or SnapMirror, and can result in:

  • FabricPool object store unavailability (OSC)
  • Failure to create tiering-enabled aggregates in Cloud Manager
  • Failures with SnapMirror CVO to CVO relationships
  • Failures with Cloud Backup Service (CBS, from on-prem or from CVO) when backing up to cloud-based object stores
  • Multi-Disk Panic (MDP) disruptions in Azure HA deployments due to connection problems with the Root Storage Account
  • "Cloud tier is not available" reported in the Tiering dashboard in Cloud Manager
  • Failure of cluster peer relationships that use encryption (may flip intermittently from Partial to Available, or may fail completely, and may impact intercluster SnapMirror)

Impact is more likely to be seen in Cloud Volumes ONTAP deployments, but it may also be observed on physical storage appliances, especially when they communicate with cloud-based object storage.

Symptom

In many of the failure scenarios, the following may be seen repeatedly (every 10 minutes) in the EMS log.

Thu Sep 15 18:07:32 +0000 [Cluster-01: ktlsd: ktls.failed:notice]: "The TLS connections have failed several times with remote host '##.##.##.##' in IPspace '###', for which the latest reason given is: OpenSSL: error:7E000003:lib(252)::reason(3)."
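To check a saved EMS log for this symptom, something like the following Python sketch could be used. It is illustrative only: the pattern is modeled on the single sample line shown above, and other ONTAP releases may format the message differently. The function name `find_ktls_failures` is hypothetical and is not part of any NetApp tool.

```python
import re

# Pattern modeled on the sample EMS line in this bulletin (assumption:
# the real message keeps this layout across releases).
KTLS_FAILED = re.compile(
    r"\[(?P<node>[^:\]]+): ktlsd: ktls\.failed:(?P<severity>\w+)\]: "
    r"\"?The TLS connections have failed several times with remote host "
    r"'(?P<host>[^']*)' in IPspace '(?P<ipspace>[^']*)'"
)


def find_ktls_failures(lines):
    """Return a (node, host, ipspace) tuple for each ktls.failed event."""
    hits = []
    for line in lines:
        match = KTLS_FAILED.search(line)
        if match:
            hits.append(
                (match.group("node"), match.group("host"), match.group("ipspace"))
            )
    return hits
```

Several hits for the same remote host, spaced roughly 10 minutes apart, would be consistent with the repeating pattern described above.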

  • Object Store is marked as unavailable

::*> storage aggregate object-store show
    Aggregate      Object Store Name Availability   Mirror Type
    -------------- ----------------- -------------  -----------
    aggr01        StorageAccount   unavailable    primary
    aggr02        StorageAccount   unavailable    primary
    2 entries were displayed.
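
When the command output above has been captured as text, the unavailable object stores can be picked out with a short script. The following Python sketch assumes the four-column layout shown above and that neither the aggregate name nor the object store name contains spaces. The function name `unavailable_object_stores` is hypothetical.

```python
def unavailable_object_stores(cli_output):
    """Return (aggregate, object_store) pairs marked 'unavailable'."""
    result = []
    for line in cli_output.splitlines():
        fields = line.split()
        # Data rows have exactly four whitespace-separated columns:
        # aggregate, object store name, availability, mirror type.
        # The header, separator, and summary rows never have
        # 'unavailable' in the third column, so they are skipped.
        if len(fields) == 4 and fields[2] == "unavailable":
            result.append((fields[0], fields[1]))
    return result
```

An empty result means no row in the captured output reported `unavailable`. It does not by itself rule out the other symptoms listed in this bulletin.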

  • In the case of CVO, the creation of a tiering-enabled aggregate via Cloud Manager may fail with one of these errors:

Error:Cannot verify availability of the object store from node <Nodename>. Reason: OpenSSL: in function func(0): reason(3).

Error:Cannot verify availability of the object store from node <Nodename>. Reason: Wrong port or server is not reachable.

  • CVO HA systems deployed in Azure are much more likely to experience MDP due to connection problems to the Root Storage Account on 9.11.1.

PANIC: DIAGNOSTIC PANIC Disk deleted or missing on cloud shared HA in SK process config_thread on release 9.11.1 (C)

  • The Tiering dashboard in Cloud Manager may display "Cloud tier is not available".

For other symptoms, please refer to the NetApp knowledgebase articles linked in the "Additional Information" section of this bulletin.

Workaround

There is no workaround that prevents the issue.

Solution

A solution that addresses the issues documented in this bulletin has been identified and, after extensive development and QA effort, has been released.

The ONTAP version that delivers this solution is ONTAP 9.11.1P3. The release was published to the NetApp Support Site for on-premises deployment on October 14, 2022.

In addition, the 9.11.1P3 release was made available in the Azure, AWS and GCP clouds as of October 20, 2022.

Customers running a version of ONTAP 9.11.1 prior to the 9.11.1P3 release are strongly advised to upgrade to the 9.11.1P3 release at their earliest convenience.

Note: If currently experiencing symptoms (such as object store unavailability) due to this issue, it is recommended that a takeover/giveback operation be executed before beginning the upgrade to a release where this issue is fixed.
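
For fleets with many systems, the affected-release list from the Summary can be turned into a simple check. The Python sketch below assumes the version has already been reduced to a bare release string such as `9.11.1P2`. How that string is obtained from a system is outside this sketch, and the function name `needs_upgrade` is hypothetical.

```python
# Releases named as affected in this bulletin.
AFFECTED_RELEASES = {"9.11.1RC1", "9.11.1", "9.11.1P1", "9.11.1P2"}


def needs_upgrade(release):
    """Return True if the release string is one this bulletin lists as affected."""
    return release.strip() in AFFECTED_RELEASES
```

The check is an exact match against the four releases listed in this bulletin. Any other build is reported as not affected, so builds not named here should be verified separately.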

Additional Information

For more information, see BUG 1494466.

In addition, the following KB articles might be helpful when encountering these issues: