[LU-17325] o2iblnd: graceful handling of CM_EVENT_UNREACHABLE on established connection Created: 30/Nov/23  Updated: 08/Jan/24  Resolved: 20/Dec/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Serguei Smirnov Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: RoCE, o2iblnd

Issue Links:
Related
Severity: 3

 Description   

There have been field reports from RoCE setups showing that CM_EVENT_UNREACHABLE may be received when the connection is already in the ESTABLISHED state.

This causes the assertion in kiblnd_cm_callback() to fail:

 ASSERTION( conn->ibc_state != 3 && conn->ibc_state != 0 ) failed:

It is proposed to handle this more gracefully: report the event as unexpected and allow the flow to continue. If there are genuine problems on the connection, they are expected to surface as transaction errors, and the connection will then be cleaned up without crashing the whole system.



 Comments   
Comment by Gerrit Updater [ 30/Nov/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53298
Subject: LU-17325 o2iblnd: CM_EVENT_UNREACHABLE on established conn
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cbde71bf893dba0de752a190c3b16d653ef75085

Comment by Gerrit Updater [ 20/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53298/
Subject: LU-17325 o2iblnd: CM_EVENT_UNREACHABLE on established conn
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f7051f0092b19416ed86d7f4bbfe1cba7bb74c02

Comment by Peter Jones [ 20/Dec/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:34:30 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.