[LU-16949] LNet: deadlock on o2ib NI going down under Centos 7.9 Created: 07/Jul/23  Updated: 22/Jan/24  Resolved: 08/Sep/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Serguei Smirnov
Assignee: Serguei Smirnov
Resolution: Fixed
Votes: 0
Labels: lnet, o2iblnd
Environment:

CentOS 7.9 VM, kernel 3.10.0-1160.25.1.el7_lustre.x86_64
Could not reproduce on CentOS 8.2


Issue Links:
Duplicate
Related
Severity: 3

 Description   

The issue can be reproduced by adding an o2ib NI and then interrupting the corresponding link: pull the cable, shut down the switch connection, or shut down the whole switch.

Alternatively, adding the o2ib NI while the corresponding link is already down (cable pulled) has the same effect.

Using "ifdown" to bring the whole interface down does not reproduce the problem.
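The first reproduction path above can be sketched as a short script. This is only an illustration under assumptions: the interface name `ib0`, the `modprobe lnet` and `lnetctl lnet configure` setup steps, and the final status check are not taken from this report, and the link interruption itself has to be done by hand. By default the script only prints the commands; set `DRY_RUN=0` to actually run them (as root, on a test node).

```shell
#!/bin/sh
# Sketch of the "add NI, then interrupt the link" reproduction path.
# ASSUMPTIONS (not from the report): interface name ib0, the LNet setup
# steps, and the final "net show" check.
IFACE="${IFACE:-ib0}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce_steps() {
    run modprobe lnet
    run lnetctl lnet configure
    # Add the o2ib NI on the assumed InfiniBand interface.
    run lnetctl net add --net o2ib --if "$IFACE"
    # Manual step: the report says ifdown does NOT reproduce the problem.
    echo "# now interrupt the link: pull the cable or shut down the switch connection"
    # Inspect the NI afterwards (assumed check, not described in the report).
    run lnetctl net show --net o2ib -v
}

reproduce_steps
```

For the second path described above, the link would be interrupted before the `lnetctl net add` step instead of after it.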

I could reproduce this on a CentOS 7.9 VM, but not on a CentOS 8.2 system.

The issue was introduced by

commit da230373bd14306cb97fb48748ebce205f09d468
Author: Serguei Smirnov <ssmirnov@whamcloud.com>
Date:   Thu Feb 16 10:34:03 2023 -0800
LU-16563 lnet: use discovered ni status to set initial health 

It was then masked by another issue, which caused adding an o2ib NI to fail, starting from

commit cc5594df3e70d1924f34ccdf4c3178654d277ad0
Author: Shaun Tancheff <shaun.tancheff@hpe.com>
Date:   Sun Apr 23 07:19:11 2023 -0500
LU-16759 o2ib: MOFED 5.5+ ib_dma_virt_map_sg

until a later commit, which I did not identify, made it possible to add an o2iblnd NI again. The latest master behaves on CentOS 7.9 as described above.



 Comments   
Comment by Gerrit Updater [ 11/Jul/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51635
Subject: LU-16949 lnet: get monitor thread to update ping buffer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4522d02f3962130bab89a29c1fd8c393ba412faf

Comment by Gerrit Updater [ 07/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51635/
Subject: LU-16949 lnet: get monitor thread to update ping buffer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7ac399c5aec01186ad4c9a7153aea400777c897f

Comment by Peter Jones [ 08/Sep/23 ]

landed for 2.16
