[LU-14499] o2iblnd: LU-13368 changes cause shutdown procedure to not complete Created: 08/Mar/21  Updated: 30/Jan/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Serguei Smirnov Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13638 lnet: discard the callback Open
is related to LU-13368 lnet may be trying to use deleted rou... Resolved
Severity: 3

 Description   

Changes applied by the patches from LU-13368 appear to sometimes prevent the o2iblnd shutdown procedure from completing on lustre_rmmod.

When that happens, messages similar to the following keep appearing in the log:

[51025.354675] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect
[51029.354481] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect
[51037.353971] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect
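
For anyone trying to confirm the same symptom on their own system, here is a minimal Python sketch (not part of the original report) that scans log text for the message above and flags the case where the peer count never drops. The regular expression is derived only from the three sample lines quoted here, and the `looks_stuck` heuristic and its `min_repeats` threshold are hypothetical illustrations, not anything defined by Lustre or LNet.

```python
import re

# Pattern derived only from the sample lines quoted in this ticket;
# other Lustre versions may format the message differently.
WAIT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LNet:.*kiblnd_shutdown\(\)\)\s+"
    r"(?P<nid>\S+): waiting for (?P<peers>\d+) peers? to disconnect"
)


def parse_wait_lines(lines):
    """Return (timestamp, nid, peer_count) for each matching log line."""
    events = []
    for line in lines:
        match = WAIT_RE.search(line)
        if match:
            events.append(
                (float(match.group("ts")), match.group("nid"), int(match.group("peers")))
            )
    return events


def looks_stuck(events, min_repeats=3):
    """Hypothetical heuristic: the last min_repeats messages report the
    same NID with an unchanged, non-decreasing peer count."""
    if len(events) < min_repeats:
        return False
    tail = events[-min_repeats:]
    return len({(nid, peers) for _, nid, peers in tail}) == 1


SAMPLE = """\
[51025.354675] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect
[51029.354481] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect
[51037.353971] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect
"""

if __name__ == "__main__":
    sample_events = parse_wait_lines(SAMPLE.splitlines())
    print("stuck" if looks_stuck(sample_events) else "progressing")
```

The sample input can be swapped for captured `dmesg` output; a count that stays constant across repeated messages matches the behaviour described in this ticket, while a falling count suggests shutdown is still making progress.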

 Comments   
Comment by Gerrit Updater [ 08/Mar/21 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41937
Subject: LU-14499 lnet: Revert "LU-13368 lnet: discard the callback"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eda619ef352141de76b3bc2fe97d56ed68c7c9d9

Comment by Chris Horn [ 14/Feb/22 ]

ssmirnov could this issue impact ksocklnd as well?

Comment by Serguei Smirnov [ 14/Feb/22 ]

Chris,

Although I concluded that the LU-13368 patch is causing the issue, I never fully understood what exactly was going wrong. It could be specific to o2iblnd. I don't recall socklnd getting stuck in the same manner.

Comment by Chris Horn [ 22/Aug/22 ]

We traced a memory leak back to the LU-13368 change. Given Alexey's prior misgivings about the patch, and its known bugginess, I think we should revert it.

Comment by Olaf Faaland [ 12/Jan/23 ]

Hi Serguei,
Is this stuck because you need more information?
thanks,

Comment by Serguei Smirnov [ 12/Jan/23 ]

Hi Olaf,

From the comments in LU-13368 and this ticket, it looks like the "lnet: discard the callback" change should be reverted. On the other hand, Yang Sheng supplied potential fixes that were never tested. If I remember correctly, this stalled pending the test results, which would help decide whether to revert the change or keep it and add the fixes.

ys, sihara: is my understanding correct? 

Thanks,

Serguei. 

Comment by Yang Sheng [ 13/Jan/23 ]

Hi, Serguei, Yes, you are right.

Comment by Olaf Faaland [ 13/Jan/23 ]

What are the gerrit URLs for those changes? Thanks.

Comment by Serguei Smirnov [ 16/Jan/23 ]

Hi Olaf,

ys will correct me if I'm wrong, but I believe these are the two changes which are supposed to be fixing the original "discard the callback":

https://review.whamcloud.com/#/c/fs/lustre-release/+/40937/

https://review.whamcloud.com/#/c/fs/lustre-release/+/41970/

Thanks,

Serguei.

Comment by Yang Sheng [ 17/Jan/23 ]

Sorry for the delay. Yes, Serguei is right.
https://review.whamcloud.com/#/c/fs/lustre-release/+/38845/ is the original patch.
https://review.whamcloud.com/#/c/fs/lustre-release/+/40937/ is a patch that works with 38845 to provide full functionality.
https://review.whamcloud.com/#/c/fs/lustre-release/+/41970/ is a bug fix for this ticket. Since I think it should be tested first, I marked it as a 'test patch'.

Comment by Olaf Faaland [ 23/Jan/23 ]

Hi Serguei and Yang Sheng,

Thanks for clarifying. It looks like changes 40937 and 41970 aren't progressing. Are you waiting on something?

Thanks

Comment by Serguei Smirnov [ 30/Jan/23 ]

Hi Olaf, the problem here appears to be that, even though the patches are code-complete and Maloo-tested, we're not able to verify Yang Sheng's fixes in a proper IB environment because Shuichi doesn't have the resources available. Would you be able to give these patches a try on your system?

 
