[LU-8249] Potential deadlock in lnet Created: 07/Jun/16  Updated: 05/Aug/20  Resolved: 29/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: CEA Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Attachments: HTML File partial_stack_lnet_deadlock    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing a patch, I managed to deadlock all the CPUs of my machine.
After investigating the log reports, I found that "LNetMDAttach" (implemented in lnet/lnet/lib-md.c) calls the "vfree" function in interrupt context, which is illegal in Linux kernel versions prior to 3.10.

I cannot be sure that this was actually the cause of my problem, though, because I have not been able to reproduce it since then.



 Comments   
Comment by Gerrit Updater [ 07/Jun/16 ]

Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20676
Subject: LU-8249 lnet: potential deadlock in lnet
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ef1052386265b73c129c2dc5d20bd86e175a8c0d

Comment by Bruno Faccini (Inactive) [ 08/Jun/16 ]

Hello Quentin, could you provide additional information to help us understand the problem? For example, the log reports you refer to, and a crash dump if one is available?

Comment by Joseph Gmitter (Inactive) [ 08/Jun/16 ]

Hi Doug,

Can you please investigate this patch?

Thanks.
Joe

Comment by Quentin Bouget [ 09/Jun/16 ]

Hello Bruno, I actually do not have much; I did not realize what this was at the time. I thought it was related to my patch and would be easy to reproduce. The best I have is a partial call stack implicating "cfs_percpt_lock" (see the attached file). I reviewed the code that uses it and found that "vfree" and "cfs_percpt_unlock" are called in the wrong order in "LNetMDAttach".
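The ordering problem described above can be illustrated with a small user-space sketch. This is not the Lustre code: a pthread mutex stands in for the lock taken by "cfs_percpt_lock", free() stands in for "vfree", and every name in it (res_lock, md_sketch, md_attach_sketch, and so on) is hypothetical. It only shows the intended shape of the error path, assuming the fix is to release the lock before freeing.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins: res_lock plays the role of the LNet resource
 * lock, and free() plays the role of vfree(). */
static pthread_mutex_t res_lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_held;        /* 1 while res_lock is held */
static int freed_under_lock; /* counts frees done with the lock held */

struct md_sketch {
    int threshold;
};

static void res_lock_acquire(void)
{
    pthread_mutex_lock(&res_lock);
    lock_held = 1;
}

static void res_lock_release(void)
{
    assert(lock_held); /* releasing a lock we do not hold is a bug */
    lock_held = 0;
    pthread_mutex_unlock(&res_lock);
}

/* Frees a descriptor and records whether the lock was held at that time. */
static void md_free(struct md_sketch *md)
{
    if (lock_held)
        freed_under_lock++;
    free(md);
}

/* Returns 0 on success, -1 on failure. On failure the lock is dropped
 * first and the descriptor is freed afterwards. */
static int md_attach_sketch(int threshold, struct md_sketch **out)
{
    struct md_sketch *md = malloc(sizeof(*md));

    if (md == NULL)
        return -1;

    res_lock_acquire();
    if (threshold < 0) {      /* hypothetical validation failure */
        res_lock_release();   /* unlock first ... */
        md_free(md);          /* ... then free, outside the lock */
        return -1;
    }
    md->threshold = threshold;
    *out = md;
    res_lock_release();
    return 0;
}
```

Swapping the two calls in the failure branch would make freed_under_lock non-zero, which is the inverted order reported in this ticket.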

Comment by Bruno Faccini (Inactive) [ 06/Jul/16 ]

Hello Quentin, even though I don't think it will fix the specific deadlock you encountered, I am not against your patch: it looks OK, and it complies with the rule against calling vfree() while holding a spinlock. However, I think you should at least modify LNetMDBind() the same way, and there also avoid calling lnet_res_unlock() upon lnet_res_lock_current() failure.

Also, you may have triggered an issue similar to the one tracked in LU-8334. In fact, we will need a crash dump taken at the next occurrence to determine the real root cause.

What do you think?

Comment by Quentin Bouget [ 06/Jul/16 ]

Indeed, I missed the one in LNetMDBind().
I'm not sure lnet_res_lock_current() can actually fail, though: the return code that is tested after it is set by lnet_md_build(), and I don't think there is any reason to take the spinlock upon failure (only to release it right after).

If this ever occurs again I will make sure to get a crash-dump.

(New patch is available)
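The point about not taking the lock when the build step has already failed can be sketched as follows. Again, this is a user-space illustration and not the actual patch: the pthread mutex, md_build_sketch(), md_bind_sketch(), and the table_full parameter are all hypothetical names chosen for the example. It assumes two separate exit labels, one for failures before locking and one for failures while the lock is held.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the LNet resource lock. */
static pthread_mutex_t res_lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_count; /* number of times res_lock was taken */

struct md_sketch {
    int length;
};

/* Hypothetical build step: validates input before any lock is needed. */
static int md_build_sketch(struct md_sketch *md, int length)
{
    if (length < 0)
        return -EINVAL;
    md->length = length;
    return 0;
}

/* Returns 0 on success or a negative errno value on failure. */
static int md_bind_sketch(int length, int table_full, struct md_sketch **out)
{
    struct md_sketch *md = malloc(sizeof(*md));
    int rc;

    if (md == NULL)
        return -ENOMEM;

    rc = md_build_sketch(md, length);
    if (rc != 0)
        goto out_free;   /* failed before locking: never touch the lock */

    pthread_mutex_lock(&res_lock);
    lock_count++;
    if (table_full) {    /* hypothetical failure while holding the lock */
        rc = -ENOSPC;
        goto out_unlock;
    }
    *out = md;
    pthread_mutex_unlock(&res_lock);
    return 0;

out_unlock:
    pthread_mutex_unlock(&res_lock);
out_free:
    free(md);            /* always freed with the lock released */
    return rc;
}
```

With this layout, a failure in the build step returns without ever acquiring the lock, and a failure under the lock still releases it before the descriptor is freed.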

Comment by Bruno Faccini (Inactive) [ 07/Jul/16 ]

Sorry, I meant failures before lnet_res_lock_current(), but I see that this is also what you implemented in the new patch set #4!

Comment by Gerrit Updater [ 29/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20676/
Subject: LU-8249 lnet: potential deadlock in lnet
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cc025e667464672edd25da819e106854c220e668

Comment by Peter Jones [ 29/Aug/16 ]

Landed for 2.9

Generated at Sat Feb 10 02:15:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.