[LU-8249] Potential deadlock in lnet Created: 07/Jun/16 Updated: 05/Aug/20 Resolved: 29/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | CEA | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While testing a patch I was able to deadlock all the CPUs of my machine. I cannot be sure whether this was actually the cause of my problem, though, because I have not been able to reproduce it since then. |
| Comments |
| Comment by Gerrit Updater [ 07/Jun/16 ] |
|
Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20676 |
| Comment by Bruno Faccini (Inactive) [ 08/Jun/16 ] |
|
Hello Quentin, could you provide additional information to help us understand the problem? For example, the log reports you refer to, and maybe a crash-dump is available too? |
| Comment by Joseph Gmitter (Inactive) [ 08/Jun/16 ] |
|
Hi Doug, Can you please investigate this patch? Thanks. |
| Comment by Quentin Bouget [ 09/Jun/16 ] |
|
Hello Bruno, I actually do not have much; I did not realize what this was at the time. I thought it was related to my patch and would be easy to reproduce. The best I have is a partial call stack implicating "cfs_percpt_lock" (attachment file). I reviewed the code that uses it and found the inverted "vfree" and "cfs_percpt_unlock" in "LNetMDAttach". |
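For illustration only, the ordering problem described in this comment can be sketched in userspace C. This is not the LNet code: `demo_table`, `demo_lock()`, `demo_unlock()`, `demo_free()` and `demo_drop_slot()` are hypothetical names, a C11 `atomic_flag` stands in for the kernel spinlock, and `free()` stands in for `vfree()`. The sketch detaches the buffer while the lock is held and frees it only after the lock is released.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical resource table; not an LNet structure. */
struct demo_table {
	atomic_flag lock;      /* stand-in for a kernel spinlock */
	int held;              /* 1 while the lock is held (demo bookkeeping) */
	int frees_under_lock;  /* frees that happened with the lock held */
	void *slot;            /* buffer owned by the table */
};

static void demo_init(struct demo_table *t)
{
	atomic_flag_clear(&t->lock);
	t->held = 0;
	t->frees_under_lock = 0;
	t->slot = NULL;
}

static void demo_lock(struct demo_table *t)
{
	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
		; /* spin until the flag is ours */
	t->held = 1;
}

static void demo_unlock(struct demo_table *t)
{
	assert(t->held); /* unlocking a lock we do not hold is a bug */
	t->held = 0;
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* free() stands in for vfree(); records whether the lock was held. */
static void demo_free(struct demo_table *t, void *p)
{
	if (t->held)
		t->frees_under_lock++;
	free(p);
}

/* Detach the buffer under the lock, free it after the unlock. */
static int demo_drop_slot(struct demo_table *t)
{
	void *victim;

	demo_lock(t);
	victim = t->slot;
	t->slot = NULL;
	demo_unlock(t);

	demo_free(t, victim); /* the lock is no longer held here */
	return victim != NULL;
}
```

The `held` and `frees_under_lock` fields exist only so the ordering can be checked in a single-threaded test; a real implementation would not carry them.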
| Comment by Bruno Faccini (Inactive) [ 06/Jul/16 ] |
|
Hello Quentin, even if I don't think it will fix the specific deadlock you encountered, I am not against your patch, which looks OK as a way to comply with the rule against calling vfree() while holding a spin-lock. But then I think you should at least modify LNetMDBind() the same way, and there also avoid calling lnet_res_unlock() upon lnet_res_lock_current() failure. Also, you may have triggered an issue similar to the one tracked in What do you think? |
| Comment by Quentin Bouget [ 06/Jul/16 ] |
|
Indeed, I missed the one in LNetMDBind(). If this ever occurs again I will make sure to get a crash-dump. (A new patch is available.) |
| Comment by Bruno Faccini (Inactive) [ 07/Jul/16 ] |
|
Sorry, I meant for failures before lnet_res_lock_current(), but I see that this is what you also implemented in the new patch-set #4! |
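As a rough sketch of the error-path point above (again userspace C, not the merged patch): failures that happen before the lock is taken should jump to a label that only frees, while failures that happen with the lock held unlock first and then free. All names here (`sketch_res`, `sketch_attach()`, the `options_ok` parameter) are hypothetical, and `free()` stands in for `vfree()`.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical container; not an LNet structure. */
struct sketch_res {
	atomic_flag lock;   /* stand-in for the resource spinlock */
	int held;           /* demo bookkeeping: 1 while locked */
	int bad_unlocks;    /* unlock calls made without holding the lock */
	void *attached;     /* currently attached descriptor, if any */
};

static void sketch_init(struct sketch_res *r)
{
	atomic_flag_clear(&r->lock);
	r->held = 0;
	r->bad_unlocks = 0;
	r->attached = NULL;
}

static void sketch_lock(struct sketch_res *r)
{
	while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
		; /* spin */
	r->held = 1;
}

static void sketch_unlock(struct sketch_res *r)
{
	if (!r->held) {
		r->bad_unlocks++; /* the mistake this sketch is meant to avoid */
		return;
	}
	r->held = 0;
	atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Attach-like operation with separate exits for locked and unlocked failures. */
static int sketch_attach(struct sketch_res *r, size_t size, int options_ok)
{
	void *md;
	int rc;

	md = malloc(size);
	if (md == NULL)
		return -ENOMEM;      /* nothing allocated, nothing locked */

	if (!options_ok) {
		rc = -EINVAL;
		goto out_free;       /* lock not taken yet: do not unlock */
	}

	sketch_lock(r);
	if (r->attached != NULL) {
		rc = -EBUSY;
		goto out_unlock;     /* lock is held: release it first */
	}
	r->attached = md;
	sketch_unlock(r);
	return 0;

out_unlock:
	sketch_unlock(r);
out_free:
	free(md);                    /* always runs with the lock released */
	return rc;
}
```

With this layout, a caller that passes `options_ok == 0` gets `-EINVAL` without any unlock call being made, and a second attach on a busy slot gets `-EBUSY` with the buffer freed only after the unlock.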
| Comment by Gerrit Updater [ 29/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20676/ |
| Comment by Peter Jones [ 29/Aug/16 ] |
|
Landed for 2.9 |