Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • None
    • 3

    Description

      While testing a patch, I managed to deadlock all the CPUs of my machine.
      After investigating the log reports, I found that "LNetMDAttach" (implemented in lnet/lnet/lib-md.c) calls the "vfree" function in interrupt context, which is illegal in Linux kernel versions prior to 3.10.

      I cannot be sure this was actually the cause of my problem, though, because I have not been able to reproduce it since.
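
The ordering problem described above can be illustrated with a minimal userspace sketch. It is not the Lustre code: `struct md`, `md_add()`, `md_remove()`, `md_count()` and the `res_lock_*` helpers are hypothetical names, a C11 `atomic_flag` spin lock stands in for the LNet resource lock, and `free()` stands in for `vfree()`. The point is only the shape: unlink the object while holding the lock, then release the lock before freeing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a memory descriptor kept on a shared list. */
struct md {
    struct md *next;
    int        id;
};

static atomic_flag res_lock = ATOMIC_FLAG_INIT; /* stand-in for the resource lock */
static struct md *md_list;                      /* protected by res_lock */

static void res_lock_acquire(void)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&res_lock, memory_order_acquire))
        ;
}

static void res_lock_release(void)
{
    atomic_flag_clear_explicit(&res_lock, memory_order_release);
}

/* Allocate outside the lock, publish under it. Returns 0 or -1. */
int md_add(int id)
{
    struct md *md = malloc(sizeof(*md));

    if (md == NULL)
        return -1;
    md->id = id;

    res_lock_acquire();
    md->next = md_list;
    md_list = md;
    res_lock_release();
    return 0;
}

/* Unlink under the lock, free only after the lock is dropped. */
int md_remove(int id)
{
    struct md *victim = NULL;
    struct md **pp;

    res_lock_acquire();
    for (pp = &md_list; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            victim = *pp;
            *pp = victim->next;   /* no longer reachable by others */
            break;
        }
    }
    res_lock_release();

    if (victim == NULL)
        return -1;
    assert(victim->id == id);     /* sanity check on the unlinked node */
    free(victim);                 /* stand-in for vfree(), lock not held */
    return 0;
}

/* Count the descriptors currently on the list. */
int md_count(void)
{
    int n = 0;
    struct md *md;

    res_lock_acquire();
    for (md = md_list; md != NULL; md = md->next)
        n++;
    res_lock_release();
    return n;
}
```

Because the node is unlinked before the lock is released, no other thread can reach it by the time it is freed, so the free does not need the lock at all.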

      Attachments

        Activity

          [LU-8249] Potential deadlock in lnet
          pjones Peter Jones added a comment -

          Landed for 2.9

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20676/
          Subject: LU-8249 lnet: potential deadlock in lnet
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: cc025e667464672edd25da819e106854c220e668

          bfaccini Bruno Faccini (Inactive) added a comment -

          Sorry, I meant for failures before lnet_res_lock_current(), but I see that this is what you also implemented in the new patch set #4!

          bougetq Quentin Bouget (Inactive) added a comment -

          Indeed, I missed the one in LNetMDBind().
          I'm not sure lnet_res_lock_current() can actually fail, though: the return code tested after it is set by lnet_md_build(), and I don't think there is any real reason to take the spinlock upon failure if it is only to release it right afterwards.

          If this ever occurs again, I will make sure to get a crash dump.

          (A new patch is available.)
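
The error-path point discussed in this comment (a failure in the build step happens before the lock is taken, so there is nothing to unlock) can be sketched as follows. This is a userspace illustration under stated assumptions, not the actual LNetMDAttach()/LNetMDBind() code: `md_build()` here is a hypothetical validator, `md_attach()`, `table` and the `table_lock_*` helpers are invented for the example, and `free()` stands in for `vfree()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define TABLE_SLOTS 4

/* Hypothetical descriptor; fields are illustrative only. */
struct md {
    int    id;
    size_t length;
};

static atomic_flag table_lock = ATOMIC_FLAG_INIT; /* stand-in for the resource lock */
static struct md *table[TABLE_SLOTS];             /* protected by table_lock */

static void table_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&table_lock, memory_order_acquire))
        ;
}

static void table_lock_release(void)
{
    atomic_flag_clear_explicit(&table_lock, memory_order_release);
}

/* Validate and fill in a private descriptor; needs no lock. */
static int md_build(struct md *md, int id, size_t length)
{
    assert(md != NULL);
    if (length == 0)
        return -EINVAL;
    md->id = id;
    md->length = length;
    return 0;
}

/* Returns 0 on success or a negative errno value on failure. */
int md_attach(int id, size_t length)
{
    struct md *md = malloc(sizeof(*md));
    int rc;
    int i;

    if (md == NULL)
        return -ENOMEM;

    rc = md_build(md, id, length);
    if (rc != 0)
        goto out_free;        /* lock was never taken: no unlock here */

    rc = -ENOSPC;
    table_lock_acquire();
    for (i = 0; i < TABLE_SLOTS; i++) {
        if (table[i] == NULL) {
            table[i] = md;    /* publish the descriptor */
            rc = 0;
            break;
        }
    }
    table_lock_release();     /* always dropped before any free */
    if (rc == 0)
        return 0;

out_free:
    free(md);                 /* stand-in for vfree(), lock not held */
    return rc;
}
```

Both failure paths reach `out_free` with the lock released, so the free never runs under the spin lock and the unlock is never called without a matching lock.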
          bfaccini Bruno Faccini (Inactive) added a comment - edited

          Hello Quentin, even though I don't think it will fix the specific deadlock you encountered, I am not against your patch, which looks fine for complying with the rule against calling vfree() while holding a spinlock. However, I think you should at least modify LNetMDBind() the same way, and there also avoid calling lnet_res_unlock() upon lnet_res_lock_current() failure.

          Also, you may have triggered an issue similar to the one tracked in LU-8334; in fact, we will need a crash dump taken at the next occurrence to determine the real root cause.

          What do you think?
          bougetq Quentin Bouget (Inactive) added a comment - edited

          Hello Bruno, I actually do not have much; I did not realize what this was at the time. I thought it was related to my patch and would be easy to reproduce. The best I have is a partial call stack implicating "cfs_percpt_lock" (see the attached file). I reviewed the code that uses it and found the inverted "vfree" and "cfs_percpt_unlock" in "LNetMDAttach".

          jgmitter Joseph Gmitter (Inactive) added a comment -

          Hi Doug,

          Can you please investigate this patch?

          Thanks.
          Joe

          bfaccini Bruno Faccini (Inactive) added a comment -

          Hello Quentin, could you provide additional information to help us understand the problem? For example, the log reports you refer to, and perhaps a crash dump if one is available?

          gerrit Gerrit Updater added a comment -

          Quentin Bouget (quentin.bouget.ocre@cea.fr) uploaded a new patch: http://review.whamcloud.com/20676
          Subject: LU-8249 lnet: potential deadlock in lnet
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: ef1052386265b73c129c2dc5d20bd86e175a8c0d

          People

            doug Doug Oucharek (Inactive)
            cealustre CEA
            Votes: 0
            Watchers: 7
