Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.5.3, bull lustre 2.5.3.90

    Description

      We have hit, several times and on different OSSes, what looks like a race condition when freeing an lnet_msg.
      The crash looks as follows:

      LustreError: 25277:0:(lu_object.c:1463:key_fini()) ASSERTION( atomic_read(&key->lct_used) > 1 ) failed:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      IP: [<ffffffffa0602924>] lnet_ptl_match_md+0x3b4/0x870 [lnet]
      PGD 0
      Oops: 0002 [#1] SMP

      with this stack:

      crash> bt
      PID: 2163   TASK: ffff8803b9df2040  CPU: 25  COMMAND: "kiblnd_sd_01_02"
       #0 [ffff880340ceb7e0] machine_kexec at ffffffff8103d30b
       #1 [ffff880340ceb840] crash_kexec at ffffffff810cc4f2
       #2 [ffff880340ceb910] oops_end at ffffffff8153d3d0
       #3 [ffff880340ceb940] no_context at ffffffff8104e8cb
       #4 [ffff880340ceb990] __bad_area_nosemaphore at ffffffff8104eb55
       #5 [ffff880340ceb9e0] bad_area_nosemaphore at ffffffff8104ec23
       #6 [ffff880340ceb9f0] __do_page_fault at ffffffff8104f31c
       #7 [ffff880340cebb10] do_page_fault at ffffffff8153f31e
       #8 [ffff880340cebb40] page_fault at ffffffff8153c6c5
          [exception RIP: lnet_ptl_match_md+948]
          RIP: ffffffffa0602924  RSP: ffff880340cebbf0  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: ffff880340cebcf0  RCX: ffff880c75631940
          RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffff880bc4ed4550
          RBP: ffff880340cebc70   R8: ffff880bc4ed4550   R9: a500000000000000
          R10: 0000000000000000  R11: 0000000000000000  R12: ffff880749841800
          R13: 0000000000000002  R14: ffff881078f8b2c0  R15: 0000000000000002
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff880340cebc78] lnet_parse at ffffffffa0609fb3 [lnet]
      #10 [ffff880340cebd58] kiblnd_handle_rx at ffffffffa09f19db [ko2iblnd]
      #11 [ffff880340cebda8] kiblnd_rx_complete at ffffffffa09f26c3 [ko2iblnd]
      #12 [ffff880340cebdf8] kiblnd_complete at ffffffffa09f2872 [ko2iblnd]
      #13 [ffff880340cebe08] kiblnd_scheduler at ffffffffa09f2c2a [ko2iblnd]
      #14 [ffff880340cebee8] kthread at ffffffff810a101e
      #15 [ffff880340cebf48] kernel_thread at ffffffff8100c28a

      The crash seems to occur here, in lnet_ptl_match_delay():

      lnet_ptl_match_delay(struct lnet_portal *ptl,
                   struct lnet_match_info *info, struct lnet_msg *msg)
      {
      ...
              if (!cfs_list_empty(&msg->msg_list)) { /* on stealing list */
                  rc = lnet_mt_match_md(mtable, info, msg);

                  if ((rc & LNET_MATCHMD_EXHAUSTED) != 0 &&
                      mtable->mt_enabled)
                      lnet_ptl_disable_mt(ptl, cpt);

                  if ((rc & LNET_MATCHMD_FINISH) != 0)
                      cfs_list_del_init(&msg->msg_list); /* <=== CRASH: the msg_list link pointers are NULL */
      ....

      Could you please help analyze what might be causing this race?

      Attachments

        Activity

          [LU-7324] Race condition on deleting lnet_msg

          Landed for 2.8.0

          jgmitter Joseph Gmitter (Inactive) added a comment

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17840/
          Subject: LU-7324 lnet: Use after free in lnet_ptl_match_delay()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 607f6919ea67b101796630d4b55649a12ea0e859

          gerrit Gerrit Updater added a comment

          Olaf, yes I think your patch is quite reasonable, thanks.

          liang Liang Zhen (Inactive) added a comment

          Hi Liang. I had considered adding an LNET_MATCHMD_DELAYED code, and do believe that approach would work. I ultimately decided against it in my proposal because I preferred to not change the interface of lnet_ptl_match_md().

          olaf Olaf Weber (Inactive) added a comment

          Please ignore this patch. Olaf's patch is essentially the same, so I will review his patch instead. Sorry, I didn't read through the full comments.

          Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/18081 (abandoned)

          gerrit Gerrit Updater added a comment - edited

          Hello Olaf,
          Sorry for this late update, but I wanted to be sure I fully understood the pieces of code involved before answering you.

          I can now say that you were already far ahead of me here, and I am fairly convinced that your patch should make it in, particularly the change to no longer access msg->msg_rx_delayed outside of the lnet_ptl_[un]lock()/lnet_res_[un]lock() protection, and to use/check the LNET_MATCHMD_NONE return value instead, in both the lnet_ptl_match_md() and lnet_ptl_match_delay() routines.

          My previous doubts came from the analysis we did of the original crashes reported in this ticket, where the corruption (in fact a reuse after free) caused a GPF in lnet_ptl_match_delay() when trying to list_del_init(&msg->msg_list) after the !list_empty(&msg->msg_list) condition had been verified. But I am now convinced this can happen because, after a reuse-after-free that mostly zeroes the old lnet_msg content, msg->msg_rx_delayed could have led to an unexpected additional loop!

          Thus I have abandoned http://review.whamcloud.com/17847, and since your http://review.whamcloud.com/17840 has triggered some unexpected and related errors, I have restarted its auto-test session.

          bfaccini Bruno Faccini (Inactive) added a comment

          Hi Bruno,

          Can you explain in a bit more detail why you think lnet_res_lock(LNET_LOCK_EX) is required? As far as I can tell, avoiding any reference to a message on the delay queue is sufficient.

          olaf Olaf Weber (Inactive) added a comment

          Hello Olaf, I think your analysis of the code involved in this issue is correct, as far as I understand it myself!
          But after my first, incorrect fix attempt, and the rework that followed with the help of your comments, I don't see how to quickly fix this without using LNET_LOCK_EX protection in lnet_ptl_match_delay().

          bfaccini Bruno Faccini (Inactive) added a comment

          Hi Bruno,

          FWIW, what I understand the code attempts to do is as follows: when looking for a matching MD for an incoming message, it may be the case that there is no such MD associated with the current CPT, as determined by lnet_mt_match_md(). At that point there are two choices: receive the message anyway and cache it locally until an MD is available, or delay receiving the message until an MD becomes available.

          If there is an lnd_t::lnd_eager_recv() callout then the message is received from the network anyway, and kept around on the node until an MD is attached to the portal. This is not the case that results in the failures discussed in this LU.

          If there is no lnd_t::lnd_eager_recv() then lnet_ptl_match_md() calls lnet_ptl_match_delay() to look at the match tables for other CPTs, to see if a matching MD can be found that way. And if there is none, it puts the message on the lnet_portal_t::ptl_msg_delayed queue to be processed when a matching MD is attached by lnet_ptl_attach_md().

          So lnet_ptl_match_delay() and lnet_ptl_attach_md() need to synchronize. The lnet_portal_t::ptl_msg_delayed and lnet_portal_t::ptl_msg_stealing queues are used for that. In lnet_ptl_match_delay() the message is put on the stealing queue so that lnet_ptl_attach_md() can signal that an MD is now available and that the message can be received and processed. lnet_ptl_attach_md() does this by removing it from the stealing queue, and is careful not to touch the message after that.

          In lnet_ptl_match_delay() the message is put on the delay queue so that lnet_ptl_attach_md() can pick it up to complete receiving and processing it. Once it has added a message to the delay queue and dropped the lnet_ptl_lock and lnet_res_lock, lnet_ptl_match_delay() has lost control over the message and should no longer reference it, and the same applies to its callers. Unfortunately, the current code does reference the message after dropping the locks. I believe this is the cause of the crashes.

          Regarding lnet_ptl_match_delay(), there are several comments in the code saying that the "delay" case is not expected to happen often. That this race condition can be hit fairly reliably argues that some workloads exercise it regularly. And that in turn suggests that the worries (also expressed in the comments) that this code path is comparatively expensive are valid. The amount of work done in the loop in lnet_ptl_match_delay() can be considerable, which is why I think doing the entire loop under lnet_res_lock(LNET_LOCK_EX) might not be a good idea.

          olaf Olaf Weber (Inactive) added a comment

          Hi Bruno,
          Your proposed change prevents additions to the ptl_msg_stealing list: anything added by lnet_ptl_match_delay() is also removed before the lnet_res_lock or lnet_ptl_lock are dropped, and lnet_ptl_match_delay() is the only routine that adds to this list. Therefore that field, and all code relating to it, should be removed as well. This might even be a good idea, but I don't think it qualifies as a conservative change.

          olaf Olaf Weber (Inactive) added a comment

          People

            hdoreau Henri Doreau (Inactive)
            spiechurski Sebastien Piechurski
            Votes: 0
            Watchers: 13

            Dates

              Created:
              Updated:
              Resolved: