Lustre / LU-7324

Race condition on deleting lnet_msg

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.5.3
    • Environment: bull lustre 2.5.3.90
    • Severity: 3

    Description

      On several occasions, on different OSSes, we have hit what looks like a race condition when freeing an lnet_msg.
      The crash looks as follows:

      LustreError: 25277:0:(lu_object.c:1463:key_fini()) ASSERTION( atomic_read(&key->lct_used) > 1 ) failed:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      IP: [<ffffffffa0602924>] lnet_ptl_match_md+0x3b4/0x870 [lnet]
      PGD 0
      Oops: 0002 [#1] SMP

      with this stack:

      crash> bt
      PID: 2163   TASK: ffff8803b9df2040  CPU: 25  COMMAND: "kiblnd_sd_01_02"
       #0 [ffff880340ceb7e0] machine_kexec at ffffffff8103d30b
       #1 [ffff880340ceb840] crash_kexec at ffffffff810cc4f2
       #2 [ffff880340ceb910] oops_end at ffffffff8153d3d0
       #3 [ffff880340ceb940] no_context at ffffffff8104e8cb
       #4 [ffff880340ceb990] __bad_area_nosemaphore at ffffffff8104eb55
       #5 [ffff880340ceb9e0] bad_area_nosemaphore at ffffffff8104ec23
       #6 [ffff880340ceb9f0] __do_page_fault at ffffffff8104f31c
       #7 [ffff880340cebb10] do_page_fault at ffffffff8153f31e
       #8 [ffff880340cebb40] page_fault at ffffffff8153c6c5
          [exception RIP: lnet_ptl_match_md+948]
          RIP: ffffffffa0602924  RSP: ffff880340cebbf0  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: ffff880340cebcf0  RCX: ffff880c75631940
          RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffff880bc4ed4550
          RBP: ffff880340cebc70   R8: ffff880bc4ed4550   R9: a500000000000000
          R10: 0000000000000000  R11: 0000000000000000  R12: ffff880749841800
          R13: 0000000000000002  R14: ffff881078f8b2c0  R15: 0000000000000002
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff880340cebc78] lnet_parse at ffffffffa0609fb3 [lnet]
      #10 [ffff880340cebd58] kiblnd_handle_rx at ffffffffa09f19db [ko2iblnd]
      #11 [ffff880340cebda8] kiblnd_rx_complete at ffffffffa09f26c3 [ko2iblnd]
      #12 [ffff880340cebdf8] kiblnd_complete at ffffffffa09f2872 [ko2iblnd]
      #13 [ffff880340cebe08] kiblnd_scheduler at ffffffffa09f2c2a [ko2iblnd]
      #14 [ffff880340cebee8] kthread at ffffffff810a101e
      #15 [ffff880340cebf48] kernel_thread at ffffffff8100c28a

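      The crash seems to occur here (lnet_ptl_match_delay() is presumably inlined into lnet_ptl_match_md(), which is the frame shown in the backtrace):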
      lnet_ptl_match_delay(struct lnet_portal *ptl,
                   struct lnet_match_info *info, struct lnet_msg *msg)
      {
      ...
              if (!cfs_list_empty(&msg->msg_list)) { /* on stealing list */
                  rc = lnet_mt_match_md(mtable, info, msg);

                  if ((rc & LNET_MATCHMD_EXHAUSTED) != 0 &&
                      mtable->mt_enabled)
                      lnet_ptl_disable_mt(ptl, cpt);

                  if ((rc & LNET_MATCHMD_FINISH) != 0)
                      cfs_list_del_init(&msg->msg_list); /* <=== CRASH (msg->msg_list == NULL) */
      ....

      Can you please help analyze what could cause this race?
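
      For illustration only, here is a minimal user-space sketch of one way a write fault at address 0x8 could arise in the unlink step above. Everything in it is an assumption rather than a confirmed diagnosis: struct list_head mirrors the usual kernel layout on a 64-bit build, the list helpers are simplified stand-ins for the cfs_list_* wrappers, unlink_first_store_addr() is a hypothetical helper, and the zeroed node is only a guess at the state of msg->msg_list. It shows that a zeroed node still passes the "not empty" check, while the first store of the unlink would land at NULL + 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for the kernel's doubly linked list node (assumed layout). */
struct list_head {
	struct list_head *next;
	struct list_head *prev;
};

/* An initialised node points to itself in both directions. */
static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* A node is "empty" only when next points back at the node itself. */
static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Link new_node just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *new_node, struct list_head *head)
{
	new_node->next = head;
	new_node->prev = head->prev;
	head->prev->next = new_node;
	head->prev = new_node;
}

/* Unlink entry from its list and leave it self-pointing. */
static void list_del_init(struct list_head *entry)
{
	entry->next->prev = entry->prev; /* first store: writes at next + offsetof(prev) */
	entry->prev->next = entry->next;
	init_list_head(entry);
}

/*
 * Hypothetical helper: the address the first store of list_del_init()
 * would write to, computed without performing the store.
 */
static uintptr_t unlink_first_store_addr(const struct list_head *entry)
{
	return (uintptr_t)entry->next + offsetof(struct list_head, prev);
}
```

      Calling unlink_first_store_addr() on a node cleared with memset() yields sizeof(void *), which is 8 on a 64-bit build and matches the faulting address in the oops (error code 0002 indicates a write). By contrast, calling list_del_init() twice on a properly initialised node is harmless, so a plain double unlink would not by itself explain this fault.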

          People

            Assignee: hdoreau Henri Doreau (Inactive)
            Reporter: spiechurski Sebastien Piechurski