  Lustre / LU-5630

mdt_getattr_name_lock()) ASSERTION( lock != NULL )

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.2
    • Environment: Lustre 2.4.2-14chaos (see github.com/chaos/lustre)
    • 3
    • 15745

    Description

      2014-09-11 21:10:30 LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.120.199@o2ib7  ns: mdt-lsd-MDT0000_UUID lock: ffff880321a4a480/0x6bd4680b789ee41f lrc: 4/0,0 mode: PR/PR res: [0x2000112f3:0xf:0x0].0 bits 0x13 rrc: 4 type: IBT flags: 0x200000000020 nid: 192.168.120.199@o2ib7 remote: 0xf350c14aff003b28 expref: 30 pid: 17248 timeout: 6838410913 lvb_type: 0 used 0
      2014-09-11 21:10:30 LustreError: 15075:0:(mdt_handler.c:1423:mdt_getattr_name_lock()) ASSERTION( lock != NULL ) failed: Invalid lock handle 0x6bd4680b789ee41f
      2014-09-11 21:10:30 LustreError: 15075:0:(mdt_handler.c:1423:mdt_getattr_name_lock()) LBUG
      2014-09-11 21:10:30 Pid: 15075, comm: mdt00_069
      

      The backtrace is:

      PID: 15075  TASK: ffff880d7001f540  CPU: 2   COMMAND: "mdt00_069"
       #0 [ffff880d70021938] machine_kexec+0x18b at ffffffff810391ab
       #1 [ffff880d70021998] crash_kexec+0x72 at ffffffff810c5ee2
       #2 [ffff880d70021a68] panic+0xae at ffffffff8152b247
       #3 [ffff880d70021ae8] lbug_with_loc+0x9b at ffffffffa0601f4b [libcfs]
       #4 [ffff880d70021b08] mdt_getattr_name_lock+0x18d0 at ffffffffa0e99900 [mdt]
       #5 [ffff880d70021bc8] mdt_intent_getattr+0x29d at ffffffffa0e99c5d [mdt]
       #6 [ffff880d70021c28] mdt_intent_policy+0x39e at ffffffffa0e86fde [mdt]
       #7 [ffff880d70021c68] ldlm_lock_enqueue+0x361 at ffffffffa08b8911 [ptlrpc]
       #8 [ffff880d70021cc8] ldlm_handle_enqueue0+0x4ef at ffffffffa08e1a7f [ptlrpc]
       #9 [ffff880d70021d38] mdt_enqueue+0x46 at ffffffffa0e87466 [mdt]
      #10 [ffff880d70021d58] mdt_handle_common+0x647 at ffffffffa0e8c0d7 [mdt]
      #11 [ffff880d70021da8] mds_regular_handle+0x15 at ffffffffa0ec7c75 [mdt]
      #12 [ffff880d70021db8] ptlrpc_server_handle_request+0x398 at ffffffffa0912188 [ptlrpc]
      #13 [ffff880d70021eb8] ptlrpc_main+0xace at ffffffffa091351e [ptlrpc]
      #14 [ffff880d70021f48] child_rip+0xa at ffffffff8100c24a
      

      This looks like the same assertion as LU-5579, but that one was presumably hit on Lustre 2.6 or later.

      Attachments

        Issue Links

          Activity

            green Oleg Drokin added a comment -

            I suspect ESTALE would propagate all the way up to userspace.

            On the other hand, if it's due to eviction of that same client, it does not matter due to a bunch of EIO and other stuff this client will get anyway.
            In case of the Vitaly-described race where resend happens in parallel with delayed delivery of RPC for which the resend happened, ESTALE is just going to be dropped because the client will not be waiting for this duplicate reply.


            morrone Christopher Morrone (Inactive) added a comment -

            How will the client behave when it gets ESTALE?
            green Oleg Drokin added a comment -

            Yes, I think the bug is the same.
            Quickfix for b2_4 would be to just replace assert with return -ESTALE;

            This is not the final solution. I am starting to have my doubts that we should return ESTALE on resend, as the client is not really at fault here and reprocessing the entire request might be a better idea.
            I am going to discuss this idea with Vitaly, but at least this will fix the crash for now.

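The quickfix Oleg describes — replacing the assertion with an error return — could look roughly like this minimal sketch (plain C; getattr_name_lock() and handle2lock_stale() are simplified hypothetical stand-ins, not the actual mdt_handler.c code):

```c
#include <stdio.h>

#define ESTALE 116  /* Linux errno value for "stale file handle" */

struct ldlm_lock;

/* Hypothetical stand-in for the handle lookup, modelling the case
 * where the lock behind the handle has already been cancelled. */
static struct ldlm_lock *handle2lock_stale(void)
{
    return NULL;
}

/* Before (b2_4): LASSERT(lock != NULL) -> LBUG -> server panic.
 * After (quickfix sketch): fail just this one request instead. */
static int getattr_name_lock(void)
{
    struct ldlm_lock *lock = handle2lock_stale();

    if (lock == NULL)
        return -ESTALE;  /* stale handle: error out, don't LBUG */

    /* ... normal getattr-by-name processing would continue here ... */
    return 0;
}

int main(void)
{
    int rc = getattr_name_lock();
    printf("rc = %d\n", rc);  /* rc = -116 */
    return rc == -ESTALE ? 0 : 1;
}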
            pjones Peter Jones added a comment -

            Oleg

            Can you confirm whether this is a duplicate of LU-5579?

            Thanks

            Peter


            liang Liang Zhen (Inactive) added a comment -

            I think this is an issue we also hit on master; Vitaly has already posted a patch on LU-5579.

            People

              Assignee: green Oleg Drokin
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 4
