[LU-5287] (ldlm_lib.c:2253:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed

    Description

      Running racer with 2 clients, MDSCOUNT=1, and 2.5.60-90-g37432a8 + http://review.whamcloud.com/#/c/5936/, I see this when restarting a crashed OST with some clients still mounted.

      [  230.089707] Lustre: Skipped 75 previous similar messages
      [  231.775205] Lustre: 2151:0:(client.c:1924:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404323793/real 1404323793]  req@ffff8801f78fc110 x1472540086110788/t0(0) o400->lustre-OST0001-osc-MDT0000@0@lo:28/4 lens 224/224 e 1 to 1 dl 1404323837 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
      [  237.775938] Lustre: lustre-OST0001: Denying connection for new client cc64d6dc-4180-e700-9f7e-ce147524a8f0 (at 0@lo), waiting for all 4 known clients (2 recovered, 1 in progress, and 1 evicted) to recover in 0:36
      [  237.781858] Lustre: Skipped 3 previous similar messages
      [  242.801254] LustreError: 2880:0:(ldlm_lib.c:2253:target_queue_recovery_request()) ASSERTION( req->rq_export->exp_lock_replay_needed ) failed: 
      [  242.805102] LustreError: 2880:0:(ldlm_lib.c:2253:target_queue_recovery_request()) LBUG
      [  242.807953] Pid: 2880, comm: ll_ost00_007
      [  242.809274] 
      [  242.809276] Call Trace:
      [  242.810585]  [<ffffffffa02b98c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [  242.812764]  [<ffffffffa02b9ec7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [  242.814689]  [<ffffffffa064ea0c>] target_queue_recovery_request+0xbac/0xc10 [ptlrpc]
      [  242.816347]  [<ffffffffa06e122f>] tgt_handle_recovery+0x38f/0x520 [ptlrpc]
      [  242.817666]  [<ffffffffa06e6b8d>] tgt_request_handle+0x18d/0xad0 [ptlrpc]
      [  242.818987]  [<ffffffffa0699e31>] ptlrpc_main+0xcf1/0x1880 [ptlrpc]
      [  242.820261]  [<ffffffffa0699140>] ? ptlrpc_main+0x0/0x1880 [ptlrpc]
      [  242.821440]  [<ffffffff8109eab6>] kthread+0x96/0xa0
      [  242.822360]  [<ffffffff8100c30a>] child_rip+0xa/0x20
      [  242.823303]  [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40
      [  242.824390]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
      [  242.825391]  [<ffffffff8109ea20>] ? kthread+0x0/0xa0
      [  242.826315]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
      [  242.827283] 
      

          Activity

            vinayakh Vinayak (Inactive) added a comment -

            Thanks Niu.

            We have faced a similar issue on our setup. I will post updates from my further investigation here.

            niu Niu Yawei (Inactive) added a comment -

            Right, it's safe if there aren't any concurrent changes.

            The lock matters here because there could be concurrent changes. I haven't done a thorough review of the code changes, but at first glance concurrency looks possible. Even if there are no concurrent changes in the present code, I don't think we can assume that concurrent changes will never happen; taking the lock is the safe way to avoid nasty bugs.
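A minimal userspace sketch of the locking pattern discussed above, assuming a pthread mutex as a stand-in for whatever kernel lock protects the export flags. The struct demo_export, its fields, and all helper names are hypothetical and are not the real Lustre definitions.

```c
#include <pthread.h>

/* Hypothetical, simplified stand-in for an export: two one-bit flags
 * packed into the same unsigned long, plus the lock that guards them. */
struct demo_export {
    pthread_mutex_t lock;
    unsigned long   flag_a:1,
                    lock_replay_needed:1;
};

#define ITERATIONS 100000

/* Every flag update happens under the lock, so the load/modify/store of
 * the shared word cannot interleave with another writer's update. */
static void set_flag_a(struct demo_export *exp, unsigned int v)
{
    pthread_mutex_lock(&exp->lock);
    exp->flag_a = v & 1;
    pthread_mutex_unlock(&exp->lock);
}

static void set_lock_replay_needed(struct demo_export *exp, unsigned int v)
{
    pthread_mutex_lock(&exp->lock);
    exp->lock_replay_needed = v & 1;
    pthread_mutex_unlock(&exp->lock);
}

/* Each writer toggles its own flag many times and leaves it set to 1. */
static void *writer_a(void *arg)
{
    struct demo_export *exp = arg;
    for (unsigned int i = 0; i < ITERATIONS; i++)
        set_flag_a(exp, i);
    set_flag_a(exp, 1);
    return NULL;
}

static void *writer_b(void *arg)
{
    struct demo_export *exp = arg;
    for (unsigned int i = 0; i < ITERATIONS; i++)
        set_lock_replay_needed(exp, i);
    set_lock_replay_needed(exp, 1);
    return NULL;
}

/* Returns 1 if both flags ended up set, 0 if an update was lost,
 * and -1 if a thread could not be created. */
static int run_locked_writers(void)
{
    struct demo_export exp;
    pthread_t ta, tb;
    int ok;

    pthread_mutex_init(&exp.lock, NULL);
    exp.flag_a = 0;
    exp.lock_replay_needed = 0;

    if (pthread_create(&ta, NULL, writer_a, &exp) != 0)
        return -1;
    if (pthread_create(&tb, NULL, writer_b, &exp) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    ok = exp.flag_a && exp.lock_replay_needed;
    pthread_mutex_destroy(&exp.lock);
    return ok;
}
```

Calling run_locked_writers() from a small test program (built with -pthread) should report both flags set. Removing the lock calls from the two setters would allow one writer's final store to overwrite the other's, which is the kind of lost update the lock is meant to rule out.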

            vinayakh Vinayak (Inactive) added a comment -

            Thanks Niu.

            Please correct me if I am wrong.

            "so flag change operation needs to take lock"

            If the operations are not concurrent, then it does not matter whether the bit-field change happens with or without the lock. The rest of the unsigned long will not be affected if we change any one of the bits. Am I right?

            "load the whole unsigned long, change a bit, write the whole unsigned long"

            What is the real importance of the lock here?

            Thanks in advance,

            niu Niu Yawei (Inactive) added a comment -

            All export flags share the same "unsigned long", and the flag set operation "export->exp_xxx = 1" isn't atomic (load the whole unsigned long, change a bit, write the whole unsigned long), so a flag change operation needs to take the lock.
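The load/modify/store sequence described above can be replayed by hand. The following is a minimal single-threaded sketch, not Lustre code: FLAG_A, FLAG_B, and the function names are hypothetical, and the two writers are simulated by interleaving their steps explicitly rather than by running real threads.

```c
#include <assert.h>

/* Hypothetical flags that live in the same unsigned long. */
#define FLAG_A (1UL << 0)
#define FLAG_B (1UL << 1)

/* Both writers load the word before either has stored, so the second
 * store is based on a stale copy and overwrites the first update. */
static unsigned long racy_interleaving(unsigned long word)
{
    unsigned long t1 = word;   /* writer 1: load the whole word */
    unsigned long t2 = word;   /* writer 2: load the whole word */

    t1 |= FLAG_A;              /* writer 1: change its bit */
    t2 |= FLAG_B;              /* writer 2: change its bit */

    word = t1;                 /* writer 1: write the whole word */
    word = t2;                 /* writer 2: write the whole word, FLAG_A is lost */
    return word;
}

/* With a lock, each writer finishes its load/modify/store before the
 * other starts, so the second writer sees the first writer's bit. */
static unsigned long serialized(unsigned long word)
{
    word |= FLAG_A;            /* writer 1, complete update */
    word |= FLAG_B;            /* writer 2, starts from the updated word */
    return word;
}

/* Self-check of the two orderings. */
static void check_lost_update(void)
{
    assert(racy_interleaving(0UL) == FLAG_B);          /* FLAG_A was lost */
    assert(serialized(0UL) == (FLAG_A | FLAG_B));      /* both bits kept */
}
```

In the racy ordering only FLAG_B survives even though the two writers touched different bits, which is why an unlocked change to one flag can silently undo a change to a neighbouring flag.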
            vinayakh Vinayak (Inactive) added a comment - edited

            Hello Niu,

            I am trying to understand http://review.whamcloud.com/#/c/12162.

            Can you please help me understand how it affects the req->rq_export->exp_lock_replay_needed flag?

            Thanks in advance,
            pjones Peter Jones added a comment -

            Landed for 2.5.4 and 2.7

            bogl Bob Glossman (Inactive) added a comment -

            Backports to b2_5: http://review.whamcloud.com/#/c/12162 http://review.whamcloud.com/#/c/12163

            morrone Christopher Morrone (Inactive) added a comment -

            All of the patches needed to fix this ticket's assertion.

            But actually, we are moving to 2.5 soon, so patches for b2_5 should be sufficient.

            bogl Bob Glossman (Inactive) added a comment -

            Christopher, which patch(es) do you need backported to b2_4? I don't want to do too much or too little.

            morrone Christopher Morrone (Inactive) added a comment -

            We just had 181 servers crash with this assertion when starting up 2.4.2-16chaos (see github.com/chaos/lustre) on the servers for the first time. We need a patch for our branch as well.

            People

              niu Niu Yawei (Inactive)
              jhammond John Hammond