Details


    Description

      <0>LustreError: 5766:0:(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed:
      <0>LustreError: 5766:0:(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) LBUG
      <4>Pid: 5766, comm: mdt00_020
      <4>
      <4>Call Trace:
      <4> [<ffffffffa0414895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa0414e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa06b03b2>] ldlm_lock_decref_internal_nolock+0xd2/0x180 [ptlrpc]
      <4> [<ffffffffa06b4aad>] ldlm_lock_decref_internal+0x4d/0xaa0 [ptlrpc]
      <4> [<ffffffffa054a315>] ? class_handle2object+0x95/0x190 [obdclass]
      <4> [<ffffffffa06b5f69>] ldlm_lock_decref+0x39/0x90 [ptlrpc]
      <4> [<ffffffffa0dd74a3>] mdt_save_lock+0x63/0x300 [mdt]
      <4> [<ffffffffa06fd900>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
      <4> [<ffffffffa0dd779c>] mdt_object_unlock+0x5c/0x160 [mdt]
      <4> [<ffffffffa0e05a4c>] mdt_object_open_unlock+0xac/0x110 [mdt]
      <4> [<ffffffffa0e0c9b4>] mdt_reint_open+0xdd4/0x20e0 [mdt]
      <4> [<ffffffffa0e0e34c>] mdt_reconstruct_open+0x68c/0xc30 [mdt]
      <4> [<ffffffffa07226a6>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
      <4> [<ffffffffa06fb1ae>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      <4> [<ffffffffa0e01195>] mdt_reconstruct+0x45/0x120 [mdt]
      <4> [<ffffffffa0ddccfb>] mdt_reint_internal+0x6bb/0x780 [mdt]
      <4> [<ffffffffa0ddd08d>] mdt_intent_reint+0x1ed/0x520 [mdt]
      <4> [<ffffffffa0ddaf3e>] mdt_intent_policy+0x39e/0x720 [mdt]
      <4> [<ffffffffa06b27e1>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
      <4> [<ffffffffa06d924f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
      <4> [<ffffffffa0ddb3c6>] mdt_enqueue+0x46/0xe0 [mdt]
      <4> [<ffffffffa0de1ab7>] mdt_handle_common+0x647/0x16d0 [mdt]
      <4> [<ffffffffa0e1b2b5>] mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa070b428>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      <4> [<ffffffffa04155de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa0426dbf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa0702789>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      <4> [<ffffffff810557f3>] ? __wake_up+0x53/0x70
      <4> [<ffffffffa070c7be>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      <4> [<ffffffffa070bcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
      <4> [<ffffffffa070bcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffffa070bcf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      <4>
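
      The assertion fires in ldlm_lock_decref_internal_nolock() when a read-mode
      reference is dropped on a lock whose l_readers count is already zero, i.e.
      the reference being put was either already released or was never taken in
      read mode somewhere along the mdt_reconstruct_open() ->
      mdt_object_open_unlock() -> mdt_object_unlock() -> mdt_save_lock() ->
      ldlm_lock_decref() path shown above. A minimal, self-contained C sketch of
      that failure mode follows; the struct and function names are illustrative
      stand-ins for struct ldlm_lock and ldlm_lock_decref(), not the actual
      Lustre code:

      /*
       * Hedged illustration only: demo_lock/demo_decref_read stand in for
       * struct ldlm_lock and ldlm_lock_decref(); real ldlm locking is omitted.
       */
      #include <assert.h>

      struct demo_lock {
              int l_readers;  /* read-mode references, cf. lock->l_readers */
              int l_writers;  /* write-mode references */
      };

      /* Mirrors the check that trips in ldlm_lock_decref_internal_nolock(). */
      static void demo_decref_read(struct demo_lock *lock)
      {
              assert(lock->l_readers > 0);  /* ASSERTION( lock->l_readers > 0 ) */
              lock->l_readers--;
      }

      int main(void)
      {
              struct demo_lock lock = { .l_readers = 1, .l_writers = 0 };

              demo_decref_read(&lock);  /* balanced release: l_readers 1 -> 0 */
              demo_decref_read(&lock);  /* a second release of the same
                                         * reference aborts here, like the
                                         * LBUG above */
              return 0;
      }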

      Attachments

        Activity

          [LU-4403] ASSERTION( lock->l_readers > 0 )
          jamesanunez James Nunez (Inactive) added a comment -

          Patch for b2_5 at http://review.whamcloud.com/#/c/9779/

          javed javed shaikh (Inactive) added a comment -

          Just FYI, we were hit by this on 9 February; I've attached the mds.log.
          This is on Lustre 2.4.2; we haven't patched yet.
          pjones Peter Jones added a comment -

          Patch landed for 2.6

          jaylan Jay Lan (Inactive) added a comment -

          My bad. Patch set #5 was in my nas-2.4.0-1 branch, but not in the
          nas-2.4.1 branch; the nas-2.4.1 branch carried an earlier version of
          the patch. We just upgraded our server to 2.4.1 yesterday.

          jay Jinshan Xiong (Inactive) added a comment -

          Can you share the following info:
          1. Compared to the previous patches, did it last longer after applying patch set 5?
          2. What is the tip of the source tree you are running?

          Jinshan

          mhanafi Mahmoud Hanafi added a comment -

          Patch set 5 didn't fix the issue. We just hit this bug again.

          LustreError: 45299:0:(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed:
          LustreError: 45299:0:(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) LBUG
          Pid: 45299, comm: mdt02_087

          PID: 20719 TASK: ffff880368864aa0 CPU: 24 COMMAND: "mdt01_059"
          #0 [ffff88036c5394c8] machine_kexec at ffffffff81035e8b
          #1 [ffff88036c539528] crash_kexec at ffffffff810c0492
          #2 [ffff88036c5395f8] kdb_kdump_check at ffffffff812858d7
          #3 [ffff88036c539608] kdb_main_loop at ffffffff81288ac7
          #4 [ffff88036c539718] kdb_save_running at ffffffff81282c2e
          #5 [ffff88036c539728] kdba_main_loop at ffffffff81463988
          #6 [ffff88036c539768] kdb at ffffffff81285dc6
          #7 [ffff88036c5397d8] panic at ffffffff8153efbf
          #8 [ffff88036c539858] lbug_with_loc at ffffffffa045deeb [libcfs]
          #9 [ffff88036c539878] ldlm_lock_decref_internal_nolock at ffffffffa0706402 [ptlrpc]
          #10 [ffff88036c539898] ldlm_lock_decref_internal at ffffffffa070aafd [ptlrpc]
          #11 [ffff88036c5398f8] ldlm_lock_decref at ffffffffa070bfb9 [ptlrpc]
          #12 [ffff88036c539928] mdt_save_lock at ffffffffa0e3c483 [mdt]
          #13 [ffff88036c539978] mdt_object_unlock at ffffffffa0e3c77c [mdt]
          #14 [ffff88036c5399a8] mdt_object_open_unlock at ffffffffa0e6acfc [mdt]
          #15 [ffff88036c5399f8] mdt_reint_open at ffffffffa0e71d14 [mdt]
          #16 [ffff88036c539ae8] mdt_reconstruct_open at ffffffffa0e736ac [mdt]
          #17 [ffff88036c539b78] mdt_reconstruct at ffffffffa0e66445 [mdt]
          #18 [ffff88036c539b98] mdt_reint_internal at ffffffffa0e41cfb [mdt]
          #19 [ffff88036c539bd8] mdt_intent_reint at ffffffffa0e42090 [mdt]
          #20 [ffff88036c539c28] mdt_intent_policy at ffffffffa0e3ff3e [mdt]
          #21 [ffff88036c539c68] ldlm_lock_enqueue at ffffffffa0708831 [ptlrpc]
          #22 [ffff88036c539cc8] ldlm_handle_enqueue0 at ffffffffa072f1ef [ptlrpc]
          #23 [ffff88036c539d38] mdt_enqueue at ffffffffa0e403c6 [mdt]
          #24 [ffff88036c539d58] mdt_handle_common at ffffffffa0e46ad7 [mdt]
          #25 [ffff88036c539da8] mds_regular_handle at ffffffffa0e80615 [mdt]
          #26 [ffff88036c539db8] ptlrpc_server_handle_request at ffffffffa07613c8 [ptlrpc]
          #27 [ffff88036c539eb8] ptlrpc_main at ffffffffa076275e [ptlrpc]
          #28 [ffff88036c539f48] kernel_thread at ffffffff8100c0ca

          jay Jinshan Xiong (Inactive) added a comment -

          Thanks for the update, Jay, and good luck with patch set 5.

          jaylan Jay Lan (Inactive) added a comment -

          We had patch set 5 of #8642 installed on January 8th. Yesterday morning the MDS crashed (it was still running patch set 4) and booted back up with patch set 5.

          Early this morning the MDS crashed again; however, that crash was caused by a different bug on an OSS, and the OSS crash brought down the MDS. So patch set 5 has now been running for more than a day without hitting this problem. We will let it soak for more time.

          jay Jinshan Xiong (Inactive) added a comment -

          Dropping the priority, as there has been no response from the customer; meanwhile, I believe we have found the root cause of this issue.

          jay Jinshan Xiong (Inactive) added a comment -

          Patch http://review.whamcloud.com/6511 already fixed this problem. It is worth trying that patch alone if you have a chance.

          People

            jay Jinshan Xiong (Inactive)
            mhanafi Mahmoud Hanafi
            Votes:
            0
            Watchers:
            10
