Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9504

LBUG ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.10.0
    • Lustre 2.10.0
    • Soak stress cluster
    • 3
    • 9223372036854775807

    Description

      Loaded latest master build, started soak. Only fault induced was router drop.
      Soak-9 is the second MDS (MDT0001)
      soak-9 has a hard crash in normal operation:
      Soak is started, mount completes

      [  893.779340] LustreError: 11-0: soaked-OST000c-osc-MDT0001: operation ost_connect to node 192.168.1.102@o2ib10 failed: rc = -16^M
      [  893.801270] LustreError: Skipped 139 previous similar messages^M
      [  894.095801] Lustre: soaked-MDT0003-osp-MDT0001: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)^M
      [  894.110919] Lustre: Skipped 7 previous similar messages^M
      [  894.253639] Lustre: soaked-MDT0001: recovery is timed out, evict stale exports^M
      [  894.265333] Lustre: soaked-MDT0001: disconnecting 28 stale clients^M
      [  894.278063] Lustre: soaked-MDT0001: Recovery over after 5:01, of 31 clients 3 recovered and 28 were evicted.^M
      [ 1465.548946] Lustre: soaked-MDT0001: Client 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100) reconnecting^M
      [ 1465.563698] Lustre: soaked-MDT0001: Connection restored to 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100)^M
      [ 1465.579478] Lustre: Skipped 30 previous similar messages^M
      [ 1757.516066] Lustre: soaked-MDT0001: Client 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100) reconnecting^M
      [ 1757.531619] Lustre: soaked-MDT0001: Connection restored to 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100)^M
      [ 1904.507160] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
      [ 1942.000624] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting^M
      [ 2436.463475] Lustre: soaked-MDT0001: Client 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100) reconnecting^M
      [ 2436.478139] Lustre: soaked-MDT0001: Connection restored to 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100)^M
      [ 2436.493810] Lustre: Skipped 2 previous similar messages^M
      [ 2733.438247] Lustre: soaked-MDT0001: Client cb715667-fb6c-b895-e632-274b232c5bc9 (at 192.168.1.119@o2ib100) reconnecting^M
      [ 3117.401693] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting^M
      [ 3117.416360] Lustre: soaked-MDT0001: Connection restored to 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100)^M
      [ 3117.430856] Lustre: Skipped 1 previous similar message^M
      [ 3359.388740] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
      [ 3497.321905] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
      [ 3951.338201] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
      [ 3951.353148] Lustre: soaked-MDT0001: Connection restored to 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100)^M
      [ 3951.368647] Lustre: Skipped 2 previous similar messages^M
      [ 4415.324113] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
      [ 4501.233379] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1494873743/real 1494873743]  req@ffff8803b3329b00 x1567484370368192/t0(0) o104->soaked-MDT0001@192.168.1.139@o2ib100:15/16 lens 296/224 e 0 to 1 dl 1494873750 ref 1 fl Rpc:X/0/ffffffff rc 0/-1^M
      [ 4501.273401] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages^M
      [ 4632.525998] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed: ^M
      [ 4632.539820] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) LBUG^M
      [ 4632.550400] Pid: 3049, comm: ptlrpc_hr01_004^M
      [ 4632.557537] ^M
      [ 4632.557537] Call Trace:^M
      [ 4632.566532]  [<ffffffffa080e7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]^M
      [ 4632.575923]  [<ffffffffa080e87c>] lbug_with_loc+0x4c/0xb0 [libcfs]^M
      [ 4632.585006]  [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]^M
      [ 4632.594107]  [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0^M
      [ 4632.602416]  [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20^M
      [ 4632.611086]  [<ffffffffa0ba23b0>] ? ptlrpc_hr_main+0x0/0x8f0 [ptlrpc]^M
      [ 4632.620009]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0^M
      [ 4632.627027]  [<ffffffff810b0980>] ? kthread+0x0/0xe0^M
      [ 4632.634106]  [<ffffffff81697318>] ret_from_fork+0x58/0x90^M
      [ 4632.641571]  [<ffffffff810b0980>] ? kthread+0x0/0xe0^M
      [ 4632.648510] ^M
      [ 4632.651470] Kernel panic - not syncing: LBUG^M
      [ 4632.657497] CPU: 11 PID: 3049 Comm: ptlrpc_hr01_004 Tainted: P           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1^M
      [ 4632.673033] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [ 4632.686895]  ffffffffa082cdac 00000000329e47aa ffff8808212b7d30 ffffffff81686d1f^M
      [ 4632.696571]  ffff8808212b7db0 ffffffff8168014a ffffffff00000008 ffff8808212b7dc0^M
      [ 4632.706233]  ffff8808212b7d60 00000000329e47aa 00000000329e47aa ffff88082d8cf838^M
      [ 4632.715875] Call Trace:^M
      [ 4632.715875] Call Trace:^M
      [ 4632.719901]  [<ffffffff81686d1f>] dump_stack+0x19/0x1b^M
      [ 4632.726949]  [<ffffffff8168014a>] panic+0xe3/0x1f2^M
      [ 4632.733611]  [<ffffffffa080e894>] lbug_with_loc+0x64/0xb0 [libcfs]^M
      [ 4632.741856]  [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]^M
      [ 4632.750326]  [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0^M
      [ 4632.757970]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20^M
      [ 4632.765508]  [<ffffffffa0ba23b0>] ? ptlrpc_svcpt_stop_threads+0x590/0x590 [ptlrpc]^M
      [ 4632.775290]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0^M
      [ 4632.781973]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140^M
      [ 4632.790485]  [<ffffffff81697318>] ret_from_fork+0x58/0x90^M
      [ 4632.797662]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140^M
      

      System then crashed, crash dump is available on the node.
      vmcore-dmesg attached.

      Attachments

        Issue Links

          Activity

            [LU-9504] LBUG ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.10

            pjones Peter Jones added a comment - Landed for 2.10

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27207/
            Subject: LU-9504 ptlrpc: REP-ACK hr may race with trans commit
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ca4659c9d1f010c1046b634cf5a592a620ac3935

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27207/ Subject: LU-9504 ptlrpc: REP-ACK hr may race with trans commit Project: fs/lustre-release Branch: master Current Patch Set: Commit: ca4659c9d1f010c1046b634cf5a592a620ac3935

            I have run the latest patch for 48 hours on soak. No LBUGS. LFSCK does not complete, but does abort successfully

            cliffw Cliff White (Inactive) added a comment - I have run the latest patch for 48 hours on soak. No LBUGS. LFSCK does not complete, but does abort successfully
            laisiyao Lai Siyao added a comment -

            Cliff, thanks! This helps a lot, so what happened is as below:
            1. ptlrpc_handle_rs() handle REPACK rs, and is about to convert PW lock A to COS mode.
            2. transaction committed, ptlrpc_handle_rs() is called again which decref lock A.
            3. there comes a conflicting lock B, since there is no reader or writer for lock A, it will cancel lock A, which will set its l_granted_mode to 0.
            4. step 1 continues, ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) is triggered.

            I'll update the patch later.

            laisiyao Lai Siyao added a comment - Cliff, thanks! This helps a lot, so what happened is as below: 1. ptlrpc_handle_rs() handle REPACK rs, and is about to convert PW lock A to COS mode. 2. transaction committed, ptlrpc_handle_rs() is called again which decref lock A. 3. there comes a conflicting lock B, since there is no reader or writer for lock A, it will cancel lock A, which will set its l_granted_mode to 0. 4. step 1 continues, ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) is triggered. I'll update the patch later.

            Attached three lustre-log, one from immediately before the LBUG, one after

            cliffw Cliff White (Inactive) added a comment - Attached three lustre-log, one from immediately before the LBUG, one after
            cliffw Cliff White (Inactive) made changes -
            Attachment New: lustre-log.1497161984.5008.preLBUG.txt.gz [ 26968 ]
            Attachment New: lustre-log.1497165943.4581.txt.gz [ 26969 ]
            Attachment New: lustre-log.1497166143.4807.postLBUG.txt.gz [ 26970 ]

            Failed again:

            Jun 11 07:25:40 soak-11 kernel: perf: interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
            Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2549:ldlm_lock_downgrade()) ### weird lock mode ns: mdt-soaked-MDT0003_UUID lock: ffff8806b57f9800/0x3e1187b8839d4a55 lrc: 1/0,0 mode: --/PW res: [0x2c006071b:0x169d:0x0].0x0 bits 0x2 rrc: 2 type: IBT flags: 0x44a19401000000 nid: local remote: 0x0 expref: -99 pid: 7178 timeout: 0 lvb_type: 0
            Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2550:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed:
            Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2550:ldlm_lock_downgrade()) LBUG
            Jun 11 07:25:43 soak-11 kernel: Pid: 4581, comm: ptlrpc_hr01_007
            Jun 11 07:25:43 soak-11 kernel: #012Call Trace:
            Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0bca7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0bca87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0eed7b1>] ldlm_lock_downgrade+0x111/0x210 [ptlrpc]
            Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44818>] ptlrpc_handle_rs+0x5c8/0x700 [ptlrpc]
            Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44a35>] ptlrpc_hr_main+0xe5/0x2c0 [ptlrpc]
            Jun 11 07:25:43 soak-11 kernel: [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20  
            Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44950>] ? ptlrpc_hr_main+0x0/0x2c0 [ptlrpc]
            Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
            Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
            Jun 11 07:25:43 soak-11 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
            Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
            Jun 11 07:25:43 soak-11 kernel:
            Jun 11 07:25:43 soak-11 kernel: LustreError: dumping log to /tmp/lustre-log.1497165943.4581 
            
            cliffw Cliff White (Inactive) added a comment - Failed again: Jun 11 07:25:40 soak-11 kernel: perf: interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2549:ldlm_lock_downgrade()) ### weird lock mode ns: mdt-soaked-MDT0003_UUID lock: ffff8806b57f9800/0x3e1187b8839d4a55 lrc: 1/0,0 mode: --/PW res: [0x2c006071b:0x169d:0x0].0x0 bits 0x2 rrc: 2 type: IBT flags: 0x44a19401000000 nid: local remote: 0x0 expref: -99 pid: 7178 timeout: 0 lvb_type: 0 Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2550:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed: Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2550:ldlm_lock_downgrade()) LBUG Jun 11 07:25:43 soak-11 kernel: Pid: 4581, comm: ptlrpc_hr01_007 Jun 11 07:25:43 soak-11 kernel: #012Call Trace: Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0bca7ee>] libcfs_call_trace+0x4e/0x60 [libcfs] Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0bca87c>] lbug_with_loc+0x4c/0xb0 [libcfs] Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0eed7b1>] ldlm_lock_downgrade+0x111/0x210 [ptlrpc] Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44818>] ptlrpc_handle_rs+0x5c8/0x700 [ptlrpc] Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44a35>] ptlrpc_hr_main+0xe5/0x2c0 [ptlrpc] Jun 11 07:25:43 soak-11 kernel: [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20 Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44950>] ? ptlrpc_hr_main+0x0/0x2c0 [ptlrpc] Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0 Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0 Jun 11 07:25:43 soak-11 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90 Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0 Jun 11 07:25:43 soak-11 kernel: Jun 11 07:25:43 soak-11 kernel: LustreError: dumping log to /tmp/lustre-log.1497165943.4581

            Okay, loading that build now.

            cliffw Cliff White (Inactive) added a comment - Okay, loading that build now.
            laisiyao Lai Siyao added a comment - - edited

            strange, can you access https://review.whamcloud.com/#/c/27207/7 ? which is patch set 7, and the build is https://build.hpdd.intel.com/job/lustre-reviews/48061/.

            I just checked build 48063, which is for LU-9049.

            laisiyao Lai Siyao added a comment - - edited strange, can you access https://review.whamcloud.com/#/c/27207/7 ? which is patch set 7, and the build is https://build.hpdd.intel.com/job/lustre-reviews/48061/ . I just checked build 48063, which is for LU-9049 .

            People

              laisiyao Lai Siyao
              cliffw Cliff White (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: