[LU-9504] LBUG ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.10.0
    • Fix Version/s: Lustre 2.10.0
    • Environment: Soak stress cluster
    • Severity: 3

    Description

      Loaded the latest master build and started soak; the only fault induced was router drop.
      Soak-9 is the second MDS (MDT0001).
      soak-9 had a hard crash in normal operation. Soak was started and the mount completed, then:

      [  893.779340] LustreError: 11-0: soaked-OST000c-osc-MDT0001: operation ost_connect to node 192.168.1.102@o2ib10 failed: rc = -16
      [  893.801270] LustreError: Skipped 139 previous similar messages
      [  894.095801] Lustre: soaked-MDT0003-osp-MDT0001: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
      [  894.110919] Lustre: Skipped 7 previous similar messages
      [  894.253639] Lustre: soaked-MDT0001: recovery is timed out, evict stale exports
      [  894.265333] Lustre: soaked-MDT0001: disconnecting 28 stale clients
      [  894.278063] Lustre: soaked-MDT0001: Recovery over after 5:01, of 31 clients 3 recovered and 28 were evicted.
      [ 1465.548946] Lustre: soaked-MDT0001: Client 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100) reconnecting
      [ 1465.563698] Lustre: soaked-MDT0001: Connection restored to 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100)
      [ 1465.579478] Lustre: Skipped 30 previous similar messages
      [ 1757.516066] Lustre: soaked-MDT0001: Client 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100) reconnecting
      [ 1757.531619] Lustre: soaked-MDT0001: Connection restored to 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100)
      [ 1904.507160] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
      [ 1942.000624] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting
      [ 2436.463475] Lustre: soaked-MDT0001: Client 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100) reconnecting
      [ 2436.478139] Lustre: soaked-MDT0001: Connection restored to 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100)
      [ 2436.493810] Lustre: Skipped 2 previous similar messages
      [ 2733.438247] Lustre: soaked-MDT0001: Client cb715667-fb6c-b895-e632-274b232c5bc9 (at 192.168.1.119@o2ib100) reconnecting
      [ 3117.401693] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting
      [ 3117.416360] Lustre: soaked-MDT0001: Connection restored to 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100)
      [ 3117.430856] Lustre: Skipped 1 previous similar message
      [ 3359.388740] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
      [ 3497.321905] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
      [ 3951.338201] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
      [ 3951.353148] Lustre: soaked-MDT0001: Connection restored to 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100)
      [ 3951.368647] Lustre: Skipped 2 previous similar messages
      [ 4415.324113] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
      [ 4501.233379] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1494873743/real 1494873743]  req@ffff8803b3329b00 x1567484370368192/t0(0) o104->soaked-MDT0001@192.168.1.139@o2ib100:15/16 lens 296/224 e 0 to 1 dl 1494873750 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [ 4501.273401] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      [ 4632.525998] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed:
      [ 4632.539820] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) LBUG
      [ 4632.550400] Pid: 3049, comm: ptlrpc_hr01_004
      [ 4632.557537]
      [ 4632.557537] Call Trace:
      [ 4632.566532]  [<ffffffffa080e7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [ 4632.575923]  [<ffffffffa080e87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [ 4632.585006]  [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]
      [ 4632.594107]  [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0
      [ 4632.602416]  [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
      [ 4632.611086]  [<ffffffffa0ba23b0>] ? ptlrpc_hr_main+0x0/0x8f0 [ptlrpc]
      [ 4632.620009]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 4632.627027]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
      [ 4632.634106]  [<ffffffff81697318>] ret_from_fork+0x58/0x90
      [ 4632.641571]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
      [ 4632.648510]
      [ 4632.651470] Kernel panic - not syncing: LBUG
      [ 4632.657497] CPU: 11 PID: 3049 Comm: ptlrpc_hr01_004 Tainted: P           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
      [ 4632.673033] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [ 4632.686895]  ffffffffa082cdac 00000000329e47aa ffff8808212b7d30 ffffffff81686d1f
      [ 4632.696571]  ffff8808212b7db0 ffffffff8168014a ffffffff00000008 ffff8808212b7dc0
      [ 4632.706233]  ffff8808212b7d60 00000000329e47aa 00000000329e47aa ffff88082d8cf838
      [ 4632.715875] Call Trace:
      [ 4632.719901]  [<ffffffff81686d1f>] dump_stack+0x19/0x1b
      [ 4632.726949]  [<ffffffff8168014a>] panic+0xe3/0x1f2
      [ 4632.733611]  [<ffffffffa080e894>] lbug_with_loc+0x64/0xb0 [libcfs]
      [ 4632.741856]  [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]
      [ 4632.750326]  [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0
      [ 4632.757970]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
      [ 4632.765508]  [<ffffffffa0ba23b0>] ? ptlrpc_svcpt_stop_threads+0x590/0x590 [ptlrpc]
      [ 4632.775290]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 4632.781973]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      [ 4632.790485]  [<ffffffff81697318>] ret_from_fork+0x58/0x90
      [ 4632.797662]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      

      The system then crashed; a crash dump is available on the node.
      vmcore-dmesg is attached.

      Attachments

        Issue Links

          Activity

            cliffw Cliff White (Inactive) added a comment -

            I am still waiting for the failure to repeat.

            laisiyao Lai Siyao added a comment -

            Cliff, could you post more messages from the crash? I added some debug messages, which should be printed upon crash.

            cliffw Cliff White (Inactive) added a comment -

            Hit LU-9563 after first failover.

            cliffw Cliff White (Inactive) added a comment -

            I have loaded soak, will start tests today.

            laisiyao Lai Siyao added a comment -

            https://review.whamcloud.com/#/c/27207/ is updated and some debug messages are added. Cliff, will you test again?

            cliffw Cliff White (Inactive) added a comment -

            Tested with the latest version of the patch
            https://review.whamcloud.com/#/c/27207/
            Hit the LBUG immediately:

            May 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed:
            May 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) LBUG
            May 31 17:11:18 soak-9 kernel: Pid: 4177, comm: ptlrpc_hr01_003
            May 31 17:11:18 soak-9 kernel: Call Trace:
            May 31 17:11:18 soak-9 kernel: [<ffffffffa08247ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa082487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0b6283b>] ldlm_lock_downgrade+0x19b/0x1d0 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9620>] ptlrpc_handle_rs+0x3f0/0x640 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9955>] ptlrpc_hr_main+0xe5/0x2c0 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9870>] ? ptlrpc_hr_main+0x0/0x2c0 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
            May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
            May 31 17:11:19 soak-9 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
            May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
            May 31 17:11:19 soak-9 kernel:
            May 31 17:11:19 soak-9 kernel: Kernel panic - not syncing: LBUG
            
            cliffw Cliff White (Inactive) added a comment -

            Hit this on soak with the latest master. The patch should land to master.

            cliffw Cliff White (Inactive) added a comment - - edited

            The patch ran > 48 hours (over the weekend) and survived multiple failovers
            (14 OSS, 13 MDS) without a hard crash. Unfortunately, detailed examination of the results
            indicated a problem with the soak load engine: the failovers occurred, but the clients were idle.

            After fixing the load issue, soak was restarted. There was one client lockup, then a hard server LBUG after the second OST failover. That bug is LU-9547 and may not be related to this issue; we successfully performed two MDS failovers.

            cliffw Cliff White (Inactive) added a comment -

            The patch has survived for three hours: one mds_restart, one oss_restart, and one router_delay. We'll see how it does over the weekend, but so far so good.

            gerrit Gerrit Updater added a comment -

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/27207
            Subject: LU-9504 ptlrpc: REP-ACK hr may race with trans commit
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fcd7bdd08ec2c23d1e7557788a42e04b97590e02

            laisiyao Lai Siyao added a comment -

            This assert is from https://review.whamcloud.com/22807 for LU-8650, which assumes all handling of a reply happens in one specific hr thread and therefore does not need to consider races. That now looks to be untrue; I'll review the code to verify it.
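
            A minimal, self-contained user-space sketch of that race class may help picture it. The names below (reply_state, rs_saved_lock, commit_thread, hr_thread) are hypothetical; this is neither the actual Lustre code nor the 27207 patch. It only shows the shape of the problem: the REP-ACK handler and the transaction commit both want to release the lock reference saved with a reply, so an unconditional non-NULL assertion in the handler can trip once a commit has already released and cleared it. Serializing the two paths and checking under a lock, as sketched, avoids that.

            /*
             * Hypothetical user-space sketch; not Lustre code and not the
             * https://review.whamcloud.com/27207 patch.  Build with:
             *     cc -pthread -o race_sketch race_sketch.c
             */
            #include <pthread.h>
            #include <stdio.h>

            struct reply_state {
                pthread_mutex_t rs_mutex;       /* serializes REP-ACK handling vs. commit */
                void           *rs_saved_lock;  /* lock reference saved with the reply */
            };

            /* Transaction-commit path: drop the saved lock reference, if still there. */
            static void *commit_thread(void *arg)
            {
                struct reply_state *rs = arg;

                pthread_mutex_lock(&rs->rs_mutex);
                if (rs->rs_saved_lock != NULL) {
                    /* ...release/decref the lock here... */
                    rs->rs_saved_lock = NULL;
                }
                pthread_mutex_unlock(&rs->rs_mutex);
                return NULL;
            }

            /* REP-ACK handler path: downgrade/decref the saved lock.  Without
             * rs_mutex, an unconditional assert(rs->rs_saved_lock != NULL) here
             * would race with commit_thread() clearing the same field, which is
             * the shape of the LASSERT failure reported in this ticket. */
            static void *hr_thread(void *arg)
            {
                struct reply_state *rs = arg;

                pthread_mutex_lock(&rs->rs_mutex);
                if (rs->rs_saved_lock != NULL) {
                    /* ...downgrade/decref the lock here... */
                    rs->rs_saved_lock = NULL;
                }
                pthread_mutex_unlock(&rs->rs_mutex);
                return NULL;
            }

            int main(void)
            {
                static int dummy_lock;              /* stands in for an ldlm lock */
                struct reply_state rs = {
                    .rs_mutex      = PTHREAD_MUTEX_INITIALIZER,
                    .rs_saved_lock = &dummy_lock,
                };
                pthread_t commit, hr;

                pthread_create(&commit, NULL, commit_thread, &rs);
                pthread_create(&hr, NULL, hr_thread, &rs);
                pthread_join(commit, NULL);
                pthread_join(hr, NULL);

                printf("saved lock released exactly once: %s\n",
                       rs.rs_saved_lock == NULL ? "yes" : "no");
                return 0;
            }

            The check-and-clear-under-mutex pattern above is only illustrative; which path is responsible for the final release, and where the serialization belongs, is what a real fix has to decide.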

            People

              Assignee: laisiyao Lai Siyao
              Reporter: cliffw Cliff White (Inactive)
              Votes: 1
              Watchers: 8
