Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9504

LBUG ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.10.0
    • Lustre 2.10.0
    • Soak stress cluster
    • 3
    • 9223372036854775807

    Description

      Loaded latest master build, started soak. Only fault induced was router drop.
      Soak-9 is the second MDS (MDT0001)
      soak-9 has a hard crash in normal operation:
      Soak is started, mount completes

      [  893.779340] LustreError: 11-0: soaked-OST000c-osc-MDT0001: operation ost_connect to node 192.168.1.102@o2ib10 failed: rc = -16^M
      [  893.801270] LustreError: Skipped 139 previous similar messages^M
      [  894.095801] Lustre: soaked-MDT0003-osp-MDT0001: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)^M
      [  894.110919] Lustre: Skipped 7 previous similar messages^M
      [  894.253639] Lustre: soaked-MDT0001: recovery is timed out, evict stale exports^M
      [  894.265333] Lustre: soaked-MDT0001: disconnecting 28 stale clients^M
      [  894.278063] Lustre: soaked-MDT0001: Recovery over after 5:01, of 31 clients 3 recovered and 28 were evicted.^M
      [ 1465.548946] Lustre: soaked-MDT0001: Client 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100) reconnecting^M
      [ 1465.563698] Lustre: soaked-MDT0001: Connection restored to 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100)^M
      [ 1465.579478] Lustre: Skipped 30 previous similar messages^M
      [ 1757.516066] Lustre: soaked-MDT0001: Client 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100) reconnecting^M
      [ 1757.531619] Lustre: soaked-MDT0001: Connection restored to 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100)^M
      [ 1904.507160] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
      [ 1942.000624] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting^M
      [ 2436.463475] Lustre: soaked-MDT0001: Client 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100) reconnecting^M
      [ 2436.478139] Lustre: soaked-MDT0001: Connection restored to 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100)^M
      [ 2436.493810] Lustre: Skipped 2 previous similar messages^M
      [ 2733.438247] Lustre: soaked-MDT0001: Client cb715667-fb6c-b895-e632-274b232c5bc9 (at 192.168.1.119@o2ib100) reconnecting^M
      [ 3117.401693] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting^M
      [ 3117.416360] Lustre: soaked-MDT0001: Connection restored to 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100)^M
      [ 3117.430856] Lustre: Skipped 1 previous similar message^M
      [ 3359.388740] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
      [ 3497.321905] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
      [ 3951.338201] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
      [ 3951.353148] Lustre: soaked-MDT0001: Connection restored to 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100)^M
      [ 3951.368647] Lustre: Skipped 2 previous similar messages^M
      [ 4415.324113] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
      [ 4501.233379] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1494873743/real 1494873743]  req@ffff8803b3329b00 x1567484370368192/t0(0) o104->soaked-MDT0001@192.168.1.139@o2ib100:15/16 lens 296/224 e 0 to 1 dl 1494873750 ref 1 fl Rpc:X/0/ffffffff rc 0/-1^M
      [ 4501.273401] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages^M
      [ 4632.525998] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed: ^M
      [ 4632.539820] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) LBUG^M
      [ 4632.550400] Pid: 3049, comm: ptlrpc_hr01_004^M
      [ 4632.557537] ^M
      [ 4632.557537] Call Trace:^M
      [ 4632.566532]  [<ffffffffa080e7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]^M
      [ 4632.575923]  [<ffffffffa080e87c>] lbug_with_loc+0x4c/0xb0 [libcfs]^M
      [ 4632.585006]  [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]^M
      [ 4632.594107]  [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0^M
      [ 4632.602416]  [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20^M
      [ 4632.611086]  [<ffffffffa0ba23b0>] ? ptlrpc_hr_main+0x0/0x8f0 [ptlrpc]^M
      [ 4632.620009]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0^M
      [ 4632.627027]  [<ffffffff810b0980>] ? kthread+0x0/0xe0^M
      [ 4632.634106]  [<ffffffff81697318>] ret_from_fork+0x58/0x90^M
      [ 4632.641571]  [<ffffffff810b0980>] ? kthread+0x0/0xe0^M
      [ 4632.648510] ^M
      [ 4632.651470] Kernel panic - not syncing: LBUG^M
      [ 4632.657497] CPU: 11 PID: 3049 Comm: ptlrpc_hr01_004 Tainted: P           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1^M
      [ 4632.673033] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [ 4632.686895]  ffffffffa082cdac 00000000329e47aa ffff8808212b7d30 ffffffff81686d1f^M
      [ 4632.696571]  ffff8808212b7db0 ffffffff8168014a ffffffff00000008 ffff8808212b7dc0^M
      [ 4632.706233]  ffff8808212b7d60 00000000329e47aa 00000000329e47aa ffff88082d8cf838^M
      [ 4632.715875] Call Trace:^M
      [ 4632.715875] Call Trace:^M
      [ 4632.719901]  [<ffffffff81686d1f>] dump_stack+0x19/0x1b^M
      [ 4632.726949]  [<ffffffff8168014a>] panic+0xe3/0x1f2^M
      [ 4632.733611]  [<ffffffffa080e894>] lbug_with_loc+0x64/0xb0 [libcfs]^M
      [ 4632.741856]  [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]^M
      [ 4632.750326]  [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0^M
      [ 4632.757970]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20^M
      [ 4632.765508]  [<ffffffffa0ba23b0>] ? ptlrpc_svcpt_stop_threads+0x590/0x590 [ptlrpc]^M
      [ 4632.775290]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0^M
      [ 4632.781973]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140^M
      [ 4632.790485]  [<ffffffff81697318>] ret_from_fork+0x58/0x90^M
      [ 4632.797662]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140^M
      

      System then crashed, crash dump is available on the node.
      vmcore-dmesg attached.

      Attachments

        Issue Links

          Activity

            [LU-9504] LBUG ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed
            laisiyao Lai Siyao added a comment -

            Cliff, could you post more messages of crash? I added some debug messages, which should be printed upon crash.

            laisiyao Lai Siyao added a comment - Cliff, could you post more messages of crash? I added some debug messages, which should be printed upon crash.

            Hit LU-9563 after first failover

            cliffw Cliff White (Inactive) added a comment - Hit LU-9563 after first failover

            I have loaded soak, will start tests today

            cliffw Cliff White (Inactive) added a comment - I have loaded soak, will start tests today
            laisiyao Lai Siyao added a comment -

            https://review.whamcloud.com/#/c/27207/ is updated, and some debug messages are added, Cliff, will you test again?

            laisiyao Lai Siyao added a comment - https://review.whamcloud.com/#/c/27207/ is updated, and some debug messages are added, Cliff, will you test again?

            Tested with latest version of patch
            https://review.whamcloud.com/#/c/27207/
            Hit LBUG immediately

            ay 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed:
            May 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) LBUG
            May 31 17:11:18 soak-9 kernel: Pid: 4177, comm: ptlrpc_hr01_003
            May 31 17:11:18 soak-9 kernel: #012Call Trace:
            May 31 17:11:18 soak-9 kernel: [<ffffffffa08247ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa082487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0b6283b>] ldlm_lock_downgrade+0x19b/0x1d0 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9620>] ptlrpc_handle_rs+0x3f0/0x640 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9955>] ptlrpc_hr_main+0xe5/0x2c0 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9870>] ? ptlrpc_hr_main+0x0/0x2c0 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
            May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
            May 31 17:11:19 soak-9 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
            May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
            May 31 17:11:19 soak-9 kernel:
            May 31 17:11:19 soak-9 kernel: Kernel panic - not syncing: LBUG
            
            cliffw Cliff White (Inactive) added a comment - Tested with latest version of patch https://review.whamcloud.com/#/c/27207/ Hit LBUG immediately ay 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed: May 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) LBUG May 31 17:11:18 soak-9 kernel: Pid: 4177, comm: ptlrpc_hr01_003 May 31 17:11:18 soak-9 kernel: #012Call Trace: May 31 17:11:18 soak-9 kernel: [<ffffffffa08247ee>] libcfs_call_trace+0x4e/0x60 [libcfs] May 31 17:11:19 soak-9 kernel: [<ffffffffa082487c>] lbug_with_loc+0x4c/0xb0 [libcfs] May 31 17:11:19 soak-9 kernel: [<ffffffffa0b6283b>] ldlm_lock_downgrade+0x19b/0x1d0 [ptlrpc] May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9620>] ptlrpc_handle_rs+0x3f0/0x640 [ptlrpc] May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9955>] ptlrpc_hr_main+0xe5/0x2c0 [ptlrpc] May 31 17:11:19 soak-9 kernel: [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20 May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9870>] ? ptlrpc_hr_main+0x0/0x2c0 [ptlrpc] May 31 17:11:19 soak-9 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0 May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0 May 31 17:11:19 soak-9 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90 May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0 May 31 17:11:19 soak-9 kernel: May 31 17:11:19 soak-9 kernel: Kernel panic - not syncing: LBUG

            Hit this on soak with latest master. Patch should land to master.

            cliffw Cliff White (Inactive) added a comment - Hit this on soak with latest master. Patch should land to master.
            cliffw Cliff White (Inactive) added a comment - - edited

            Patch ran > 48 hours (weekend) No hard crashes, survived multiple failovers
            (14 OSS, 13 MDS) without a hard crash. Unfortunately detailed examination of results
            indicated a problem with the soak load engine, the failovers occurred by the clients were idle.

            After fixing the load issue, restarted soak, Had one client lockup, then hard server LBUG after second OST failover. Bug is LU-9547 - may not be related to this issue, we successfully performed two mds failovers.

            cliffw Cliff White (Inactive) added a comment - - edited Patch ran > 48 hours (weekend) No hard crashes, survived multiple failovers (14 OSS, 13 MDS) without a hard crash. Unfortunately detailed examination of results indicated a problem with the soak load engine, the failovers occurred by the clients were idle. After fixing the load issue, restarted soak, Had one client lockup, then hard server LBUG after second OST failover. Bug is LU-9547 - may not be related to this issue, we successfully performed two mds failovers.

            Patch has survived for three hours, one mds_restart one oss_restart and one router_delay. We'll see how it does over the weekend, but so far good.

            cliffw Cliff White (Inactive) added a comment - Patch has survived for three hours, one mds_restart one oss_restart and one router_delay. We'll see how it does over the weekend, but so far good.

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/27207
            Subject: LU-9504 ptlrpc: REP-ACK hr may race with trans commit
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fcd7bdd08ec2c23d1e7557788a42e04b97590e02

            gerrit Gerrit Updater added a comment - Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/27207 Subject: LU-9504 ptlrpc: REP-ACK hr may race with trans commit Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: fcd7bdd08ec2c23d1e7557788a42e04b97590e02
            laisiyao Lai Siyao added a comment -

            This assert is from https://review.whamcloud.com/22807 for LU-8650, which assumes all the handling of a reply will be in one specific hr thread, so it doesn't need to consider race, now it looks to be not true, I'll review the code to verify it.

            laisiyao Lai Siyao added a comment - This assert is from https://review.whamcloud.com/22807 for LU-8650 , which assumes all the handling of a reply will be in one specific hr thread, so it doesn't need to consider race, now it looks to be not true, I'll review the code to verify it.
            pjones Peter Jones added a comment -

            To translate that into git commits

            LU-8943 lnd: Enable Multiple OPA Endpoints between Nodes — oleg.drokin / gitweb
            LU-5788 test: fix the cmd to getfree space — oleg.drokin / gitweb
            LU-9324 lfs: output stripe info in YAML format — oleg.drokin / gitweb
            LU-9357 pfl: should inherit pool from previous layout comp — oleg.drokin / gitweb
            LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects — oleg.drokin / gitweb
            LU-9324 lnet: move cyaml.h under lnet/include/ — oleg.drokin / gitweb
            LU-7108 test: Remove sanityn tests from ALWAYS_EXCEPT list — oleg.drokin / gitweb
            LU-7567 utils: fix timestamp printing in lfs_changelog() — oleg.drokin / gitweb
            LU-8346 obdclass: guarantee all keys filled — oleg.drokin / gitweb
            LU-9439 scripts: Change behavior of lustre_rmmod — oleg.drokin / gitweb
            LU-9456 lnd: Change sock_create() to sock_create_kern() — oleg.drokin / gitweb
            LU-8294 gss: quiet cache_check return ENOENT warning — oleg.drokin / gitweb
            LU-9458 ptlrpc: handle case of epp_free_pages <= PTLRPC_MAX_BRW_PAGES — oleg.drokin / gitweb
            LU-9096 test: add sanity 253 to ALWAYS_EXCEPT — oleg.drokin / gitweb

            pjones Peter Jones added a comment - To translate that into git commits LU-8943 lnd: Enable Multiple OPA Endpoints between Nodes — oleg.drokin / gitweb LU-5788 test: fix the cmd to getfree space — oleg.drokin / gitweb LU-9324 lfs: output stripe info in YAML format — oleg.drokin / gitweb LU-9357 pfl: should inherit pool from previous layout comp — oleg.drokin / gitweb LU-9372 ptlrpc: drain "ptlrpc_request_buffer_desc" objects — oleg.drokin / gitweb LU-9324 lnet: move cyaml.h under lnet/include/ — oleg.drokin / gitweb LU-7108 test: Remove sanityn tests from ALWAYS_EXCEPT list — oleg.drokin / gitweb LU-7567 utils: fix timestamp printing in lfs_changelog() — oleg.drokin / gitweb LU-8346 obdclass: guarantee all keys filled — oleg.drokin / gitweb LU-9439 scripts: Change behavior of lustre_rmmod — oleg.drokin / gitweb LU-9456 lnd: Change sock_create() to sock_create_kern() — oleg.drokin / gitweb LU-8294 gss: quiet cache_check return ENOENT warning — oleg.drokin / gitweb LU-9458 ptlrpc: handle case of epp_free_pages <= PTLRPC_MAX_BRW_PAGES — oleg.drokin / gitweb LU-9096 test: add sanity 253 to ALWAYS_EXCEPT — oleg.drokin / gitweb

            People

              laisiyao Lai Siyao
              cliffw Cliff White (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: