  Lustre / LU-9504

LBUG ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.10.0
    • Fix Version/s: Lustre 2.10.0
    • Environment: Soak stress cluster
    • Severity: 3

    Description

      Loaded the latest master build and started soak; the only fault induced was a router drop.
      soak-9 is the second MDS (MDT0001).
      soak-9 had a hard crash in normal operation: soak was started and the mount completed, then:

      [  893.779340] LustreError: 11-0: soaked-OST000c-osc-MDT0001: operation ost_connect to node 192.168.1.102@o2ib10 failed: rc = -16
      [  893.801270] LustreError: Skipped 139 previous similar messages
      [  894.095801] Lustre: soaked-MDT0003-osp-MDT0001: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
      [  894.110919] Lustre: Skipped 7 previous similar messages
      [  894.253639] Lustre: soaked-MDT0001: recovery is timed out, evict stale exports
      [  894.265333] Lustre: soaked-MDT0001: disconnecting 28 stale clients
      [  894.278063] Lustre: soaked-MDT0001: Recovery over after 5:01, of 31 clients 3 recovered and 28 were evicted.
      [ 1465.548946] Lustre: soaked-MDT0001: Client 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100) reconnecting
      [ 1465.563698] Lustre: soaked-MDT0001: Connection restored to 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100)
      [ 1465.579478] Lustre: Skipped 30 previous similar messages
      [ 1757.516066] Lustre: soaked-MDT0001: Client 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100) reconnecting
      [ 1757.531619] Lustre: soaked-MDT0001: Connection restored to 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100)
      [ 1904.507160] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
      [ 1942.000624] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting
      [ 2436.463475] Lustre: soaked-MDT0001: Client 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100) reconnecting
      [ 2436.478139] Lustre: soaked-MDT0001: Connection restored to 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100)
      [ 2436.493810] Lustre: Skipped 2 previous similar messages
      [ 2733.438247] Lustre: soaked-MDT0001: Client cb715667-fb6c-b895-e632-274b232c5bc9 (at 192.168.1.119@o2ib100) reconnecting
      [ 3117.401693] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting
      [ 3117.416360] Lustre: soaked-MDT0001: Connection restored to 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100)
      [ 3117.430856] Lustre: Skipped 1 previous similar message
      [ 3359.388740] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
      [ 3497.321905] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
      [ 3951.338201] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
      [ 3951.353148] Lustre: soaked-MDT0001: Connection restored to 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100)
      [ 3951.368647] Lustre: Skipped 2 previous similar messages
      [ 4415.324113] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting
      [ 4501.233379] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1494873743/real 1494873743]  req@ffff8803b3329b00 x1567484370368192/t0(0) o104->soaked-MDT0001@192.168.1.139@o2ib100:15/16 lens 296/224 e 0 to 1 dl 1494873750 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [ 4501.273401] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      [ 4632.525998] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed:
      [ 4632.539820] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) LBUG
      [ 4632.550400] Pid: 3049, comm: ptlrpc_hr01_004
      [ 4632.557537]
      [ 4632.557537] Call Trace:
      [ 4632.566532]  [<ffffffffa080e7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [ 4632.575923]  [<ffffffffa080e87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [ 4632.585006]  [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]
      [ 4632.594107]  [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0
      [ 4632.602416]  [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
      [ 4632.611086]  [<ffffffffa0ba23b0>] ? ptlrpc_hr_main+0x0/0x8f0 [ptlrpc]
      [ 4632.620009]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 4632.627027]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
      [ 4632.634106]  [<ffffffff81697318>] ret_from_fork+0x58/0x90
      [ 4632.641571]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
      [ 4632.648510]
      [ 4632.651470] Kernel panic - not syncing: LBUG
      [ 4632.657497] CPU: 11 PID: 3049 Comm: ptlrpc_hr01_004 Tainted: P           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
      [ 4632.673033] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      [ 4632.686895]  ffffffffa082cdac 00000000329e47aa ffff8808212b7d30 ffffffff81686d1f
      [ 4632.696571]  ffff8808212b7db0 ffffffff8168014a ffffffff00000008 ffff8808212b7dc0
      [ 4632.706233]  ffff8808212b7d60 00000000329e47aa 00000000329e47aa ffff88082d8cf838
      [ 4632.715875] Call Trace:
      [ 4632.719901]  [<ffffffff81686d1f>] dump_stack+0x19/0x1b
      [ 4632.726949]  [<ffffffff8168014a>] panic+0xe3/0x1f2
      [ 4632.733611]  [<ffffffffa080e894>] lbug_with_loc+0x64/0xb0 [libcfs]
      [ 4632.741856]  [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]
      [ 4632.750326]  [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0
      [ 4632.757970]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
      [ 4632.765508]  [<ffffffffa0ba23b0>] ? ptlrpc_svcpt_stop_threads+0x590/0x590 [ptlrpc]
      [ 4632.775290]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 4632.781973]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      [ 4632.790485]  [<ffffffff81697318>] ret_from_fork+0x58/0x90
      [ 4632.797662]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      

      The system then crashed; a crash dump is available on the node.
      vmcore-dmesg attached.
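
      For context, the assertion that fired lives in the reply-state handler in ptlrpc
      (ptlrpc_handle_rs() in service.c): once a reply is known to be committed, the handler
      resolves each lock handle saved in the reply state and releases the lock. Below is a
      minimal, paraphrased sketch of that loop, not the verbatim source; variable and field
      names are approximate and may differ between Lustre versions:

      /* Paraphrased sketch of the loop around service.c:2229 in
       * ptlrpc_handle_rs(); names are approximate.  The LBUG fires when a
       * lock handle saved in the reply state no longer resolves to a live
       * lock. */
      while (nlocks-- > 0) {
              struct ldlm_lock *lock;

              lock = ldlm_handle2lock(&rs->rs_locks[nlocks]);
              LASSERT(lock != NULL);  /* ASSERTION( lock != ((void *)0) ) */
              ldlm_lock_decref(lock, rs->rs_modes[nlocks]);
              LDLM_LOCK_PUT(lock);
      }

      In other words, a lock recorded in the reply state was freed or invalidated before the
      reply was finally processed.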

    Attachments

    Issue Links

    Activity
            [LU-9504] LBUG ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed
            laisiyao Lai Siyao added a comment - edited

            It's strange that I don't see the debug messages before the LBUG(). Cliff, which update of https://review.whamcloud.com/#/c/27207/ did you use in this test? The debug messages have been there since Jun 3 (update 5). If possible, can you retest with update 7, which adds more debug messages right before the LBUG?


            cliffw Cliff White (Inactive) added a comment -

            Hit LBUG again; syslog and lustre-logs from the dead server are attached.


            cliffw Cliff White (Inactive) added a comment -

            Here is the log from just prior to the LBUG.

            Jun  5 16:44:31 soak-11 kernel: Lustre: Skipped 1 previous similar message
            Jun  5 16:44:40 soak-11 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.124@o2ib100 (stopping)
            Jun  5 16:44:40 soak-11 kernel: Lustre: Skipped 5 previous similar messages
            Jun  5 16:44:44 soak-11 kernel: LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0002_UUID namespace with 2 resources in use, (rc=-110)
            Jun  5 16:45:02 soak-11 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.136@o2ib100 (stopping)
            Jun  5 16:45:02 soak-11 kernel: Lustre: Skipped 20 previous similar messages
            Jun  5 16:45:07 soak-11 kernel: Lustre: 4229:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1496681101/real 0]  req@ffff8807e7265700 x1569380820308784/t0(0) o38->soaked-MDT0002-osp-MDT0003@192.168.1.110@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1496681107 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            Jun  5 16:45:09 soak-11 kernel: LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0002_UUID namespace with 2 resources in use, (rc=-110)
            Jun  5 16:45:34 soak-11 kernel: LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0002_UUID namespace with 2 resources in use, (rc=-110)
            Jun  5 16:45:41 soak-11 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.102@o2ib10 (stopping)
            Jun  5 16:45:41 soak-11 kernel: Lustre: Skipped 11 previous similar messages
            Jun  5 16:45:51 soak-11 kernel: LustreError: 5001:0:(lod_qos.c:208:lod_statfs_and_check()) soaked-MDT0002-mdtlov: statfs: rc = -108
            Jun  5 16:45:51 soak-11 kernel: Lustre: 5001:0:(service.c:2117:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (31:69s); client may timeout.  req@ffff8803f1b0ef00 x1569380804405664/t0(0) o101->bd5dff2c-b3c7-e6f9-ab0e-ec6b3d83954a@192.168.1.117@o2ib100:37/0 lens 1728/544 e 1 to 0 dl 1496681082 ref 1 fl Complete:/0/0 rc -19/-19
            Jun  5 16:45:54 soak-11 kernel: Lustre: server umount soaked-MDT0002 complete
            Jun  5 16:45:54 soak-11 sshd[5502]: Received disconnect from 10.10.1.135: 11: disconnected by user
            Jun  5 16:45:54 soak-11 sshd[5502]: pam_unix(sshd:session): session closed for user root
            Jun  5 16:45:54 soak-11 systemd-logind: Removed session 15.
            Jun  5 16:45:54 soak-11 systemd: Removed slice user-0.slice.
            Jun  5 16:45:54 soak-11 systemd: Stopping user-0.slice.
            Jun  5 16:45:56 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to 192.168.1.110@o2ib10 (at 192.168.1.110@o2ib10)
            Jun  5 16:45:56 soak-11 kernel: Lustre: Skipped 26 previous similar messages
            Jun  5 16:51:12 soak-11 kernel: LustreError: 4204:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed:
            

            cliffw Cliff White (Inactive) added a comment -

            I am still waiting for the failure to repeat.

            laisiyao Lai Siyao added a comment -

            Cliff, could you post more messages from the crash? I added some debug messages that should be printed upon crash.


            cliffw Cliff White (Inactive) added a comment -

            Hit LU-9563 after the first failover.


            cliffw Cliff White (Inactive) added a comment -

            I have loaded soak and will start tests today.

            laisiyao Lai Siyao added a comment -

            https://review.whamcloud.com/#/c/27207/ has been updated and some debug messages were added. Cliff, will you test again?


            cliffw Cliff White (Inactive) added a comment -

            Tested with the latest version of the patch
            https://review.whamcloud.com/#/c/27207/
            and hit the LBUG immediately:

            May 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed:
            May 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) LBUG
            May 31 17:11:18 soak-9 kernel: Pid: 4177, comm: ptlrpc_hr01_003
            May 31 17:11:18 soak-9 kernel: Call Trace:
            May 31 17:11:18 soak-9 kernel: [<ffffffffa08247ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa082487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0b6283b>] ldlm_lock_downgrade+0x19b/0x1d0 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9620>] ptlrpc_handle_rs+0x3f0/0x640 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9955>] ptlrpc_hr_main+0xe5/0x2c0 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
            May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9870>] ? ptlrpc_hr_main+0x0/0x2c0 [ptlrpc]
            May 31 17:11:19 soak-9 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
            May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
            May 31 17:11:19 soak-9 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
            May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
            May 31 17:11:19 soak-9 kernel:
            May 31 17:11:19 soak-9 kernel: Kernel panic - not syncing: LBUG
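
            The trace above is the same reply-state path one level deeper: for commit-on-share,
            ptlrpc_handle_rs() downgrades the saved PW/EX lock to LCK_COS instead of dropping it,
            and ldlm_lock_downgrade() asserts that the lock is still granted in PW or EX mode.
            A rough, paraphrased sketch of that entry check (details vary by Lustre version):

            /* Paraphrased sketch of ldlm_lock_downgrade() around
             * ldlm_lock.c:2548, called from ptlrpc_handle_rs() for
             * commit-on-share locks.  The LBUG fires if the lock is no
             * longer granted in PW or EX mode, e.g. because it was already
             * downgraded or released concurrently. */
            void ldlm_lock_downgrade(struct ldlm_lock *lock, enum ldlm_mode new_mode)
            {
                    LASSERT(new_mode == LCK_COS);
                    lock_res_and_lock(lock);
                    LASSERT(lock->l_granted_mode & (LCK_PW | LCK_EX));
                    /* ... re-grant the lock in the new (COS) mode ... */
                    unlock_res_and_lock(lock);
            }

            Both assertions thus appear to catch the same underlying race on the reply state's
            saved locks, just at different points.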
            

            cliffw Cliff White (Inactive) added a comment -

            Hit this on soak with the latest master. The patch should land to master.

            cliffw Cliff White (Inactive) added a comment - edited

            The patch ran for more than 48 hours (over the weekend) and survived multiple
            failovers (14 OSS, 13 MDS) without a hard crash. Unfortunately, detailed examination
            of the results indicated a problem with the soak load engine: the failovers occurred,
            but the clients were idle.

            After fixing the load issue, we restarted soak. We had one client lockup, then a hard
            server LBUG after the second OST failover. That bug is LU-9547 and may not be related
            to this issue; we successfully performed two MDS failovers.


            People

              Assignee: laisiyao Lai Siyao
              Reporter: cliffw Cliff White (Inactive)
              Votes: 1
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: