[LU-9504] LBUG ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed Created: 15/May/17 Updated: 19/Jun/17 Resolved: 19/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | soak | ||
| Environment: |
Soak stress cluster |
||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
Loaded latest master build, started soak. Only fault induced was router drop. [ 893.779340] LustreError: 11-0: soaked-OST000c-osc-MDT0001: operation ost_connect to node 192.168.1.102@o2ib10 failed: rc = -16^M
[ 893.801270] LustreError: Skipped 139 previous similar messages^M
[ 894.095801] Lustre: soaked-MDT0003-osp-MDT0001: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)^M
[ 894.110919] Lustre: Skipped 7 previous similar messages^M
[ 894.253639] Lustre: soaked-MDT0001: recovery is timed out, evict stale exports^M
[ 894.265333] Lustre: soaked-MDT0001: disconnecting 28 stale clients^M
[ 894.278063] Lustre: soaked-MDT0001: Recovery over after 5:01, of 31 clients 3 recovered and 28 were evicted.^M
[ 1465.548946] Lustre: soaked-MDT0001: Client 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100) reconnecting^M
[ 1465.563698] Lustre: soaked-MDT0001: Connection restored to 8e539072-a775-2171-7825-433ade3d0c39 (at 192.168.1.132@o2ib100)^M
[ 1465.579478] Lustre: Skipped 30 previous similar messages^M
[ 1757.516066] Lustre: soaked-MDT0001: Client 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100) reconnecting^M
[ 1757.531619] Lustre: soaked-MDT0001: Connection restored to 6fb4512c-e89d-8233-88ad-b696d11c9821 (at 192.168.1.138@o2ib100)^M
[ 1904.507160] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
[ 1942.000624] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting^M
[ 2436.463475] Lustre: soaked-MDT0001: Client 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100) reconnecting^M
[ 2436.478139] Lustre: soaked-MDT0001: Connection restored to 2d2ea61b-cf5f-add6-6e06-30458f85a726 (at 192.168.1.139@o2ib100)^M
[ 2436.493810] Lustre: Skipped 2 previous similar messages^M
[ 2733.438247] Lustre: soaked-MDT0001: Client cb715667-fb6c-b895-e632-274b232c5bc9 (at 192.168.1.119@o2ib100) reconnecting^M
[ 3117.401693] Lustre: soaked-MDT0001: Client 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100) reconnecting^M
[ 3117.416360] Lustre: soaked-MDT0001: Connection restored to 0a5dee9d-f606-8dd1-e9d4-d42a75c735e1 (at 192.168.1.129@o2ib100)^M
[ 3117.430856] Lustre: Skipped 1 previous similar message^M
[ 3359.388740] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
[ 3497.321905] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
[ 3951.338201] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
[ 3951.353148] Lustre: soaked-MDT0001: Connection restored to 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100)^M
[ 3951.368647] Lustre: Skipped 2 previous similar messages^M
[ 4415.324113] Lustre: soaked-MDT0001: Client 6ae0a03c-6567-9e19-d483-8743882e83e1 (at 192.168.1.116@o2ib100) reconnecting^M
[ 4501.233379] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1494873743/real 1494873743] req@ffff8803b3329b00 x1567484370368192/t0(0) o104->soaked-MDT0001@192.168.1.139@o2ib100:15/16 lens 296/224 e 0 to 1 dl 1494873750 ref 1 fl Rpc:X/0/ffffffff rc 0/-1^M
[ 4501.273401] Lustre: 3876:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages^M
[ 4632.525998] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) ASSERTION( lock != ((void *)0) ) failed: ^M
[ 4632.539820] LustreError: 3049:0:(service.c:2229:ptlrpc_handle_rs()) LBUG^M
[ 4632.550400] Pid: 3049, comm: ptlrpc_hr01_004^M
[ 4632.557537] ^M
[ 4632.557537] Call Trace:^M
[ 4632.566532] [<ffffffffa080e7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]^M
[ 4632.575923] [<ffffffffa080e87c>] lbug_with_loc+0x4c/0xb0 [libcfs]^M
[ 4632.585006] [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]^M
[ 4632.594107] [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0^M
[ 4632.602416] [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20^M
[ 4632.611086] [<ffffffffa0ba23b0>] ? ptlrpc_hr_main+0x0/0x8f0 [ptlrpc]^M
[ 4632.620009] [<ffffffff810b0a4f>] kthread+0xcf/0xe0^M
[ 4632.627027] [<ffffffff810b0980>] ? kthread+0x0/0xe0^M
[ 4632.634106] [<ffffffff81697318>] ret_from_fork+0x58/0x90^M
[ 4632.641571] [<ffffffff810b0980>] ? kthread+0x0/0xe0^M
[ 4632.648510] ^M
[ 4632.651470] Kernel panic - not syncing: LBUG^M
[ 4632.657497] CPU: 11 PID: 3049 Comm: ptlrpc_hr01_004 Tainted: P OE ------------ 3.10.0-514.16.1.el7_lustre.x86_64 #1^M
[ 4632.673033] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
[ 4632.686895] ffffffffa082cdac 00000000329e47aa ffff8808212b7d30 ffffffff81686d1f^M
[ 4632.696571] ffff8808212b7db0 ffffffff8168014a ffffffff00000008 ffff8808212b7dc0^M
[ 4632.706233] ffff8808212b7d60 00000000329e47aa 00000000329e47aa ffff88082d8cf838^M
[ 4632.715875] Call Trace:^M
[ 4632.715875] Call Trace:^M
[ 4632.719901] [<ffffffff81686d1f>] dump_stack+0x19/0x1b^M
[ 4632.726949] [<ffffffff8168014a>] panic+0xe3/0x1f2^M
[ 4632.733611] [<ffffffffa080e894>] lbug_with_loc+0x64/0xb0 [libcfs]^M
[ 4632.741856] [<ffffffffa0ba2bed>] ptlrpc_hr_main+0x83d/0x8f0 [ptlrpc]^M
[ 4632.750326] [<ffffffff810c8345>] ? sched_clock_cpu+0x85/0xc0^M
[ 4632.757970] [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20^M
[ 4632.765508] [<ffffffffa0ba23b0>] ? ptlrpc_svcpt_stop_threads+0x590/0x590 [ptlrpc]^M
[ 4632.775290] [<ffffffff810b0a4f>] kthread+0xcf/0xe0^M
[ 4632.781973] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140^M
[ 4632.790485] [<ffffffff81697318>] ret_from_fork+0x58/0x90^M
[ 4632.797662] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140^M
System then crashed, crash dump is available on the node. |
| Comments |
| Comment by John Hammond [ 16/May/17 ] |
|
This assertion was added by https://review.whamcloud.com/#/c/22807/ for |
| Comment by Peter Jones [ 16/May/17 ] |
|
Lai Could you please advise on this one? Thanks Peter |
| Comment by Cliff White (Inactive) [ 17/May/17 ] |
|
Hit again on build 3581 - soak fails within one hour due to this bug. |
| Comment by Cliff White (Inactive) [ 17/May/17 ] |
|
This issue was not seen on builds 3577 or 3578, the issue is likely in build 3579 |
| Comment by Peter Jones [ 17/May/17 ] |
|
To translate that into git commits
|
| Comment by Lai Siyao [ 18/May/17 ] |
|
This assert is from https://review.whamcloud.com/22807 for |
| Comment by Gerrit Updater [ 19/May/17 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/27207 |
| Comment by Cliff White (Inactive) [ 19/May/17 ] |
|
Patch has survived for three hours, one mds_restart one oss_restart and one router_delay. We'll see how it does over the weekend, but so far good. |
| Comment by Cliff White (Inactive) [ 22/May/17 ] |
|
Patch ran > 48 hours (weekend) No hard crashes, survived multiple failovers After fixing the load issue, restarted soak, Had one client lockup, then hard server LBUG after second OST failover. Bug is |
| Comment by Cliff White (Inactive) [ 24/May/17 ] |
|
Hit this on soak with latest master. Patch should land to master. |
| Comment by Cliff White (Inactive) [ 31/May/17 ] |
|
Tested with latest version of patch ay 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed: May 31 17:11:18 soak-9 kernel: LustreError: 4177:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) LBUG May 31 17:11:18 soak-9 kernel: Pid: 4177, comm: ptlrpc_hr01_003 May 31 17:11:18 soak-9 kernel: #012Call Trace: May 31 17:11:18 soak-9 kernel: [<ffffffffa08247ee>] libcfs_call_trace+0x4e/0x60 [libcfs] May 31 17:11:19 soak-9 kernel: [<ffffffffa082487c>] lbug_with_loc+0x4c/0xb0 [libcfs] May 31 17:11:19 soak-9 kernel: [<ffffffffa0b6283b>] ldlm_lock_downgrade+0x19b/0x1d0 [ptlrpc] May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9620>] ptlrpc_handle_rs+0x3f0/0x640 [ptlrpc] May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9955>] ptlrpc_hr_main+0xe5/0x2c0 [ptlrpc] May 31 17:11:19 soak-9 kernel: [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20 May 31 17:11:19 soak-9 kernel: [<ffffffffa0bb9870>] ? ptlrpc_hr_main+0x0/0x2c0 [ptlrpc] May 31 17:11:19 soak-9 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0 May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0 May 31 17:11:19 soak-9 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90 May 31 17:11:19 soak-9 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0 May 31 17:11:19 soak-9 kernel: May 31 17:11:19 soak-9 kernel: Kernel panic - not syncing: LBUG |
| Comment by Lai Siyao [ 02/Jun/17 ] |
|
https://review.whamcloud.com/#/c/27207/ is updated, and some debug messages are added, Cliff, will you test again? |
| Comment by Cliff White (Inactive) [ 05/Jun/17 ] |
|
I have loaded soak, will start tests today |
| Comment by Cliff White (Inactive) [ 05/Jun/17 ] |
|
Hit |
| Comment by Lai Siyao [ 06/Jun/17 ] |
|
Cliff, could you post more messages of crash? I added some debug messages, which should be printed upon crash. |
| Comment by Cliff White (Inactive) [ 06/Jun/17 ] |
|
I am still waiting for the failure to repeat. |
| Comment by Cliff White (Inactive) [ 07/Jun/17 ] |
|
Here is the log from prior to the LBUG. Jun 5 16:44:31 soak-11 kernel: Lustre: Skipped 1 previous similar message Jun 5 16:44:40 soak-11 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.124@o2ib100 (stopping) Jun 5 16:44:40 soak-11 kernel: Lustre: Skipped 5 previous similar messages Jun 5 16:44:44 soak-11 kernel: LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0002_UUID namespace with 2 resources in use, (rc=-110) Jun 5 16:45:02 soak-11 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.136@o2ib100 (stopping) Jun 5 16:45:02 soak-11 kernel: Lustre: Skipped 20 previous similar messages Jun 5 16:45:07 soak-11 kernel: Lustre: 4229:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1496681101/real 0] req@ffff8807e7265700 x1569380820308784/t0(0) o38->soaked-MDT0002-osp-MDT0003@192.168.1.110@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1496681107 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 Jun 5 16:45:09 soak-11 kernel: LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0002_UUID namespace with 2 resources in use, (rc=-110) Jun 5 16:45:34 soak-11 kernel: LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0002_UUID namespace with 2 resources in use, (rc=-110) Jun 5 16:45:41 soak-11 kernel: Lustre: soaked-MDT0002: Not available for connect from 192.168.1.102@o2ib10 (stopping) Jun 5 16:45:41 soak-11 kernel: Lustre: Skipped 11 previous similar messages Jun 5 16:45:51 soak-11 kernel: LustreError: 5001:0:(lod_qos.c:208:lod_statfs_and_check()) soaked-MDT0002-mdtlov: statfs: rc = -108 Jun 5 16:45:51 soak-11 kernel: Lustre: 5001:0:(service.c:2117:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (31:69s); client may timeout. req@ffff8803f1b0ef00 x1569380804405664/t0(0) o101->bd5dff2c-b3c7-e6f9-ab0e-ec6b3d83954a@192.168.1.117@o2ib100:37/0 lens 1728/544 e 1 to 0 dl 1496681082 ref 1 fl Complete:/0/0 rc -19/-19 Jun 5 16:45:54 soak-11 kernel: Lustre: server umount soaked-MDT0002 complete Jun 5 16:45:54 soak-11 sshd[5502]: Received disconnect from 10.10.1.135: 11: disconnected by user Jun 5 16:45:54 soak-11 sshd[5502]: pam_unix(sshd:session): session closed for user root Jun 5 16:45:54 soak-11 systemd-logind: Removed session 15. Jun 5 16:45:54 soak-11 systemd: Removed slice user-0.slice. Jun 5 16:45:54 soak-11 systemd: Stopping user-0.slice. Jun 5 16:45:56 soak-11 kernel: Lustre: soaked-MDT0003: Connection restored to 192.168.1.110@o2ib10 (at 192.168.1.110@o2ib10) Jun 5 16:45:56 soak-11 kernel: Lustre: Skipped 26 previous similar messages Jun 5 16:51:12 soak-11 kernel: LustreError: 4204:0:(ldlm_lock.c:2548:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed: |
| Comment by Cliff White (Inactive) [ 07/Jun/17 ] |
|
Hit LBUG again, syslog and lustre-logs from dead server attached |
| Comment by Lai Siyao [ 08/Jun/17 ] |
|
It's strange I don't see debug messages before LBUG(), Cliff, which update from https://review.whamcloud.com/#/c/27207/ did you use in this test? The debug messages were added since Jun 3 (update 5). And if possible can you retest with the update 7? which added more debug message right before LBUG. |
| Comment by Cliff White (Inactive) [ 08/Jun/17 ] |
|
I used build 48048, which should have had the debug patches. |
| Comment by Cliff White (Inactive) [ 08/Jun/17 ] |
|
I see a patch set 6, build 48063, I see no newer builds |
| Comment by Lai Siyao [ 08/Jun/17 ] |
|
strange, can you access https://review.whamcloud.com/#/c/27207/7 ? which is patch set 7, and the build is https://build.hpdd.intel.com/job/lustre-reviews/48061/. I just checked build 48063, which is for |
| Comment by Cliff White (Inactive) [ 08/Jun/17 ] |
|
Okay, loading that build now. |
| Comment by Cliff White (Inactive) [ 12/Jun/17 ] |
|
Failed again: Jun 11 07:25:40 soak-11 kernel: perf: interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2549:ldlm_lock_downgrade()) ### weird lock mode ns: mdt-soaked-MDT0003_UUID lock: ffff8806b57f9800/0x3e1187b8839d4a55 lrc: 1/0,0 mode: --/PW res: [0x2c006071b:0x169d:0x0].0x0 bits 0x2 rrc: 2 type: IBT flags: 0x44a19401000000 nid: local remote: 0x0 expref: -99 pid: 7178 timeout: 0 lvb_type: 0
Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2550:ldlm_lock_downgrade()) ASSERTION( lock->l_granted_mode & (LCK_PW | LCK_EX) ) failed:
Jun 11 07:25:43 soak-11 kernel: LustreError: 4581:0:(ldlm_lock.c:2550:ldlm_lock_downgrade()) LBUG
Jun 11 07:25:43 soak-11 kernel: Pid: 4581, comm: ptlrpc_hr01_007
Jun 11 07:25:43 soak-11 kernel: #012Call Trace:
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0bca7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0bca87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0eed7b1>] ldlm_lock_downgrade+0x111/0x210 [ptlrpc]
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44818>] ptlrpc_handle_rs+0x5c8/0x700 [ptlrpc]
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44a35>] ptlrpc_hr_main+0xe5/0x2c0 [ptlrpc]
Jun 11 07:25:43 soak-11 kernel: [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
Jun 11 07:25:43 soak-11 kernel: [<ffffffffa0f44950>] ? ptlrpc_hr_main+0x0/0x2c0 [ptlrpc]
Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
Jun 11 07:25:43 soak-11 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
Jun 11 07:25:43 soak-11 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
Jun 11 07:25:43 soak-11 kernel:
Jun 11 07:25:43 soak-11 kernel: LustreError: dumping log to /tmp/lustre-log.1497165943.4581
|
| Comment by Cliff White (Inactive) [ 12/Jun/17 ] |
|
Attached three lustre-log, one from immediately before the LBUG, one after |
| Comment by Lai Siyao [ 13/Jun/17 ] |
|
Cliff, thanks! This helps a lot, so what happened is as below: I'll update the patch later. |
| Comment by Cliff White (Inactive) [ 16/Jun/17 ] |
|
I have run the latest patch for 48 hours on soak. No LBUGS. LFSCK does not complete, but does abort successfully |
| Comment by Gerrit Updater [ 19/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27207/ |
| Comment by Peter Jones [ 19/Jun/17 ] |
|
Landed for 2.10 |