Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Fix Version: Lustre 2.10.0
- Environment: Soak cluster version=lustre: 2.9.52_73_gb5c4f03
- Severity: 3
Description
Soak is running, performing successful OSS failover, partitions are being recovered:
2017-03-30 00:47:10,761:fsmgmt.fsmgmt:INFO Next recovery check in 15s...
2017-03-30 00:47:32,586:fsmgmt.fsmgmt:DEBUG Recovery Result Record: {'soak-4': {'soaked-OST000f': 'RECOVERING', 'soaked-OST000e': 'INACTIVE', 'soaked-OST0008': 'INACTIVE', 'soaked-OST0009': 'RECOVERING', 'soaked-OST0002': 'INACTIVE', 'soaked-OST0003': 'RECOVERING', 'soaked-OST0015': 'COMPLETE', 'soaked-OST0014': 'INACTIVE'}}
A single client hit an LBUG after recovering some of the partitions:
Mar 30 00:47:33 soak-36 kernel: Lustre: soaked-OST0009-osc-ffff88085b72c000: Connection restored to 192.168.1.104@o2ib10 (at 192.168.1.104@o2ib10)
Mar 30 00:48:31 soak-36 kernel: LustreError: 4753:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
Mar 30 00:48:31 soak-36 kernel: LustreError: 4753:0:(recover.c:157:ptlrpc_replay_next()) LBUG
Mar 30 00:48:31 soak-36 kernel: Pid: 4753, comm: ptlrpcd_rcv
Mar 30 00:48:31 soak-36 kernel: Call Trace:
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa092c7f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa092c861>] lbug_with_loc+0x41/0xb0 [libcfs]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d31c87>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d55682>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2a2ff>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2e0b5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2fabb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bb8b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bf3b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
Mar 30 00:48:31 soak-36 kernel: [<ffffffff810c4fd0>] ? default_wake_function+0x0/0x20
Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bc80>] ? ptlrpcd+0x0/0x560 [ptlrpc]
Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b064f>] kthread+0xcf/0xe0
Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b0580>] ? kthread+0x0/0xe0
Mar 30 00:48:32 soak-36 kernel: [<ffffffff81696958>] ret_from_fork+0x58/0x90
Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b0580>] ? kthread+0x0/0xe0
Mar 30 00:48:32 soak-36 kernel:
Mar 30 00:48:32 soak-36 kernel: Kernel panic - not syncing: LBUG
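For readers unfamiliar with the assertion, a small self-contained illustration of what it checks may help: in Linux-style intrusive lists, list_empty() applied to a node's own list_head is true exactly when that node is not linked into any list. The sketch below mimics this with simplified stand-in helpers (these are re-implementations for illustration, not the libcfs/kernel source); the LBUG fires because a request selected for replay is no longer linked on the import's unreplied list.

#include <stdio.h>

/* Minimal stand-ins for the kernel's intrusive list helpers, for
 * illustration only. A node unlinked with list_del_init() points at
 * itself, so list_empty(&node) is true -- exactly the state the
 * ASSERTION in ptlrpc_replay_next() forbids for a request that is
 * about to be replayed. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next; n->prev = head;
	head->next->prev = n; head->next = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
	INIT_LIST_HEAD(n);
}

static int list_empty(const struct list_head *h) { return h->next == h; }

int main(void)
{
	/* Stand-ins for the import's unreplied list and the request's
	 * rq_cli.cr_unreplied_list linkage seen in the trace above. */
	struct list_head unreplied_list, req_link;

	INIT_LIST_HEAD(&unreplied_list);
	INIT_LIST_HEAD(&req_link);

	list_add(&req_link, &unreplied_list);
	printf("linked:   list_empty = %d\n", list_empty(&req_link)); /* 0 */

	/* The suspected race: the request is unlinked (e.g. once its
	 * reply has been processed) and is then chosen for replay anyway. */
	list_del_init(&req_link);
	printf("unlinked: list_empty = %d\n", list_empty(&req_link)); /* 1 */
	return 0;
}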
vmcore-dmesg.txt is attached. A full crash dump is available on soak-36.
Attachments
- soak-10.2.log.gz (9.34 MB)
- soak-10.debug.log.gz (6.50 MB)
- soak-10.dump.txt.gz (3.78 MB)
- soak-10.log.gz (188 kB)
- soak-10.lustre.log.txt.gz (10.34 MB)
- soak-10.stack.txt (1.23 MB)
- soak-10.stacks.txt (1.07 MB)
- soak-11.debug.log.gz (15.27 MB)
- soak-11.dump.txt.gz (5.81 MB)
- soak-11.lustre.log.txt.gz (11.37 MB)
- soak-11.stacks.txt (1.25 MB)
- soak-11.stacks.txt.gz (109 kB)
- soak-16.syslog.txt (4.15 MB)
- soak-29.console.txt.gz (10.38 MB)
- soak-8.debug.log.gz (6.57 MB)
- soak-8.dump.txt.gz (5.85 MB)
- soak-8.lustre.log.txt.gz (11.09 MB)
- soak-8.stacks.txt (1.36 MB)
- soak-8.stacks.txt.gz (116 kB)
- soak-9.debug.log.gz (15.41 MB)
- soak-9.dump.txt.gz (4.25 MB)
- soak-9.lustre.log.txt.gz (10.93 MB)
- soak-9.stacks.txt (1.35 MB)
- soak-9.stacks.txt.gz (114 kB)
- vmcore-dmesg.txt (135 kB)
Activity
Peter, I think the fix from https://review.whamcloud.com/#/c/27920/ should be enough.
Niu
The test also included a reconfiguration on soak, which it was hoped might make this race less likely, since the disk failure had coincided with this race being hit far more frequently (see DCO-7264). Given that this has not worked, would it be enough to land just this one fix, or did earlier testing suggest further fixes would be needed too?
Peter
Cliff, the fix hasn't been landed yet, so the crash is expected.
Now running lustre-b2_10 build #3:
Jul 11 01:13:20 soak-17 kernel: LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
Jul 11 01:13:20 soak-17 kernel: LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) LBUG
Jul 11 01:13:20 soak-17 kernel: Pid: 2937, comm: ptlrpcd_rcv
Jul 11 01:13:20 soak-17 kernel: Call Trace:
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa08447ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa084487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba03b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
Jul 11 01:13:20 soak-17 kernel: [<ffffffff81696a58>] ret_from_fork+0x58/0x90
Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
Jul 11 01:13:20 soak-17 kernel:
Jul 11 01:13:20 soak-17 kernel: Kernel panic - not syncing: LBUG
Crash dump is available on soak-17
[ 3947.210017] perf: interrupt took too long (3225 > 3192), lowering kernel.perf_event_max_sample_rate to 62000
[ 4161.246530] Lustre: 2953:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1499735243/real 0] req@ffff880397b67200 x1572582391402544/t0(0) o101->soaked-OST0008-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 328/400 e 0 to 1 dl 1499735256 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 4161.282464] Lustre: soaked-OST0008-osc-ffff8808276d0000: Connection to soaked-OST0008 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 4162.030527] Lustre: soaked-OST000e-osc-ffff8808276d0000: Connection to soaked-OST000e (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 4169.926508] Lustre: soaked-OST0014-osc-ffff8808276d0000: Connection to soaked-OST0014 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 4194.948696] Lustre: 2944:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735290/real 1499735290] req@ffff880334b43c00 x1572582391418656/t0(0) o400->soaked-OST0002-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 224/224 e 0 to 1 dl 1499735303 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 4194.986416] Lustre: 2944:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[ 4194.998649] Lustre: soaked-OST0002-osc-ffff8808276d0000: Connection to soaked-OST0002 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 4245.020397] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735340/real 1499735340] req@ffff880334b42100 x1572582391449120/t0(0) o8->soaked-OST0002-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735351 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 4245.057630] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[ 4320.019906] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735415/real 1499735415] req@ffff880335d55a00 x1572582391496080/t0(0) o8->soaked-OST0008-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735436 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 4320.057297] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[ 4451.018399] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1499735515/real 0] req@ffff880335d53c00 x1572582391561888/t0(0) o8->soaked-OST0014-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735546 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[ 4451.054303] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[ 4504.566715] LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
[ 4504.587823] LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) LBUG
[ 4504.598875] Pid: 2937, comm: ptlrpcd_rcv
[ 4504.606014] Call Trace:
[ 4504.615457] [<ffffffffa08447ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 4504.624991] [<ffffffffa084487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 4504.634021] [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[ 4504.643628] [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[ 4504.654851] [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[ 4504.664682] [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[ 4504.674633] [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 4504.683405] [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 4504.692018] [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[ 4504.699993] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
[ 4504.708298] [<ffffffffa0ba03b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
[ 4504.716207] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 4504.723002] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 4504.729868] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[ 4504.737213] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 4504.744039]
[ 4504.746963] Kernel panic - not syncing: LBUG
[ 4504.752969] CPU: 14 PID: 2937 Comm: ptlrpcd_rcv Tainted: G OE ------------ 3.10.0-514.10.2.el7.x86_64 #1
[ 4504.766051] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 4504.779814] ffffffffa0862e8b 0000000057e50b83 ffff88082ab1fb30 ffffffff816864ef
[ 4504.789340] ffff88082ab1fbb0 ffffffff8167f8f6 ffffffff00000008 ffff88082ab1fbc0
[ 4504.798846] ffff88082ab1fb60 0000000057e50b83 0000000057e50b83 ffff88082d98f838
[ 4504.808330] Call Trace:
[ 4504.812212] [<ffffffff816864ef>] dump_stack+0x19/0x1b
[ 4504.819084] [<ffffffff8167f8f6>] panic+0xe3/0x1f2
[ 4504.825543] [<ffffffffa0844894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 4504.833554] [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[ 4504.842225] [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[ 4504.852623] [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[ 4504.861745] [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[ 4504.871040] [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 4504.879243] [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 4504.887319] [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
There is a crash dump available on soak-31
[ 4874.102156] Lustre: 3604:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[ 4928.198801] Lustre: soaked-OST0002-osc-ffff88105a2ec000: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
[ 4929.712442] Lustre: soaked-OST0014-osc-ffff88105a2ec000: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
[ 4929.724803] Lustre: Skipped 1 previous similar message
[ 4931.461826] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
[ 4931.475358] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) LBUG
[ 4931.482951] Pid: 3604, comm: ptlrpcd_rcv
[ 4931.487341] Call Trace:
[ 4931.491756] [<ffffffffa07807ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 4931.499055] [<ffffffffa078087c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 4931.506024] [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[ 4931.513639] [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[ 4931.522999] [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[ 4931.531098] [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[ 4931.539393] [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 4931.546623] [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 4931.553754] [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[ 4931.560279] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
[ 4931.567215] [<ffffffffa0b433b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
[ 4931.573730] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 4931.579187] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 4931.584740] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[ 4931.590778] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 4931.596326]
[ 4931.597997] Kernel panic - not syncing: LBUG
[ 4931.602769] CPU: 5 PID: 3604 Comm: ptlrpcd_rcv Tainted: G IOE ------------ 3.10.0-514.10.2.el7.x86_64 #1
[ 4931.614514] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[ 4931.626160] ffffffffa079ee8b 00000000f7bba2b7 ffff880841747b30 ffffffff816864ef
[ 4931.634457] ffff880841747bb0 ffffffff8167f8f6 ffffffff00000008 ffff880841747bc0
[ 4931.642744] ffff880841747b60 00000000f7bba2b7 00000000f7bba2b7 ffff88085ed4f838
[ 4931.651035] Call Trace:
[ 4931.653764] [<ffffffff816864ef>] dump_stack+0x19/0x1b
[ 4931.659507] [<ffffffff8167f8f6>] panic+0xe3/0x1f2
[ 4931.664856] [<ffffffffa0780894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 4931.671781] [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
[ 4931.679388] [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
[ 4931.688742] [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
[ 4931.696832] [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
[ 4931.705118] [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
[ 4931.712339] [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
[ 4931.719465] [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
[ 4931.725981] [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
[ 4931.732224] [<ffffffffa0b433b0>] ? ptlrpcd_check+0x5d0/0x5d0 [ptlrpc]
[ 4931.739517] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 4931.744959] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 4931.752243] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[ 4931.758267] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
Hit this again on lustre-master, build 3606.
This is found in soak-10.2.log:
soak-10.2.log:00010000:02000400:9.0:1499351737.374320:0:3602:0:(ldlm_lib.c:2080:target_recovery_overseer()) soaked-MDT0002: recovery is timed out, evict stale exports
soak-10.2.log:00010000:00080000:9.0:1499351737.427959:0:3602:0:(ldlm_lib.c:2044:target_recovery_overseer()) soaked-MDT0002: there are still update replay (0x1120005fc96)in the queue.
soak-10.2.log:00010000:00000040:9.0:1499351737.427965:0:3602:0:(ldlm_lib.c:2504:target_recovery_thread()) 1: request replay stage - 1 clients from t1176821431446
soak-10.2.log:00010000:00080000:9.0:1499351737.427983:0:3602:0:(ldlm_lib.c:2044:target_recovery_overseer()) soaked-MDT0002: there are still update replay (0x1120005fc96)in the queue.
It looks like the DNE recovery did not finish; this may cause the system to hang or other failures. I'll check how this happened.
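A minimal sketch of the gating logic suggested by the overseer messages above (all names here are hypothetical; this is not the ldlm_lib.c source): evicting stale exports on timeout does not drain the update-replay queue, so if those DNE updates never complete, recovery cannot finish and clients appear hung.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the behaviour seen in soak-10.2.log: on
 * timeout the overseer evicts stale exports, but recovery still
 * cannot advance while DNE update-replay requests remain queued. */
struct recovery_state {
	bool timed_out;       /* recovery window expired            */
	int  queued_updates;  /* pending DNE update-replay requests */
};

static bool recovery_can_complete(const struct recovery_state *rs)
{
	if (rs->timed_out && rs->queued_updates > 0) {
		/* matches "recovery is timed out, evict stale exports"
		 * followed by the repeating "there are still update
		 * replay ... in the queue" messages in the log */
		printf("evicted stale exports; %d update replays still queued\n",
		       rs->queued_updates);
	}
	return rs->queued_updates == 0;
}

int main(void)
{
	struct recovery_state rs = { .timed_out = true, .queued_updates = 1 };

	if (!recovery_can_complete(&rs))
		printf("recovery cannot finish -> clients appear hung\n");
	return 0;
}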
Thank you, Cliff. It seems there is something wrong with the DNE recovery. I see an OI scrub thread on soak-8 hung waiting for I/O completion, but I'm not sure whether it's related to the recovery failure:
[19587.652446] OI_scrub D ffff666666663838 0 28932 2 0x00000080
[19587.660994] ffff8807d14c7890 0000000000000046 ffff88082c3f9f60 ffff8807d14c7fd8
[19587.669968] ffff8803d0049898 ffff8807d14c7fd8 ffff88082c3f9f60 ffff88042e056c40
[19587.679285] ffff88042c8a1f78 7fffffffffffffff ffff88039a765000 ffffffff8168a8a0
[19587.688282] Call Trace:
[19587.691650] [<ffffffff8168a8a0>] ? bit_wait+0x50/0x50
[19587.698046] [<ffffffff8168c849>] schedule+0x29/0x70
[19587.704230] [<ffffffff8168a289>] schedule_timeout+0x239/0x2c0
[19587.711411] [<ffffffffa0272f19>] ? dm_old_request_fn+0xc9/0x1f0 [dm_mod]
[19587.719649] [<ffffffff810eb0dc>] ? ktime_get_ts64+0x4c/0xf0
[19587.726616] [<ffffffff8168a8a0>] ? bit_wait+0x50/0x50
[19587.732999] [<ffffffff8168bdee>] io_schedule_timeout+0xae/0x130
[19587.740348] [<ffffffff813252b1>] ? vsnprintf+0x441/0x6a0
[19587.747378] [<ffffffff8168be88>] ? io_schedule+0x18/0x20
[19587.754050] [<ffffffffa1167b3c>] ? ldiskfs_map_blocks+0x5c/0x5e0 [ldiskfs]
[19587.762459] [<ffffffff8168a3d5>] ? __wait_on_bit+0x65/0x90
[19587.769313] [<ffffffff8168a8a0>] ? bit_wait+0x50/0x50
[19587.775671] [<ffffffff8168a481>] ? out_of_line_wait_on_bit+0x81/0xb0
[19587.783486] [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
[19587.790716] [<ffffffff8123341a>] ? __wait_on_buffer+0x2a/0x30
[19587.797859] [<ffffffffa116833c>] ? ldiskfs_bread+0x7c/0xc0 [ldiskfs]
[19587.806016] [<ffffffffa122f0a0>] ? iam_node_read+0x60/0xe0 [osd_ldiskfs]
[19587.814213] [<ffffffffa122a6c7>] ? fid_is_on_ost+0x1c7/0x420 [osd_ldiskfs]
[19587.822600] [<ffffffffa122a9c4>] ? osd_oi_lookup+0xa4/0x160 [osd_ldiskfs]
[19587.830894] [<ffffffffa0bc43b3>] ? libcfs_log_goto+0x23/0x30 [libcfs]
[19587.838792] [<ffffffffa123de09>] ? osd_scrub_get_fid+0x1f9/0x290 [osd_ldiskfs]
[19587.847565] [<ffffffffa1242a62>] ? osd_scrub_exec+0x62/0x5b0 [osd_ldiskfs]
[19587.855967] [<ffffffffa11631fe>] ? ldiskfs_read_inode_bitmap+0x1fe/0x5b0 [ldiskfs]
[19587.865156] [<ffffffffa1244225>] ? osd_inode_iteration+0x475/0xc70 [osd_ldiskfs]
[19587.874494] [<ffffffffa1242a00>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
[19587.883303] [<ffffffffa123eab0>] ? osd_preload_next+0xa0/0xa0 [osd_ldiskfs]
[19587.891811] [<ffffffffa1245360>] ? osd_scrub_main+0x940/0xf00 [osd_ldiskfs]
[19587.900318] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
[19587.907177] [<ffffffffa1244a20>] ? osd_inode_iteration+0xc70/0xc70 [osd_ldiskfs]
[19587.916170] [<ffffffff810b0a4f>] ? kthread+0xcf/0xe0
[19587.922453] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[19587.930372] [<ffffffff81697798>] ? ret_from_fork+0x58/0x90
[19587.937217] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Lai, could you take a look at the logs and stack traces to see if there is anything suspicious? Thank you.
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27920/
Subject: LU-9274 ptlrpc: add replay request into unreplied list
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7d29d3167684b13a612c8c1bff860019a218115c
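Going by the patch subject, the idea of the fix is to make sure a request being replayed is linked back onto the import's unreplied list, so the assertion in ptlrpc_replay_next() holds again. Below is a hedged sketch of that idea; the structures and the prepare_for_replay() helper are simplified stand-ins named after fields in the traces above (imp_unreplied_list is assumed), not the verbatim commit.

#include <stdio.h>

/* Simplified stand-in types and list helpers, for illustration of the
 * fix's direction only -- not the actual change in commit 7d29d316. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static int  list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev; n->next = head;
	head->prev->next = n; head->prev = n;
}

struct obd_import     { struct list_head imp_unreplied_list; };
struct ptlrpc_cli_req { struct list_head cr_unreplied_list; };
struct ptlrpc_request { struct ptlrpc_cli_req rq_cli; };

/* Hypothetical helper: re-link the request before replay so the
 * ASSERTION(!list_empty(...)) in ptlrpc_replay_next() cannot fire. */
static void prepare_for_replay(struct obd_import *imp,
			       struct ptlrpc_request *req)
{
	if (list_empty(&req->rq_cli.cr_unreplied_list))
		list_add_tail(&req->rq_cli.cr_unreplied_list,
			      &imp->imp_unreplied_list);
}

int main(void)
{
	struct obd_import imp;
	struct ptlrpc_request req;

	INIT_LIST_HEAD(&imp.imp_unreplied_list);
	INIT_LIST_HEAD(&req.rq_cli.cr_unreplied_list); /* unlinked state */

	prepare_for_replay(&imp, &req);
	printf("on unreplied list: %d\n",
	       !list_empty(&req.rq_cli.cr_unreplied_list)); /* 1 */
	return 0;
}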