Lustre - LU-9274

LBUG: (recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
    • Fix Version/s: Lustre 2.10.0
    • Environment: Soak cluster version=lustre: 2.9.52_73_gb5c4f03
    • Severity: 3

    Description

      Soak testing is running and performing successful OSS failovers; OST partitions are being recovered:

      2017-03-30 00:47:10,761:fsmgmt.fsmgmt:INFO     Next recovery check in 15s...
      2017-03-30 00:47:32,586:fsmgmt.fsmgmt:DEBUG    Recovery Result Record: {'soak-4': {'soaked-OST000f': 'RECOVERING', 'soaked-OST000e': 'INACTIVE', 'soaked-OST0008': 'INACTIVE', 'soaked-OST0009': 'RECOVERING', 'soaked-OST0002': 'INACTIVE', 'soaked-OST0003': 'RECOVERING', 'soaked-OST0015': 'COMPLETE', 'soaked-OST0014': 'INACTIVE'}}
      

      A single client hit this LBUG after recovering some partitions:

      Mar 30 00:47:33 soak-36 kernel: Lustre: soaked-OST0009-osc-ffff88085b72c000: Connection restored to 192.168.1.104@o2ib10 (at 192.168.1.104@o2ib10)
      .....
      Mar 30 00:48:31 soak-36 kernel: LustreError: 4753:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
      Mar 30 00:48:31 soak-36 kernel: LustreError: 4753:0:(recover.c:157:ptlrpc_replay_next()) LBUG
      Mar 30 00:48:31 soak-36 kernel: Pid: 4753, comm: ptlrpcd_rcv
      Mar 30 00:48:31 soak-36 kernel: Call Trace:
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa092c7f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa092c861>] lbug_with_loc+0x41/0xb0 [libcfs]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d31c87>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d55682>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2a2ff>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2e0b5>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d2fabb>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bb8b>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bf3b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
      Mar 30 00:48:31 soak-36 kernel: [<ffffffff810c4fd0>] ? default_wake_function+0x0/0x20
      Mar 30 00:48:31 soak-36 kernel: [<ffffffffa0d5bc80>] ? ptlrpcd+0x0/0x560 [ptlrpc]
      Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b064f>] kthread+0xcf/0xe0
      Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b0580>] ? kthread+0x0/0xe0
      Mar 30 00:48:32 soak-36 kernel: [<ffffffff81696958>] ret_from_fork+0x58/0x90
      Mar 30 00:48:32 soak-36 kernel: [<ffffffff810b0580>] ? kthread+0x0/0xe0
      Mar 30 00:48:32 soak-36 kernel:
      Mar 30 00:48:32 soak-36 kernel: Kernel panic - not syncing: LBUG
      

      vmcore-dmesg.txt is attached. The full crash dump is available on soak-36.
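
      For illustration, the following is a minimal, self-contained C sketch of the list-membership invariant that the assertion at recover.c:157 appears to enforce: a request selected for replay must still be linked on the import's list of requests whose replies have not yet been confirmed. This is not the actual Lustre source; apart from cr_unreplied_list and ptlrpc_replay_next(), which appear in the console log above, every name in the sketch (the list helpers, imp_unreplied_list, req_sketch, replay_next()) is an assumption made for the example.

      /*
       * Illustrative sketch only, NOT the Lustre implementation.  It models
       * the invariant asserted at recover.c:157: a request chosen for replay
       * must still be linked on the import's unreplied list.
       */
      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct list_head {
              struct list_head *next, *prev;
      };

      static void list_init(struct list_head *h)
      {
              h->next = h->prev = h;
      }

      static void list_add_tail(struct list_head *item, struct list_head *head)
      {
              item->prev = head->prev;
              item->next = head;
              head->prev->next = item;
              head->prev = item;
      }

      static bool list_empty(const struct list_head *h)
      {
              return h->next == h;
      }

      /* Stand-in for the client-side request state relevant to the assertion. */
      struct req_sketch {
              struct list_head cr_unreplied_list; /* linkage on the import's unreplied list */
      };

      /* Stand-in for ptlrpc_replay_next() picking the next request to resend. */
      static void replay_next(struct req_sketch *req)
      {
              /*
               * This is the check that fired in the crash above: the request
               * picked for replay had already been unlinked from the unreplied
               * list, so list_empty() was true and the client hit the LBUG.
               */
              assert(!list_empty(&req->cr_unreplied_list));
              printf("replaying a request that is still tracked as unreplied\n");
      }

      int main(void)
      {
              struct list_head imp_unreplied_list; /* per-import list (assumed name) */
              struct req_sketch req;

              list_init(&imp_unreplied_list);
              list_init(&req.cr_unreplied_list);

              /* Healthy case: the request is linked before replay, so the assertion holds. */
              list_add_tail(&req.cr_unreplied_list, &imp_unreplied_list);
              replay_next(&req);
              return 0;
      }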

      Attachments

        1. soak-10.2.log.gz
          9.34 MB
        2. soak-10.debug.log.gz
          6.50 MB
        3. soak-10.dump.txt.gz
          3.78 MB
        4. soak-10.log.gz
          188 kB
        5. soak-10.lustre.log.txt.gz
          10.34 MB
        6. soak-10.stack.txt
          1.23 MB
        7. soak-10.stacks.txt
          1.07 MB
        8. soak-10.stacks.txt.gz
          80 kB
        9. soak-11.debug.log.gz
          15.27 MB
        10. soak-11.dump.txt.gz
          5.81 MB
        11. soak-11.lustre.log.txt.gz
          11.37 MB
        12. soak-11.stacks.txt
          1.25 MB
        13. soak-11.stacks.txt.gz
          109 kB
        14. soak-16.lustre-log.1498910471.2905.gz
          70 kB
        15. soak-16.syslog.txt
          4.15 MB
        16. soak-29.console.txt.gz
          10.38 MB
        17. soak-8.debug.log.gz
          6.57 MB
        18. soak-8.dump.txt.gz
          5.85 MB
        19. soak-8.lustre.log.txt.gz
          11.09 MB
        20. soak-8.stacks.txt
          1.36 MB
        21. soak-8.stacks.txt.gz
          116 kB
        22. soak-9.debug.log.gz
          15.41 MB
        23. soak-9.dump.txt.gz
          4.25 MB
        24. soak-9.lustre.log.txt.gz
          10.93 MB
        25. soak-9.stacks.txt
          1.35 MB
        26. soak-9.stacks.txt.gz
          114 kB
        27. vmcore-dmesg.txt
          135 kB

        Issue Links

          Activity

            [LU-9274] LBUG: (recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:

            Cliff, the fix hasn't been landed yet, so the crash is expected.

            niu Niu Yawei (Inactive) added a comment

            Attached Lustre dumps from all MDS nodes.

            cliffw Cliff White (Inactive) added a comment

            Now running lustre-b2_10 build #3

            Jul 11 01:13:20 soak-17 kernel: LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
            Jul 11 01:13:20 soak-17 kernel: LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) LBUG
            Jul 11 01:13:20 soak-17 kernel: Pid: 2937, comm: ptlrpcd_rcv
            Jul 11 01:13:20 soak-17 kernel: Call Trace:
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa08447ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa084487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
            Jul 11 01:13:20 soak-17 kernel: [<ffffffffa0ba03b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
            Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b06ff>] kthread+0xcf/0xe0
            Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
            Jul 11 01:13:20 soak-17 kernel: [<ffffffff81696a58>] ret_from_fork+0x58/0x90
            Jul 11 01:13:20 soak-17 kernel: [<ffffffff810b0630>] ? kthread+0x0/0xe0
            Jul 11 01:13:20 soak-17 kernel:
            Jul 11 01:13:20 soak-17 kernel: Kernel panic - not syncing: LBUG
            

            Crash dump is available on soak-17

            [ 3947.210017] perf: interrupt took too long (3225 > 3192), lowering kernel.perf_event_max_sample_rate to 62000
            [ 4161.246530] Lustre: 2953:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1499735243/real 0]  req@ffff880397b67200 x1572582391402544/t0(0) o101->soaked-OST0008-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 328/400 e 0 to 1 dl 1499735256 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [ 4161.282464] Lustre: soaked-OST0008-osc-ffff8808276d0000: Connection to soaked-OST0008 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [ 4162.030527] Lustre: soaked-OST000e-osc-ffff8808276d0000: Connection to soaked-OST000e (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [ 4169.926508] Lustre: soaked-OST0014-osc-ffff8808276d0000: Connection to soaked-OST0014 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [ 4194.948696] Lustre: 2944:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735290/real 1499735290]  req@ffff880334b43c00 x1572582391418656/t0(0) o400->soaked-OST0002-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 224/224 e 0 to 1 dl 1499735303 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
            [ 4194.986416] Lustre: 2944:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
            [ 4194.998649] Lustre: soaked-OST0002-osc-ffff8808276d0000: Connection to soaked-OST0002 (at 192.168.1.104@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [ 4245.020397] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735340/real 1499735340]  req@ffff880334b42100 x1572582391449120/t0(0) o8->soaked-OST0002-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735351 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
            [ 4245.057630] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            [ 4320.019906] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1499735415/real 1499735415]  req@ffff880335d55a00 x1572582391496080/t0(0) o8->soaked-OST0008-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735436 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
            [ 4320.057297] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            [ 4451.018399] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1499735515/real 0]  req@ffff880335d53c00 x1572582391561888/t0(0) o8->soaked-OST0014-osc-ffff8808276d0000@192.168.1.104@o2ib:28/4 lens 520/544 e 0 to 1 dl 1499735546 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            [ 4451.054303] Lustre: 2937:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
            [ 4504.566715] LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
            [ 4504.587823] LustreError: 2937:0:(recover.c:157:ptlrpc_replay_next()) LBUG
            [ 4504.598875] Pid: 2937, comm: ptlrpcd_rcv
            [ 4504.606014]
            Call Trace:
            [ 4504.615457]  [<ffffffffa08447ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [ 4504.624991]  [<ffffffffa084487c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [ 4504.634021]  [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            [ 4504.643628]  [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            [ 4504.654851]  [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            [ 4504.664682]  [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            [ 4504.674633]  [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [ 4504.683405]  [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            [ 4504.692018]  [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            [ 4504.699993]  [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
            [ 4504.708298]  [<ffffffffa0ba03b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
            [ 4504.716207]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
            [ 4504.723002]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
            [ 4504.729868]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90
            [ 4504.737213]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
            [ 4504.744039]
            [ 4504.746963] Kernel panic - not syncing: LBUG
            [ 4504.752969] CPU: 14 PID: 2937 Comm: ptlrpcd_rcv Tainted: G           OE  ------------   3.10.0-514.10.2.el7.x86_64 #1
            [ 4504.766051] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
            [ 4504.779814]  ffffffffa0862e8b 0000000057e50b83 ffff88082ab1fb30 ffffffff816864ef
            [ 4504.789340]  ffff88082ab1fbb0 ffffffff8167f8f6 ffffffff00000008 ffff88082ab1fbc0
            [ 4504.798846]  ffff88082ab1fb60 0000000057e50b83 0000000057e50b83 ffff88082d98f838
            [ 4504.808330] Call Trace:
            [ 4504.812212]  [<ffffffff816864ef>] dump_stack+0x19/0x1b
            [ 4504.819084]  [<ffffffff8167f8f6>] panic+0xe3/0x1f2
            [ 4504.825543]  [<ffffffffa0844894>] lbug_with_loc+0x64/0xb0 [libcfs]
            [ 4504.833554]  [<ffffffffa0b76817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            [ 4504.842225]  [<ffffffffa0b99de2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            [ 4504.852623]  [<ffffffffa0b6ef1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            [ 4504.861745]  [<ffffffffa0b72c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            [ 4504.871040]  [<ffffffffa0b7464b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [ 4504.879243]  [<ffffffffa0ba02bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            [ 4504.887319]  [<ffffffffa0ba066b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            
            cliffw Cliff White (Inactive) added a comment

            There is a crash dump available on soak-31

            [ 4874.102156] Lustre: 3604:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
            [ 4928.198801] Lustre: soaked-OST0002-osc-ffff88105a2ec000: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
            [ 4929.712442] Lustre: soaked-OST0014-osc-ffff88105a2ec000: Connection restored to 192.168.1.104@o2ib (at 192.168.1.104@o2ib)
            [ 4929.724803] Lustre: Skipped 1 previous similar message
            [ 4931.461826] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
            [ 4931.475358] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) LBUG
            [ 4931.482951] Pid: 3604, comm: ptlrpcd_rcv
            [ 4931.487341]
            Call Trace:
            [ 4931.491756]  [<ffffffffa07807ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [ 4931.499055]  [<ffffffffa078087c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [ 4931.506024]  [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            [ 4931.513639]  [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            [ 4931.522999]  [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            [ 4931.531098]  [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            [ 4931.539393]  [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [ 4931.546623]  [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            [ 4931.553754]  [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            [ 4931.560279]  [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
            [ 4931.567215]  [<ffffffffa0b433b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
            [ 4931.573730]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
            [ 4931.579187]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
            [ 4931.584740]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90
            [ 4931.590778]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
            [ 4931.596326]
            [ 4931.597997] Kernel panic - not syncing: LBUG
            [ 4931.602769] CPU: 5 PID: 3604 Comm: ptlrpcd_rcv Tainted: G          IOE  ------------   3.10.0-514.10.2.el7.x86_64 #1
            [ 4931.614514] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
            [ 4931.626160]  ffffffffa079ee8b 00000000f7bba2b7 ffff880841747b30 ffffffff816864ef
            [ 4931.634457]  ffff880841747bb0 ffffffff8167f8f6 ffffffff00000008 ffff880841747bc0
            [ 4931.642744]  ffff880841747b60 00000000f7bba2b7 00000000f7bba2b7 ffff88085ed4f838
            [ 4931.651035] Call Trace:
            [ 4931.653764]  [<ffffffff816864ef>] dump_stack+0x19/0x1b
            [ 4931.659507]  [<ffffffff8167f8f6>] panic+0xe3/0x1f2
            [ 4931.664856]  [<ffffffffa0780894>] lbug_with_loc+0x64/0xb0 [libcfs]
            [ 4931.671781]  [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            [ 4931.679388]  [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            [ 4931.688742]  [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            [ 4931.696832]  [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            [ 4931.705118]  [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [ 4931.712339]  [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            [ 4931.719465]  [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            [ 4931.725981]  [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
            [ 4931.732224]  [<ffffffffa0b433b0>] ? ptlrpcd_check+0x5d0/0x5d0 [ptlrpc]
            [ 4931.739517]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
            [ 4931.744959]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
            [ 4931.752243]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90
            [ 4931.758267]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
            
            cliffw Cliff White (Inactive) added a comment

            Hit this again on lustre-master, build 3606

            [ 4931.461826] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) ASSERTION( !list_empty(&req->rq_cli.cr_unreplied_list) ) failed:
            [ 4931.475358] LustreError: 3604:0:(recover.c:157:ptlrpc_replay_next()) LBUG
            [ 4931.482951] Pid: 3604, comm: ptlrpcd_rcv
            [ 4931.487341]
            [ 4931.487341] Call Trace:
            [ 4931.491756]  [<ffffffffa07807ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [ 4931.499055]  [<ffffffffa078087c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [ 4931.506024]  [<ffffffffa0b19817>] ptlrpc_replay_next+0x447/0x450 [ptlrpc]
            [ 4931.513639]  [<ffffffffa0b3cde2>] ptlrpc_import_recovery_state_machine+0x1d2/0xbc0 [ptlrpc]
            [ 4931.522999]  [<ffffffffa0b11f1f>] ptlrpc_replay_interpret+0x17f/0x7d0 [ptlrpc]
            [ 4931.531098]  [<ffffffffa0b15c7c>] ptlrpc_check_set.part.23+0x42c/0x1da0 [ptlrpc]
            [ 4931.539393]  [<ffffffffa0b1764b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [ 4931.546623]  [<ffffffffa0b432bb>] ptlrpcd_check+0x4db/0x5d0 [ptlrpc]
            [ 4931.553754]  [<ffffffffa0b4366b>] ptlrpcd+0x2bb/0x560 [ptlrpc]
            [ 4931.560279]  [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
            [ 4931.567215]  [<ffffffffa0b433b0>] ? ptlrpcd+0x0/0x560 [ptlrpc]
            [ 4931.573730]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
            [ 4931.579187]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
            [ 4931.584740]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90
            [ 4931.590778]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
            [ 4931.596326]
            [ 4931.597997] Kernel panic - not syncing: LBUG
            
            cliffw Cliff White (Inactive) added a comment

            I have created LU-9748 for the DNE recovery issue

            cliffw Cliff White (Inactive) added a comment
            laisiyao Lai Siyao added a comment:

            This is found in soak-10.2.log:

            soak-10.2.log:00010000:02000400:9.0:1499351737.374320:0:3602:0:(ldlm_lib.c:2080:target_recovery_overseer()) soaked-MDT0002: recovery is timed out, evict stale exports
            soak-10.2.log:00010000:00080000:9.0:1499351737.427959:0:3602:0:(ldlm_lib.c:2044:target_recovery_overseer()) soaked-MDT0002: there are still update replay (0x1120005fc96)in the queue.
            soak-10.2.log:00010000:00000040:9.0:1499351737.427965:0:3602:0:(ldlm_lib.c:2504:target_recovery_thread()) 1: request replay stage - 1 clients from t1176821431446
            soak-10.2.log:00010000:00080000:9.0:1499351737.427983:0:3602:0:(ldlm_lib.c:2044:target_recovery_overseer()) soaked-MDT0002: there are still update replay (0x1120005fc96)in the queue.
            

            It looks like the DNE recovery did not finish; this may cause the system to hang or other failures. I'll check how this happened.
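
            As an illustrative aside, and not the actual target_recovery_overseer() code, the messages above can be modelled as a loop that evicts stale exports once the recovery timer expires but cannot leave the request replay stage while DNE update-replay requests remain queued, which is why the same "still update replay in the queue" line keeps repeating. All names in the sketch below are assumptions:

            /*
             * Rough behavioural sketch, NOT the Lustre target_recovery_overseer()
             * implementation.  It only illustrates why recovery can look stuck:
             * after the timeout, stale exports are evicted, but the replay stage
             * cannot complete while update-replay requests are still queued.
             */
            #include <stdbool.h>
            #include <stdio.h>

            struct recovery_state {
                    int  queued_update_replays;  /* cross-MDT update replays waiting */
                    bool timed_out;              /* recovery timer already expired   */
            };

            static void recovery_timeout(struct recovery_state *rs)
            {
                    printf("recovery is timed out, evict stale exports\n");
                    rs->timed_out = true;
            }

            /* Returns true once the request replay stage may complete. */
            static bool replay_stage_done(const struct recovery_state *rs)
            {
                    if (rs->queued_update_replays > 0) {
                            printf("there are still update replay requests in the queue\n");
                            return false;  /* stay in the replay stage and keep waiting */
                    }
                    return true;
            }

            int main(void)
            {
                    struct recovery_state rs = { .queued_update_replays = 1, .timed_out = false };
                    int pass;

                    recovery_timeout(&rs);
                    /* With an update replay still queued, each pass re-logs and waits again. */
                    for (pass = 0; pass < 3 && !replay_stage_done(&rs); pass++)
                            ;
                    return 0;
            }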


            Thank you, Cliff. It seems there is something wrong with the DNE recovery. I see an OI scrub thread hung on soak-8 waiting for I/O completion, but I'm not sure whether it's related to the recovery failure:

            [19587.652446] OI_scrub        D ffff666666663838     0 28932      2 0x00000080
            [19587.660994]  ffff8807d14c7890 0000000000000046 ffff88082c3f9f60 ffff8807d14c7fd8
            [19587.669968]  ffff8803d0049898 ffff8807d14c7fd8 ffff88082c3f9f60 ffff88042e056c40
            [19587.679285]  ffff88042c8a1f78 7fffffffffffffff ffff88039a765000 ffffffff8168a8a0
            [19587.688282] Call Trace:
            [19587.691650]  [<ffffffff8168a8a0>] ? bit_wait+0x50/0x50
            [19587.698046]  [<ffffffff8168c849>] schedule+0x29/0x70
            [19587.704230]  [<ffffffff8168a289>] schedule_timeout+0x239/0x2c0
            [19587.711411]  [<ffffffffa0272f19>] ? dm_old_request_fn+0xc9/0x1f0 [dm_mod]
            [19587.719649]  [<ffffffff810eb0dc>] ? ktime_get_ts64+0x4c/0xf0
            [19587.726616]  [<ffffffff8168a8a0>] ? bit_wait+0x50/0x50
            [19587.732999]  [<ffffffff8168bdee>] io_schedule_timeout+0xae/0x130
            [19587.740348]  [<ffffffff813252b1>] ? vsnprintf+0x441/0x6a0
            [19587.747378]  [<ffffffff8168be88>] ? io_schedule+0x18/0x20
            [19587.754050]  [<ffffffffa1167b3c>] ? ldiskfs_map_blocks+0x5c/0x5e0 [ldiskfs]
            [19587.762459]  [<ffffffff8168a3d5>] ? __wait_on_bit+0x65/0x90
            [19587.769313]  [<ffffffff8168a8a0>] ? bit_wait+0x50/0x50
            [19587.775671]  [<ffffffff8168a481>] ? out_of_line_wait_on_bit+0x81/0xb0
            [19587.783486]  [<ffffffff810b1be0>] ? wake_bit_function+0x40/0x40
            [19587.790716]  [<ffffffff8123341a>] ? __wait_on_buffer+0x2a/0x30
            [19587.797859]  [<ffffffffa116833c>] ? ldiskfs_bread+0x7c/0xc0 [ldiskfs]
            [19587.806016]  [<ffffffffa122f0a0>] ? iam_node_read+0x60/0xe0 [osd_ldiskfs]
            [19587.814213]  [<ffffffffa122a6c7>] ? fid_is_on_ost+0x1c7/0x420 [osd_ldiskfs]
            [19587.822600]  [<ffffffffa122a9c4>] ? osd_oi_lookup+0xa4/0x160 [osd_ldiskfs]
            [19587.830894]  [<ffffffffa0bc43b3>] ? libcfs_log_goto+0x23/0x30 [libcfs]
            [19587.838792]  [<ffffffffa123de09>] ? osd_scrub_get_fid+0x1f9/0x290 [osd_ldiskfs]
            [19587.847565]  [<ffffffffa1242a62>] ? osd_scrub_exec+0x62/0x5b0 [osd_ldiskfs]
            [19587.855967]  [<ffffffffa11631fe>] ? ldiskfs_read_inode_bitmap+0x1fe/0x5b0 [ldiskfs]
            [19587.865156]  [<ffffffffa1244225>] ? osd_inode_iteration+0x475/0xc70 [osd_ldiskfs]
            [19587.874494]  [<ffffffffa1242a00>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
            [19587.883303]  [<ffffffffa123eab0>] ? osd_preload_next+0xa0/0xa0 [osd_ldiskfs]
            [19587.891811]  [<ffffffffa1245360>] ? osd_scrub_main+0x940/0xf00 [osd_ldiskfs]
            [19587.900318]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
            [19587.907177]  [<ffffffffa1244a20>] ? osd_inode_iteration+0xc70/0xc70 [osd_ldiskfs]
            [19587.916170]  [<ffffffff810b0a4f>] ? kthread+0xcf/0xe0
            [19587.922453]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
            [19587.930372]  [<ffffffff81697798>] ? ret_from_fork+0x58/0x90
            [19587.937217]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
            

            Lai, could you take a look at the logs and stack traces to see if there is anything suspicious? Thank you.

            niu Niu Yawei (Inactive) added a comment
            cliffw Cliff White (Inactive) added a comment (edited):

            I am seeing two things constantly with this build:

            1. After failover, recovery takes a very long time on the MDS:

            soak-10
            ----------------
            mdt.soaked-MDT0002.recovery_status=
            status: WAITING
            non-ready MDTs:  0000 0001 0003
            recovery_start: 1499361565
            time_waited: 3037

            2. When LFSCK is started, it times out and the abort command is issued by the framework. The abort takes a very long time.

            2017-07-06 17:29:30,838:fsmgmt.fsmgmt:INFO     executing cmd: lctl lfsck_stop -M soaked-MDT0000
            2017-07-06 18:13:57,169:fsmgmt.fsmgmt:INFO     lfsck stopped on soak-8
            2017-07-06 18:13:57,170:fsmgmt.fsmgmt:INFO     mds_restart completed, lfsck completed
            Lustre log dumps from all MDS nodes and further stack traces are attached.
            

            [ 3113.596481] Lustre: soaked-MDT0002: Recovery already passed deadline 30:28, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
            [ 3113.617684] Lustre: soaked-MDT0002: Connection restored to a953293b-361e-a006-a3a3-0e98a9c1b777 (at 172.16.1.49@o2ib1)
            [ 3318.928937] Lustre: 3702:0:(ldlm_lib.c:1784:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
            [ 3318.946182] Lustre: 3702:0:(ldlm_lib.c:1784:extend_recovery_timer()) Skipped 31 previous similar messages
            [ 3628.813529] Lustre: soaked-MDT0002: recovery is timed out, evict stale exports
            [ 3628.824816] Lustre: soaked-MDT0002: disconnecting 2 stale clients
            [ 3628.836011] Lustre: 3702:0:(ldlm_lib.c:1630:abort_req_replay_queue()) @@@ aborted: req@ffff8803e67c8300 x1572096734652000/t0(1181119750277) o35->13a5519c-4a88-30e7-e970-6d733706fa78@172.16.1.45@o2ib1:501/0 lens 512/0 e 129 to 0 dl 1499364816 ref 1 fl Complete:/4/ffffffff rc 0/-1
            [ 3628.893326] Lustre: soaked-MDT0002: Denying connection for new client a519ea7f-c98c-a97e-e2cf-40376209045c(at 192.168.1.116@o2ib), waiting for 37 known clients (34 recovered, 1 in progress, and 2 evicted) to recover in 21188466:16
            [ 3628.900998] Lustre: soaked-MDT0002: Received new LWP connection from 192.168.1.108@o2ib, removing former export from same NID
            [ 3628.901105] Lustre: soaked-MDT0002: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
            [ 3628.908451] Lustre: 3702:0:(ldlm_lib.c:2049:target_recovery_overseer()) soaked-MDT0002 recovery is aborted by hard timeout
            [ 3628.908459] Lustre: 3702:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            [ 3628.917834] Lustre: soaked-MDT0002: Recovery over after 54:04, of 37 clients 34 recovered and 3 were evicted.
            [ 3652.712709] Lustre: soaked-MDT0002: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
            [ 3652.726408] Lustre: Skipped 6 previous similar messages
            [ 3659.981726] LustreError: 3722:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 968 actual 344.

            
            

            All the evicted clients are MDT-MDT clients; it looks like a DNE recovery problem, not related to this patch. Cliff, did you ever observe similar recovery failures during past soak testing?

            Are these clients evicted by the MGS (192.168.1.103@o2ib, 106, 109 and 111) MDTs or Lustre clients? Did you observe any hung MDT during recovery? Could you attach the logs from all MDTs to the ticket? It's quite slow to log in and operate on the soak nodes from here. (If any MDT hangs during recovery, please also dump its full stack trace and attach it to the ticket.) Thanks in advance.

            niu Niu Yawei (Inactive) added a comment

            I am not seeing LBUGs so far, but I am seeing a consistent pattern where, after a failover, most or all clients are evicted and recovery is aborted. Some sample messages:

            Jul  5 15:05:21 soak-8 kernel: Lustre: 4763:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            Jul  5 15:05:21 soak-8 kernel: Lustre: soaked-MDT0000: Recovery over after 2:18, of 37 clients 3 recovered and 34 were evicted.
            Jul  5 15:37:02 soak-8 kernel: Lustre: MGS: haven't heard from client 1b0cd471-e969-acd0-bd39-45b87774c0c3 (at 192.168.1.107@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8804b978ac00, cur 1499269022 expire 1499268872 last 1499268795
            Jul  5 16:47:07 soak-8 kernel: Lustre: MGS: haven't heard from client adfa67c6-0880-1a84-5ecf-172c556a5dd6 (at 192.168.1.111@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8807dd474c00, cur 1499273227 expire 1499273077 last 1499273000
            Jul  5 16:47:16 soak-8 kernel: Lustre: soaked-MDT0000: haven't heard from client soaked-MDT0003-mdtlov_UUID (at 192.168.1.111@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff8807dea70400, cur 1499273236 expire 1499273086 last 1499273008
            Jul  5 16:50:06 soak-8 kernel: LustreError: 167-0: soaked-MDT0003-osp-MDT0000: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
            Jul  5 17:18:19 soak-8 kernel: Lustre: MGS: haven't heard from client 8fe08516-dbc0-85e3-6fb4-93c94095a877 (at 192.168.1.106@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8807dd475800, cur 1499275099 expire 1499274949 last 1499274872
            Jul  5 18:14:24 soak-8 kernel: Lustre: 4771:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            Jul  5 18:14:24 soak-8 kernel: Lustre: soaked-MDT0000: Denying connection for new client a1e375f2-7cfb-7e06-98b7-7ea8a7f2eca4(at 192.168.1.130@o2ib), waiting for 37 known clients (13 recovered, 0 in progress, and 24 evicted) to recover in 4:02
            Jul  5 18:14:24 soak-8 kernel: Lustre: soaked-MDT0000: Recovery over after 11:00, of 37 clients 13 recovered and 24 were evicted.
            Jul  5 18:44:53 soak-8 kernel: Lustre: MGS: haven't heard from client e13056ec-cbfc-e019-c014-c1a0cfd67815 (at 192.168.1.103@o2ib) in 229 seconds. I think it's dead, and I am evicting it. exp ffff88017917cc00, cur 1499280293 expire 1499280143 last 1499280064
            Jul  5 19:07:31 soak-8 kernel: Lustre: MGS: haven't heard from client 9b21a845-8a0f-5a86-e846-32a309c2b726 (at 192.168.1.111@o2ib) in 233 seconds. I think it's dead, and I am evicting it. exp ffff88017976dc00, cur 1499281651 expire 1499281501 last 1499281418
            Jul  5 19:10:18 soak-8 kernel: LustreError: 167-0: soaked-MDT0003-osp-MDT0000: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
            Jul  5 19:35:10 soak-8 kernel: Lustre: MGS: haven't heard from client 687666be-1c95-4869-bffb-261e5c2ec0e7 (at 192.168.1.111@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff8807dbcfe000, cur 1499283310 expire 1499283160 last 1499283082
            Jul  5 19:37:55 soak-8 kernel: LustreError: 167-0: soaked-MDT0003-osp-MDT0000: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
            Jul  5 20:06:07 soak-8 kernel: Lustre: MGS: haven't heard from client a43f28ec-dc91-d483-5a19-102f4ecfe255 (at 192.168.1.109@o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff880404926c00, cur 1499285167 expire 1499285017 last 1499284939
            Jul  5 20:07:15 soak-8 kernel: Lustre: 15754:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            Jul  5 20:08:00 soak-8 kernel: LustreError: 167-0: soaked-MDT0001-osp-MDT0000: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
            

            It would be good if someone could log in and have a look.

            cliffw Cliff White (Inactive) added a comment

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 9
