[LU-11761] Blocked MDT mount and high CPU usage from lodXXXX_recYYYY threads
Created: 11/Dec/18 | Updated: 09/Dec/21 | Resolved: 17/Jul/19
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | CentOS 7.6, kernel 3.10.0-957.1.3.el7_lustre.x86_64, Lustre 2.12.0 RC2 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Another issue when using 2.12.0 RC2 during testing... MDT mounts never seem to complete, and the following threads take 100% CPU:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20953 root 20 0 0 0 0 R 100.0 0.0 27:00.33 lod0002_rec0001
20954 root 20 0 0 0 0 R 100.0 0.0 27:00.34 lod0002_rec0003

This is on fir-md1-s1, which handles MDT0 and MDT2 on this test system. sysrq-t shows:

Dec 11 09:50:13 fir-md1-s1 kernel: lod0002_rec0001 R running task 0 20953 2 0x00000080
Dec 11 09:50:13 fir-md1-s1 kernel: Call Trace:
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3cc5d>] ? keys_fini+0x2d/0x1d0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3ce2b>] lu_context_fini+0x2b/0xa0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3d0da>] lu_env_init+0x1a/0x30 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0f19b68>] ptlrpc_set_wait+0x7d8/0x8d0 [ptlrpc]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d515e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d09f3c>] ? obd_get_request_slot+0x3c/0x280 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0f19ce3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc088e334>] fld_client_rpc+0x104/0x540 [fld]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0892f5f>] fld_server_lookup+0x15f/0x320 [fld]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc1684587>] lod_fld_lookup+0x327/0x510 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc16997dd>] lod_object_init+0x7d/0x3c0 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3dfd5>] lu_object_alloc+0xe5/0x320 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3e2e6>] lu_object_find_at+0x76/0x280 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3f78d>] dt_locate_at+0x1d/0xb0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d02b4c>] llog_osd_open+0xfc/0xf30 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3e789>] ? lu_object_put+0x279/0x3d0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0ceff20>] llog_open+0x140/0x3d0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc16bdeed>] lod_sub_prep_llog+0x14d/0x783 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc16837ab>] lod_sub_recovery_thread+0x1cb/0xc80 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc16835e0>] ? lod_obd_get_info+0x9d0/0x9d0 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffa7ac1c31>] kthread+0xd1/0xe0
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffa8174c24>] ret_from_fork_nospec_begin+0xe/0x21
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40

and

Dec 11 09:44:24 fir-md1-s1 kernel: lod0002_rec0003 R running task 0 20954 2 0x00000080
Dec 11 09:44:24 fir-md1-s1 kernel: Call Trace:
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3cfa3>] ? lu_context_init+0xd3/0x1f0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3ceba>] ? lu_env_fini+0x1a/0x30 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0f19b68>] ? ptlrpc_set_wait+0x7d8/0x8d0 [ptlrpc]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d515e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d09f3c>] ? obd_get_request_slot+0x3c/0x280 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0f19ce3>] ? ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc088e421>] ? fld_client_rpc+0x1f1/0x540 [fld]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0892f5f>] ? fld_server_lookup+0x15f/0x320 [fld]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc1684587>] ? lod_fld_lookup+0x327/0x510 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc16997dd>] ? lod_object_init+0x7d/0x3c0 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3dfd5>] ? lu_object_alloc+0xe5/0x320 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3e2e6>] ? lu_object_find_at+0x76/0x280 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3f78d>] ? dt_locate_at+0x1d/0xb0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d02b4c>] ? llog_osd_open+0xfc/0xf30 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3e789>] ? lu_object_put+0x279/0x3d0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0ceff20>] ? llog_open+0x140/0x3d0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc16bdeed>] ? lod_sub_prep_llog+0x14d/0x783 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc16837ab>] ? lod_sub_recovery_thread+0x1cb/0xc80 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc16835e0>] ? lod_obd_get_info+0x9d0/0x9d0 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffa7ac1c31>] ? kthread+0xd1/0xe0
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffa8174c24>] ? ret_from_fork_nospec_begin+0xe/0x21
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40

Mount commands are stuck, even when using -o abort_recov. I took a crash dump in case you're interested. I believe it's a regression from earlier 2.11.x versions... HTH, |
| Comments |
| Comment by Peter Jones [ 11/Dec/18 ] |
|
Lai, could you please make an assessment of this issue? Does this seem to be fallout from
Stephane, could you please attach the debug logs to the Jira ticket and upload the crash dump to the Whamcloud ftp site?
Thanks
Peter |
| Comment by Stephane Thiell [ 11/Dec/18 ] |
|
Sure, I uploaded the vmcore and corresponding debuginfo rpms to ftp.whamcloud.com:
* vmcore as
* debuginfo rpms as:
And yes, this could be fallout from |
| Comment by Lai Siyao [ 12/Dec/18 ] |
|
Peter, this looks to be caused by a corrupt update log, which may have been generated by
Stephane, did you use multiple MDTs before? The update log is only used on DNE systems. |
| Comment by Lai Siyao [ 17/Dec/18 ] |
|
This will be fixed in |
| Comment by Stephane Thiell [ 28/Apr/19 ] |
|
Hit this again when mounting MDTs with -o abort_recov:

Apr 28 10:56:16 fir-md1-s1 kernel: Lustre: fir-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Apr 28 10:56:17 fir-md1-s1 kernel: Lustre: fir-MDD0002: changelog on
Apr 28 10:56:17 fir-md1-s1 kernel: Lustre: fir-MDT0002: in recovery but waiting for the first client to connect
Apr 28 10:56:17 fir-md1-s1 kernel: LustreError: 39987:0:(mdt_handler.c:6447:mdt_iocontrol()) fir-MDT0002: Aborting recovery for device
Apr 28 10:56:17 fir-md1-s1 kernel: LustreError: 39987:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0002: Aborting recovery
Apr 28 10:56:17 fir-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
Apr 28 10:56:17 fir-md1-s1 kernel: Lustre: 40232:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Apr 28 10:56:17 fir-md1-s1 kernel: Lustre: 40232:0:(ldlm_lib.c:2058:target_recovery_overseer()) Skipped 2 previous similar messages
Apr 28 10:56:23 fir-md1-s1 kernel: LustreError: 23568:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) ldlm_cancel from 10.9.101.9@o2ib4 arrived at 1556474183 with bad export cookie 3999205256457834672
Apr 28 10:56:23 fir-md1-s1 kernel: LustreError: 23568:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) Skipped 2 previous similar messages
Apr 28 10:58:55 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.9.105.54@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other
Apr 28 10:58:55 fir-md1-s1 kernel: LustreError: Skipped 16004 previous similar messages
Apr 28 11:00:08 fir-md1-s1 kernel: INFO: task mount.lustre:39987 blocked for more than 120 seconds.
Apr 28 11:00:08 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 28 11:00:08 fir-md1-s1 kernel: mount.lustre D ffff8b625635b0c0 0 39987 39986 0x00000082
Apr 28 11:00:08 fir-md1-s1 kernel: Call Trace:
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9bb67c49>] schedule+0x29/0x70
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9bb65721>] schedule_timeout+0x221/0x2d0
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9bb67ffd>] wait_for_completion+0xfd/0x140
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b4d67b0>] ? wake_up_state+0x20/0x20
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0d6936d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0d69418>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc146e868>] mdt_iocontrol+0x558/0xb00 [mdt]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b9a766>] server_start_targets+0x1c66/0x2a20 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b6eb10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b67ea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b9c5ec>] server_fill_super+0x10cc/0x1890 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b71828>] lustre_fill_super+0x328/0x950 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b71500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b6452cf>] mount_nodev+0x4f/0xb0
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b69908>] lustre_mount+0x38/0x60 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b645e4e>] mount_fs+0x3e/0x1b0
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b6639e7>] vfs_kern_mount+0x67/0x110
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b66600f>] do_mount+0x1ef/0xce0
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b665d45>] ? copy_mount_options+0xc5/0x170
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b666e43>] SyS_mount+0x83/0xd0
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9bb74ddb>] system_call_fastpath+0x22/0x27
Apr 28 11:00:08 fir-md1-s1 kernel: INFO: task mount.lustre:40007 blocked for more than 120 seconds.
Apr 28 11:00:08 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 28 11:00:08 fir-md1-s1 kernel: mount.lustre D ffff8b625635e180 0 40007 40006 0x00000082
Apr 28 11:00:08 fir-md1-s1 kernel: Call Trace:
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9bb68b69>] schedule_preempt_disabled+0x29/0x70
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9bb66ab7>] __mutex_lock_slowpath+0xc7/0x1d0
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9bb65e9f>] mutex_lock+0x1f/0x2f
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc104f398>] mgc_set_info_async+0xa98/0x15f0 [mgc]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b73bec>] ? keys_fill+0xfc/0x180 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b98e1a>] server_start_targets+0x31a/0x2a20 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b6eb10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b69f90>] ? do_lcfg+0x2f0/0x500 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b9c5ec>] server_fill_super+0x10cc/0x1890 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b71828>] lustre_fill_super+0x328/0x950 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b71500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b6452cf>] mount_nodev+0x4f/0xb0
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffffc0b69908>] lustre_mount+0x38/0x60 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b645e4e>] mount_fs+0x3e/0x1b0
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b6639e7>] vfs_kern_mount+0x67/0x110
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b66600f>] do_mount+0x1ef/0xce0
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b63e2aa>] ? __check_object_size+0x1ca/0x250
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b61c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9b666e43>] SyS_mount+0x83/0xd0
Apr 28 11:00:08 fir-md1-s1 kernel: [<ffffffff9bb74ddb>] system_call_fastpath+0x22/0x27

top:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40230 root 20 0 0 0 0 R 100.0 0.0 4:14.53 lod0002_rec0001 |
| Comment by Hongchao Zhang [ 29/Apr/19 ] |
|
In fld_client_rpc(), the FLD_QUERY request will be retried in a loop if the connection to MDT0000 has not completed:

int fld_client_rpc(struct obd_export *exp,
struct lu_seq_range *range, u32 fld_op,
struct ptlrpc_request **reqp)
{
...
if (rc != 0) {
if (imp->imp_state != LUSTRE_IMP_CLOSED &&
!imp->imp_deactive &&
imp->imp_connect_flags_orig & OBD_CONNECT_MDS_MDS &&
OCD_HAS_FLAG(&imp->imp_connect_data, LIGHTWEIGHT) &&
rc != -ENOTSUPP) {
/*
* Since LWP is not replayable, so it will keep
* trying unless umount happens or the remote
* target does not support the operation, otherwise
* it would cause unecessary failure of the
* application.
*/
ptlrpc_req_finished(req);
rc = 0;
goto again;
}
GOTO(out_req, rc);
...
In this case, the connection to MDT0000 is actually lost, so the kernel thread "lodXXXX_recYYYY" is stuck retrying in fld_client_rpc():

Apr 28 10:58:55 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.9.105.54@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other
Apr 28 10:58:55 fir-md1-s1 kernel: LustreError: Skipped 16004 previous similar messages
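For illustration, here is a minimal sketch of the kind of guard that could break this retry loop. It is only an assumption about the shape of a fix, not the patch that actually landed (https://review.whamcloud.com/34962); the extra imp_invalid test below is hypothetical.

    if (rc != 0) {
            if (imp->imp_state != LUSTRE_IMP_CLOSED &&
                !imp->imp_deactive &&
                /* assumed extra check: stop retrying once the import has
                 * been invalidated, e.g. after recovery is aborted */
                !imp->imp_invalid &&
                imp->imp_connect_flags_orig & OBD_CONNECT_MDS_MDS &&
                OCD_HAS_FLAG(&imp->imp_connect_data, LIGHTWEIGHT) &&
                rc != -ENOTSUPP) {
                    /* LWP is not replayable; keep retrying only while
                     * the import is still considered usable. */
                    ptlrpc_req_finished(req);
                    rc = 0;
                    goto again;
            }
            GOTO(out_req, rc);
    }

With a check along these lines, fld_client_rpc() would return an error to the lodXXXX_recYYYY recovery thread instead of spinning at 100% CPU, so an aborted recovery could make progress. |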
| Comment by Gerrit Updater [ 27/May/19 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34962 |
| Comment by Stephane Thiell [ 21/Jun/19 ] |
|
Thanks Hongchao! FYI, I'll use your current patch in case of emergency (e.g. an MDT restart) on our system, and rebuild when it's fully ready. We need to be able to abort the MDT recovery, or better, the MDT recovery needs to be able to time out, which is not always the case today, as described in LU-12360. |
| Comment by Gerrit Updater [ 17/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34962/ |
| Comment by Peter Jones [ 17/Jul/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 01/Aug/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35661 |
| Comment by Stephane Thiell [ 04/Sep/19 ] |
|
Is there any chance to land this one in 2.12.3? |
| Comment by Peter Jones [ 04/Sep/19 ] |
|
Yes it is quite likely this will end up in 2.12.3 |
| Comment by Stephane Thiell [ 05/Sep/19 ] |
|
Great! With this patch on top of current b2_12 (so quite a lot of changes for us, not only this patch), the recovery worked on all 4 MDTs:

fir-md1-s2: Sep 05 05:59:04 fir-md1-s2 kernel: Lustre: fir-MDT0001: Recovery over after 5:26, of 1378 clients 1378 recovered and 0 were evicted.
fir-md1-s2: Sep 05 05:59:11 fir-md1-s2 kernel: Lustre: fir-MDT0003: Recovery over after 0:32, of 1378 clients 1378 recovered and 0 were evicted.
fir-md1-s1: Sep 05 05:59:07 fir-md1-s1 kernel: Lustre: fir-MDT0002: Recovery over after 5:26, of 1379 clients 1379 recovered and 0 were evicted.
fir-md1-s1: Sep 05 05:59:25 fir-md1-s1 kernel: Lustre: fir-MDT0000: Recovery over after 0:43, of 1379 clients 1379 recovered and 0 were evicted.

Minor thing: two MDTs took some more time to mount, but after some 300s timeout it worked:

[ 4563.995907] LustreError: 39383:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567688019, 300s ago), entering recovery for MGS@10.0.10.51@o2ib7 ns: MGC10.0.10.51@o2ib7 lock: ffff8fb0c2240480/0x5731634ee5f506c6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 3 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce1399443f3 expref: -99 pid: 39383 timeout: 0 lvb_type: 0
[ 4564.033287] LustreError: 40296:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff8fb0c1334600) refcount nonzero (2) after lock cleanup; forcing cleanup.
[ 4564.052847] Lustre: MGC10.0.10.51@o2ib7: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7)
[ 4564.062155] Lustre: Skipped 1 previous similar message
[ 4564.070150] Lustre: fir-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 4564.323008] Lustre: fir-MDD0003: changelog on
[ 4564.332139] Lustre: fir-MDT0003: in recovery but waiting for the first client to connect
[ 4564.429981] Lustre: fir-MDT0003: Will be in recovery for at least 2:30, or until 1378 clients reconnect

In case you want to investigate this timeout issue, the kernel logs of the successful start of 2.12.2_119 on our two MDSes are attached as:

But no 100% loaded lod* threads anymore. Not even a single eviction. This is awesome. Thanks all! |
| Comment by Gerrit Updater [ 12/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35661/ |