[LU-11761] blocked MDT mount and high cpu usage from lodXXXX_recYYYY threads Created: 11/Dec/18  Updated: 09/Dec/21  Resolved: 17/Jul/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.6 3.10.0-957.1.3.el7_lustre.x86_64 Lustre 2.12.0 RC2


Attachments: fir-md1-s1_2.12.2_119_SRCC.log, fir-md1-s2_2.12.2_119_SRCC.log
Issue Links:
Duplicate
is duplicated by LU-10401 sanity test_133g: timeout during MDT ... Resolved
Related
is related to LU-5871 Do not return -EAGAIN in lod_object_init Resolved
is related to LU-12360 Can't restart filesystem (2.12) even ... Reopened
is related to LU-13468 Timeout-ed FLD_QUERY rpc leads to cli... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Another issue hit while testing 2.12.0 RC2... the MDT mounts never seem to complete and the following threads use 100% CPU:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 20953 root 20 0 0 0 0 R 100.0 0.0 27:00.33 lod0002_rec0001
 20954 root 20 0 0 0 0 R 100.0 0.0 27:00.34 lod0002_rec0003

This is on fir-md1-s1 that handles MDT0 and MDT2 on this test system.

sysrq-t shows:

Dec 11 09:50:13 fir-md1-s1 kernel: lod0002_rec0001 R  running task        0 20953      2 0x00000080
Dec 11 09:50:13 fir-md1-s1 kernel: Call Trace:
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3cc5d>] ? keys_fini+0x2d/0x1d0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3ce2b>] lu_context_fini+0x2b/0xa0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3d0da>] lu_env_init+0x1a/0x30 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0f19b68>] ptlrpc_set_wait+0x7d8/0x8d0 [ptlrpc]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d515e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d09f3c>] ? obd_get_request_slot+0x3c/0x280 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0f19ce3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc088e334>] fld_client_rpc+0x104/0x540 [fld]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0892f5f>] fld_server_lookup+0x15f/0x320 [fld]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc1684587>] lod_fld_lookup+0x327/0x510 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16997dd>] lod_object_init+0x7d/0x3c0 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3dfd5>] lu_object_alloc+0xe5/0x320 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3e2e6>] lu_object_find_at+0x76/0x280 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3f78d>] dt_locate_at+0x1d/0xb0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d02b4c>] llog_osd_open+0xfc/0xf30 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3e789>] ? lu_object_put+0x279/0x3d0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0ceff20>] llog_open+0x140/0x3d0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16bdeed>] lod_sub_prep_llog+0x14d/0x783 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16837ab>] lod_sub_recovery_thread+0x1cb/0xc80 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16835e0>] ? lod_obd_get_info+0x9d0/0x9d0 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa7ac1c31>] kthread+0xd1/0xe0
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa8174c24>] ret_from_fork_nospec_begin+0xe/0x21
Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40

and

Dec 11 09:44:24 fir-md1-s1 kernel: lod0002_rec0003 R  running task        0 20954      2 0x00000080
Dec 11 09:44:24 fir-md1-s1 kernel: Call Trace:
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3cfa3>] ? lu_context_init+0xd3/0x1f0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3ceba>] ? lu_env_fini+0x1a/0x30 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0f19b68>] ? ptlrpc_set_wait+0x7d8/0x8d0 [ptlrpc]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d515e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d09f3c>] ? obd_get_request_slot+0x3c/0x280 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0f19ce3>] ? ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc088e421>] ? fld_client_rpc+0x1f1/0x540 [fld]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0892f5f>] ? fld_server_lookup+0x15f/0x320 [fld]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc1684587>] ? lod_fld_lookup+0x327/0x510 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16997dd>] ? lod_object_init+0x7d/0x3c0 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3dfd5>] ? lu_object_alloc+0xe5/0x320 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3e2e6>] ? lu_object_find_at+0x76/0x280 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3f78d>] ? dt_locate_at+0x1d/0xb0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d02b4c>] ? llog_osd_open+0xfc/0xf30 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3e789>] ? lu_object_put+0x279/0x3d0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0ceff20>] ? llog_open+0x140/0x3d0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16bdeed>] ? lod_sub_prep_llog+0x14d/0x783 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16837ab>] ? lod_sub_recovery_thread+0x1cb/0xc80 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16835e0>] ? lod_obd_get_info+0x9d0/0x9d0 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa7ac1c31>] ? kthread+0xd1/0xe0
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa8174c24>] ? ret_from_fork_nospec_begin+0xe/0x21
Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40

Mount commands are stuck, even when using -o abort_recov.

I took a crash dump just in case you're interested.

I believe this is a regression relative to earlier 2.11.x versions...

HTH,
Stephane



 Comments   
Comment by Peter Jones [ 11/Dec/18 ]

Lai

Could you please make an assessment of this issue? Does this seem to be fallout from LU-11753?

Stephane

Could you please attach the debug logs to the Jira ticket and upload the crash dump to the Whamcloud ftp site?

Thanks

Peter

Comment by Stephane Thiell [ 11/Dec/18 ]

Sure, I uploaded the vmcore and corresponding debuginfo rpms to ftp.whamcloud.com:

  • vmcore as LU-11761-vmcore-fir-md1-s1-2018-12-11-10-05-46

debuginfo rpms as:

  • lustre-debuginfo-2.12.0_RC2-1.el7.x86_64.rpm
  • kernel-debuginfo-3.10.0-957.1.3.el7_lustre.x86_64.rpm.1 (please disregard kernel-debuginfo-3.10.0-957.1.3.el7_lustre.x86_64.rpm)
  • kernel-debuginfo-common-x86_64-3.10.0-957.1.3.el7_lustre.x86_64.rpm

And yes, this could be fallout from LU-11753, as I started having this issue after the problematic lfs migrate -m ...

Comment by Lai Siyao [ 12/Dec/18 ]

Peter, this looks to be caused by a corrupted update log, which may have been generated by LU-11753. I'll check the logs for details.

Stephane, did you use multiple MDTs before? The update log is only used on DNE systems.

Comment by Lai Siyao [ 17/Dec/18 ]

This will be fixed in LU-10888, but it's not finished yet.

Comment by Stephane Thiell [ 28/Apr/19 ]

Hit this again when mounting MDTs with -o abort_recov:

Apr 28 10:56:16 fir-md1-s1 kernel: Lustre: fir-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Apr 28 10:56:17 fir-md1-s1 kernel: Lustre: fir-MDD0002: changelog on
Apr 28 10:56:17 fir-md1-s1 kernel: Lustre: fir-MDT0002: in recovery but waiting for the first client to connect
Apr 28 10:56:17 fir-md1-s1 kernel: LustreError: 39987:0:(mdt_handler.c:6447:mdt_iocontrol()) fir-MDT0002: Aborting recovery for device
Apr 28 10:56:17 fir-md1-s1 kernel: LustreError: 39987:0:(ldlm_lib.c:2605:target_stop_recovery_thread()) fir-MDT0002: Aborting recovery
Apr 28 10:56:17 fir-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
Apr 28 10:56:17 fir-md1-s1 kernel: Lustre: 40232:0:(ldlm_lib.c:2058:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Apr 28 10:56:17 fir-md1-s1 kernel: Lustre: 40232:0:(ldlm_lib.c:2058:target_recovery_overseer()) Skipped 2 previous similar messages
Apr 28 10:56:23 fir-md1-s1 kernel: LustreError: 23568:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) ldlm_cancel from 10.9.101.9@o2ib4 arrived at 1556474183 with bad export cookie 3999205256457834672
Apr 28 10:56:23 fir-md1-s1 kernel: LustreError: 23568:0:(ldlm_lockd.c:2322:ldlm_cancel_handler()) Skipped 2 previous similar messages
Apr 28 10:58:55 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.9.105.54@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other 
Apr 28 10:58:55 fir-md1-s1 kernel: LustreError: Skipped 16004 previous similar messages
Apr 28 11:00:08 fir-md1-s1 kernel: INFO: task mount.lustre:39987 blocked for more than 120 seconds.
Apr 28 11:00:08 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 28 11:00:08 fir-md1-s1 kernel: mount.lustre    D ffff8b625635b0c0     0 39987  39986 0x00000082
Apr 28 11:00:08 fir-md1-s1 kernel: Call Trace:
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9bb67c49>] schedule+0x29/0x70
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9bb65721>] schedule_timeout+0x221/0x2d0
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9bb67ffd>] wait_for_completion+0xfd/0x140
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b4d67b0>] ? wake_up_state+0x20/0x20
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0d6936d>] target_stop_recovery_thread.part.18+0x3d/0xd0 [ptlrpc]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0d69418>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc146e868>] mdt_iocontrol+0x558/0xb00 [mdt]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b9a766>] server_start_targets+0x1c66/0x2a20 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b6eb10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b67ea0>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b9c5ec>] server_fill_super+0x10cc/0x1890 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b71828>] lustre_fill_super+0x328/0x950 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b71500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b6452cf>] mount_nodev+0x4f/0xb0
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b69908>] lustre_mount+0x38/0x60 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b645e4e>] mount_fs+0x3e/0x1b0
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b6639e7>] vfs_kern_mount+0x67/0x110
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b66600f>] do_mount+0x1ef/0xce0
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b665d45>] ? copy_mount_options+0xc5/0x170
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b666e43>] SyS_mount+0x83/0xd0
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9bb74ddb>] system_call_fastpath+0x22/0x27
Apr 28 11:00:08 fir-md1-s1 kernel: INFO: task mount.lustre:40007 blocked for more than 120 seconds.
Apr 28 11:00:08 fir-md1-s1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 28 11:00:08 fir-md1-s1 kernel: mount.lustre    D ffff8b625635e180     0 40007  40006 0x00000082
Apr 28 11:00:08 fir-md1-s1 kernel: Call Trace:
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9bb68b69>] schedule_preempt_disabled+0x29/0x70
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9bb66ab7>] __mutex_lock_slowpath+0xc7/0x1d0
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9bb65e9f>] mutex_lock+0x1f/0x2f
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc104f398>] mgc_set_info_async+0xa98/0x15f0 [mgc]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b73bec>] ? keys_fill+0xfc/0x180 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b98e1a>] server_start_targets+0x31a/0x2a20 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b6eb10>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b69f90>] ? do_lcfg+0x2f0/0x500 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b9c5ec>] server_fill_super+0x10cc/0x1890 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b71828>] lustre_fill_super+0x328/0x950 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b71500>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b6452cf>] mount_nodev+0x4f/0xb0
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffffc0b69908>] lustre_mount+0x38/0x60 [obdclass]
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b645e4e>] mount_fs+0x3e/0x1b0
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b6639e7>] vfs_kern_mount+0x67/0x110
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b66600f>] do_mount+0x1ef/0xce0
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b63e2aa>] ? __check_object_size+0x1ca/0x250
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b61c74c>] ? kmem_cache_alloc_trace+0x3c/0x200
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9b666e43>] SyS_mount+0x83/0xd0
Apr 28 11:00:08 fir-md1-s1 kernel:  [<ffffffff9bb74ddb>] system_call_fastpath+0x22/0x27

top:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                      
 40230 root      20   0       0      0      0 R 100.0  0.0   4:14.53 lod0002_rec0001  
Comment by Hongchao Zhang [ 29/Apr/19 ]

In fld_client_rpc(), the FLD_QUERY request is retried in a loop if the connection to MDT0000 has not been established:

int fld_client_rpc(struct obd_export *exp,
                   struct lu_seq_range *range, u32 fld_op,
                   struct ptlrpc_request **reqp)
{
            ...
            if (rc != 0) {
                if (imp->imp_state != LUSTRE_IMP_CLOSED &&
                    !imp->imp_deactive &&
                    imp->imp_connect_flags_orig & OBD_CONNECT_MDS_MDS &&
                    OCD_HAS_FLAG(&imp->imp_connect_data, LIGHTWEIGHT) &&
                    rc != -ENOTSUPP) {
                        /*
                         * Since LWP is not replayable, so it will keep
                         * trying unless umount happens or the remote
                         * target does not support the operation, otherwise
                         * it would cause unecessary failure of the
                         * application.
                         */
                        ptlrpc_req_finished(req);
                        rc = 0;
                        goto again;
                }
                GOTO(out_req, rc);
                ...

In this case, the connection to MDT0000 is actually lost, so the kernel thread "lodXXXX_recYYYY" spins in fld_client_rpc:

Apr 28 10:58:55 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.9.105.54@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other 
Apr 28 10:58:55 fir-md1-s1 kernel: LustreError: Skipped 16004 previous similar messages
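
To make the failure mode concrete, here is a minimal, self-contained sketch of the idea behind the fix direction discussed here (bounding the retry, or returning the error so the caller can decide). The helper names (import_is_healthy, send_fld_query), the retry bound, and the control flow are hypothetical illustrations, not the actual patch from https://review.whamcloud.com/34962.

/*
 * Minimal standalone sketch, NOT the actual Lustre code: an unconditional
 * "goto again" spins forever at 100% CPU when the peer MDT never comes
 * back, so the loop should either bound its retries or propagate the
 * error and let the caller decide whether to retry.
 */
#include <stdio.h>
#include <stdbool.h>

#define MAX_FLD_RETRIES 5   /* hypothetical bound for illustration */

/* Stand-ins for "is the MDT0000 import usable?" and "send FLD_QUERY". */
static bool import_is_healthy(int attempt) { return attempt >= 3; }
static int  send_fld_query(void)           { return 0; }

static int fld_query_with_retry(void)
{
        for (int attempt = 0; attempt < MAX_FLD_RETRIES; attempt++) {
                if (!import_is_healthy(attempt)) {
                        /* Connection not established yet: retry a bounded
                         * number of times instead of looping forever like
                         * the lodXXXX_recYYYY threads did. */
                        continue;
                }
                return send_fld_query();
        }
        /* Give up and propagate the error so the caller (e.g. the lod
         * recovery thread) can retry later or abort cleanly. */
        return -11; /* -EAGAIN */
}

int main(void)
{
        printf("fld_query_with_retry() = %d\n", fld_query_with_retry());
        return 0;
}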
Comment by Gerrit Updater [ 27/May/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34962
Subject: LU-11761 fld: limit FLD_QUERY retry between MDTs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f2c831def8ac010b7a913633141deb2b5384e06c

Comment by Stephane Thiell [ 21/Jun/19 ]

Thanks Hongchao! FYI, I'll use your current patch in case of emergency (e.g. an MDT restart) on our system, and rebuild when it's fully ready. We need to be able to abort the MDT recovery or, better, the MDT recovery needs to be able to time out, which is not always the case today, as described in LU-12360.

Comment by Gerrit Updater [ 17/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34962/
Subject: LU-11761 fld: let's caller to retry FLD_QUERY
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e3f6111dfd1c6f2266d0beef67e5a7514a6965d0

Comment by Peter Jones [ 17/Jul/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 01/Aug/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35661
Subject: LU-11761 fld: let's caller to retry FLD_QUERY
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: fe3d5156ac95e58c5fc62ab503dcff3619ccfbe0

Comment by Stephane Thiell [ 04/Sep/19 ]

Is there any chance of landing this one in 2.12.3?

Comment by Peter Jones [ 04/Sep/19 ]

Yes, it is quite likely this will end up in 2.12.3.

Comment by Stephane Thiell [ 05/Sep/19 ]

Great! With this patch on top of current b2_12 (so quite a lot of changes for us, not only this patch), the recovery worked on all 4 MDTs:

fir-md1-s2: Sep 05 05:59:04 fir-md1-s2 kernel: Lustre: fir-MDT0001: Recovery over after 5:26, of 1378 clients 1378 recovered and 0 were evicted.
fir-md1-s2: Sep 05 05:59:11 fir-md1-s2 kernel: Lustre: fir-MDT0003: Recovery over after 0:32, of 1378 clients 1378 recovered and 0 were evicted.
fir-md1-s1: Sep 05 05:59:07 fir-md1-s1 kernel: Lustre: fir-MDT0002: Recovery over after 5:26, of 1379 clients 1379 recovered and 0 were evicted.
fir-md1-s1: Sep 05 05:59:25 fir-md1-s1 kernel: Lustre: fir-MDT0000: Recovery over after 0:43, of 1379 clients 1379 recovered and 0 were evicted.

Minor thing: two MDTs took somewhat longer to mount, but after a ~300s timeout it worked:

[ 4563.995907] LustreError: 39383:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567688019, 300s ago), entering recovery for MGS@10.0.10.51@o2ib7 ns: MGC10.0.10.51@o2ib7 lock: ffff8fb0c2240480/0x5731634ee5f506c6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 3 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce1399443f3 expref: -99 pid: 39383 timeout: 0 lvb_type: 0
[ 4564.033287] LustreError: 40296:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff8fb0c1334600) refcount nonzero (2) after lock cleanup; forcing cleanup.
[ 4564.052847] Lustre: MGC10.0.10.51@o2ib7: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7)
[ 4564.062155] Lustre: Skipped 1 previous similar message
[ 4564.070150] Lustre: fir-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 4564.323008] Lustre: fir-MDD0003: changelog on
[ 4564.332139] Lustre: fir-MDT0003: in recovery but waiting for the first client to connect
[ 4564.429981] Lustre: fir-MDT0003: Will be in recovery for at least 2:30, or until 1378 clients reconnect

In case you want to investigate this timeout issue, the kernel logs of the successful start of 2.12.2_119 on our two MDSes are attached as:
fir-md1-s1_2.12.2_119_SRCC.log and fir-md1-s2_2.12.2_119_SRCC.log

But no 100% loaded lod* threads anymore. Not even a single eviction. This is awesome. Thanks all!

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35661/
Subject: LU-11761 fld: let's caller to retry FLD_QUERY
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 80eb31f5367f5b879b544a61a938a8acd7de9cf5
