[LU-15934] client refused mount with -EAGAIN because of missing MDT-MDT connection Created: 12/Jun/22 Updated: 20/Dec/23 Resolved: 28/Jun/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Andreas Dilger | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
New clients were unable to establish a connection to the MDT, even after recovery had been aborted, because an llog context was not set up properly. The clients were permanently getting -11 = -EAGAIN errors from the server:

(service.c:2298:ptlrpc_server_handle_request()) Handling RPC req@ffff8cdd37ad0d80 pname:cluuid+ref:pid:xid:nid:opc:job mdt09_001:0+-99:4093:x1719493089340032:12345-10.16.172.159@tcp:38:
(service.c:2303:ptlrpc_server_handle_request()) got req 1719493089340032
(tgt_handler.c:736:tgt_request_handle()) Process entered
(ldlm_lib.c:1100:target_handle_connect()) Process entered
(ldlm_lib.c:1360:target_handle_connect()) lfs02-MDT0003: connection from 16778a5c-5128-4231-8b45-426adc7e94b6@10.16.172.159@tcp t55835524055 exp (null) cur 51537 last 0
(obd_class.h:831:obd_connect()) Process entered
(mdt_handler.c:6671:mdt_obd_connect()) Process entered
(lod_dev.c:2136:lod_obd_get_info()) lfs02-MDT0003-mdtlov: lfs02-MDT0001-osp-MDT0003 is not ready.
(lod_dev.c:2145:lod_obd_get_info()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
(ldlm_lib.c:1446:target_handle_connect()) Process leaving via out (rc=18446744073709551605 : -11 : 0xfffffffffffffff5)
(service.c:2347:ptlrpc_server_handle_request()) Handled RPC req@ffff8cdd37ad0d80 pname:cluuid+ref:pid:xid:nid:opc:job mdt09_001:0+-99:4093:x1719493089340032:12345-10.16.172.159@tcp:38: Request processed in 86us (124us total) trans 0 rc -11/-11

This corresponds to the following block of code in lod_obd_get_info(); here it is the second "is not ready" message that is being printed, triggered by the missing ctxt->loc_handle:
lod_foreach_mdt(d, tgt) {
        struct llog_ctxt *ctxt;

        if (!tgt->ltd_active)
                continue;

        ctxt = llog_get_context(tgt->ltd_tgt->dd_lu_dev.ld_obd,
                                LLOG_UPDATELOG_ORIG_CTXT);
        if (!ctxt) {
                CDEBUG(D_INFO, "%s: %s is not ready.\n",
                       obd->obd_name,
                       tgt->ltd_tgt->dd_lu_dev.ld_obd->obd_name);
                rc = -EAGAIN;
                break;
        }
        if (!ctxt->loc_handle) {
                CDEBUG(D_INFO, "%s: %s is not ready.\n",
                       obd->obd_name,
                       tgt->ltd_tgt->dd_lu_dev.ld_obd->obd_name);
                rc = -EAGAIN;
                llog_ctxt_put(ctxt);
                break;
        }
        llog_ctxt_put(ctxt);
}
It would be useful to distinguish those two messages more clearly, e.g. "ctxt is not ready" and "handle is not ready", since the two messages are identical and differ only by a few source lines, which makes them difficult to tell apart in the logs (see the sketch below). The root problem is that the MDT0003-MDT0001 connection wasn't completely set up due to abort_recovery_mdt (caused by a different recovery error). I think there are two issues to be addressed here. |
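As a rough sketch of the message suggestion above (illustration only, not the patch that eventually landed for 2.16), the two CDEBUG calls in the block quoted in the description could simply print distinct text:

        if (!ctxt) {
                CDEBUG(D_INFO, "%s: %s: ctxt is not ready\n",
                       obd->obd_name,
                       tgt->ltd_tgt->dd_lu_dev.ld_obd->obd_name);
                rc = -EAGAIN;
                break;
        }
        if (!ctxt->loc_handle) {
                CDEBUG(D_INFO, "%s: %s: handle is not ready\n",
                       obd->obd_name,
                       tgt->ltd_tgt->dd_lu_dev.ld_obd->obd_name);
                rc = -EAGAIN;
                llog_ctxt_put(ctxt);
                break;
        }

With distinct strings, a log reader no longer has to map source line numbers back to the right CDEBUG call to know whether the context or its handle was missing.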
| Comments |
| Comment by Andreas Dilger [ 14/Dec/22 ] |
|
Hit the same issue on another system.

[OI Scrub running in a loop because FID is missing]
:
1670963811.447855:0:28018:0:(client.c:1498:after_reply()) @@@ resending request on EINPROGRESS req@ffff8ec49336cc80 x1752076841808640/t0(0) o1000->fs01-MDT0001-osp-MDT0000@172.16.1.10@o2ib:24/4 lens 304/4320 e 0 to 0 dl 1670963849 ref 2 fl Rpc:RQU/2/0 rc 0/-115 job:''
:
[OI Scrub is killed]
:
1670963903.447724:0:28018:0:(osp_object.c:596:osp_attr_get()) fs01-MDT0001-osp-MDT0000:osp_attr_get update error [0x900000404:0x1:0x0]: rc = -78
1670963903.447734:0:28018:0:(lod_dev.c:425:lod_sub_recovery_thread()) fs01-MDT0001-osp-MDT0000 get update log failed: rc = -78
1670966282.324977:0:26880:0:(lod_dev.c:2136:lod_obd_get_info()) fs01-MDT0000-mdtlov: fs01-MDT0001-osp-MDT0000 is not ready.

Later, when a client tries to mount the filesystem, it fails because the bad llog state causes the MDT to refuse all new connections:

1670966308.517999:0:29630:0:(service.c:2298:ptlrpc_server_handle_request()) Handling RPC req@ffff8ec6734bf500 pname:cluuid+ref:pid:xid:nid:opc:job mdt02_003:0+-99:13925:x1752076850908352:12345-0@lo:38:
1670966308.518019:0:29630:0:(ldlm_lib.c:1360:target_handle_connect()) fs01-MDT0000: connection from 5c5c267c-0fa0-4acb-b884-d5ce8cae08c2@0@lo t0 exp (null) cur 55901 last 0
1670966308.518036:0:29630:0:(lod_dev.c:2136:lod_obd_get_info()) fs01-MDT0000-mdtlov: fs01-MDT0001-osp-MDT0000 is not ready.
1670966308.518038:0:29630:0:(lod_dev.c:2145:lod_obd_get_info()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
1670966308.518040:0:29630:0:(mdd_device.c:1615:mdd_obd_get_info()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
1670966308.518042:0:29630:0:(mdt_handler.c:6693:mdt_obd_connect()) Process leaving (rc=18446744073709551605 : -11 : fffffffffffffff5)
1670966308.518044:0:29630:0:(ldlm_lib.c:1446:target_handle_connect()) Process leaving via out (rc=18446744073709551605 : -11 : 0xfffffffffffffff5) |
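The log above also shows why the failure never clears on its own: the -11 (-EAGAIN) returned by lod_obd_get_info() is handed straight back up through mdd_obd_get_info(), mdt_obd_connect() and target_handle_connect(), so every new connection attempt is refused the same way until the update llog is usable again. The userspace sketch below illustrates only that propagation pattern; the struct and function names are hypothetical placeholders, not the actual Lustre call chain.

/* Hypothetical illustration -- placeholder names, not Lustre APIs. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct lower_dev {
        bool llog_handle_ready;   /* cleared while the update llog is unusable */
};

/* stands in for the lod_obd_get_info() readiness check */
static int lower_get_info(struct lower_dev *d)
{
        if (!d->llog_handle_ready)
                return -EAGAIN;   /* "is not ready" */
        return 0;
}

/* stands in for the connect handler: any error refuses the client */
static int handle_connect(struct lower_dev *d)
{
        int rc = lower_get_info(d);

        return rc ? rc : 0;       /* the export would be set up here on success */
}

int main(void)
{
        struct lower_dev d = { .llog_handle_ready = false };

        /* every attempt fails identically until the llog becomes usable */
        for (int i = 0; i < 3; i++)
                printf("connect attempt %d: rc = %d\n", i, handle_connect(&d));
        return 0;
}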
| Comment by Gerrit Updater [ 29/Dec/22 ] |
|
"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49528 |
| Comment by Andreas Dilger [ 06/Jan/23 ] |
|
"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/49569 |
| Comment by Xing Huang [ 07/Jan/23 ] |
|
2023-01-07: The fix patch (#49569) is being worked on. |
| Comment by Xing Huang [ 06/Apr/23 ] |
|
2023-04-06: The fix patch (#49569) is being reviewed and may need to be updated. |
| Comment by Xing Huang [ 28/Apr/23 ] |
|
2023-04-28: The fix patch (#49569) is being improved per review feedback. |
| Comment by Xing Huang [ 06/May/23 ] |
|
2023-05-13: The improved patch (#49569) is ready to land (on the master-next branch). |
| Comment by Gerrit Updater [ 19/May/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49569/ |
| Comment by Xing Huang [ 20/May/23 ] |
|
2023-05-20: The improved patch (#49569) has landed on master; another patch (#49528) is being discussed. |
| Comment by Gerrit Updater [ 03/Jun/23 ] |
|
"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51208 |
| Comment by Xing Huang [ 11/Jun/23 ] |
|
2023-06-17: The second patch (#49528) is ready to land (on the master-next branch); the third patch, adding a test case, is being reviewed. |
| Comment by Gerrit Updater [ 20/Jun/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49528/ |
| Comment by Xing Huang [ 25/Jun/23 ] |
|
2023-06-25: The second patch (#49528) has landed on master; the third patch (#51208), adding a test case, is ready to land (on the master-next branch). |
| Comment by Gerrit Updater [ 28/Jun/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51208/ |
| Comment by Peter Jones [ 28/Jun/23 ] |
|
Landed for 2.16 |
| Comment by Yang Sheng [ 16/Oct/23 ] |
|
Hi Andreas, it looks like mds0 was still waiting for recovery, but mds1 was blocked on the LOD part rather than on communication. Do we need to prolong the waiting time? Thanks, |
| Comment by Andreas Dilger [ 19/Oct/23 ] |
|
YS, can you see why mds0 had not finished recovery? If it was making progress, then waiting longer would be OK (VM testing can be very unpredictable). However, if it is stuck for some other reason, then waiting will not help, and whatever is blocking recovery from finishing needs to be fixed. |
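As a generic illustration of that trade-off (not code from any patch on this ticket; the struct, the simulated progress counter, and the stall limit are all hypothetical), a wait that only keeps extending while recovery is observed to make progress could look like this:

/* Hypothetical sketch: keep waiting only while progress is observed. */
#include <stdbool.h>
#include <stdio.h>

struct recovery_state {
        int completed;          /* e.g. clients that have completed recovery */
        int total;
};

/* Returns true if recovery finished, false if it stalled for stall_limit steps. */
static bool wait_for_recovery(struct recovery_state *rs, int stall_limit)
{
        int last = rs->completed;
        int idle = 0;

        while (rs->completed < rs->total) {
                /* a real implementation would sleep and re-read the recovery
                 * status here; this stub simulates recovery that stops early */
                if (rs->completed < 2)
                        rs->completed++;

                if (rs->completed != last) {    /* progress: reset stall timer */
                        last = rs->completed;
                        idle = 0;
                } else if (++idle >= stall_limit) {
                        return false;           /* stuck: waiting will not help */
                }
        }
        return true;
}

int main(void)
{
        struct recovery_state rs = { .completed = 0, .total = 5 };

        printf("recovery %s\n",
               wait_for_recovery(&rs, 3) ? "finished" : "stalled");
        return 0;
}

If progress keeps being made, the loop keeps waiting (the "waiting longer would be OK" case); once progress stops, giving up promptly exposes the real blocker instead of hiding it behind a longer timeout.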