Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.15.0
-
2.15.0-rc3
-
3
-
9223372036854775807
Description
Hit the same issue twice during soak testing on the same MDS node
trying to connect to soak-10 ... 2022-04-15 00:35:34,865:fsmgmt.fsmgmt:INFO soak-10 is up!!! 2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO Failing over soaked-MDT0002 ... 2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO Mounting soaked-MDT0002 on soak-11 ... 2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO ... soaked-MDT0002 mounted successfully on soak-11 2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO ... soaked-MDT0002 failed over 2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO Wait for recovery to complete 2022-04-15 00:36:43,502:fsmgmt.fsmgmt:DEBUG Recovery Result Record: {'soak-11': {'soaked-MDT0003': 'COMPLETE', 'soaked-MDT0002': 'WAITING'}} 2022-04-15 00:36:43,503:fsmgmt.fsmgmt:INFO soaked-MDT0002 in status 'WAITING'.
on soak-11
[136090.466173] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages [136538.516253] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting [136538.528122] Lustre: Skipped 8 previous similar messages [136570.792064] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process [136570.804339] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36771 previous similar messages [136690.448382] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, exten d: 1 [136690.462972] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages [137139.504191] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting [137139.516055] Lustre: Skipped 10 previous similar messages [137170.777941] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process [137170.790212] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36789 previous similar messages [137180.761466] Lustre: mdt00_016: service thread pid 12073 was inactive for 200.281 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [137180.782540] Pid: 12073, comm: mdt00_016 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Mon Dec 13 22:00:59 UTC 2021 [137180.793815] Call Trace: [137180.796783] [<0>] top_trans_stop+0x882/0xfa0 [ptlrpc] [137180.802560] [<0>] lod_trans_stop+0x25c/0x340 [lod] [137180.808065] [<0>] mdd_trans_stop+0x2e/0x174 [mdd] [137180.813433] [<0>] mdd_create+0x154a/0x1cd0 [mdd] [137180.818740] [<0>] mdo_create+0x46/0x48 [mdt] [137180.823632] [<0>] mdt_create+0xab1/0xe40 [mdt] [137180.828720] [<0>] mdt_reint_create+0x3a0/0x460 [mdt] [137180.834388] [<0>] mdt_reint_rec+0x8a/0x240 [mdt] [137180.839662] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt] [137180.845516] [<0>] mdt_reint+0x67/0x150 [mdt] [137180.850456] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc] [137180.856729] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc] [137180.863875] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc] [137180.869420] [<0>] kthread+0xd1/0xe0 [137180.873423] [<0>] ret_from_fork_nospec_begin+0x21/0x21 [137180.879286] [<0>] 0xfffffffffffffffe [137290.446156] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1 [137290.460730] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages [137575.430810] Lustre: 7111:0:(service.c:1436:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply req@ffff8f5f6cb94380 x1729527267995904/t176094400030(0) o36->f5a20605-7e7a-45f1-bf01-daf0664944ec@192.168.1.118@o2ib:514/0 lens 552/448 e 24 to 0 dl 1649907299 ref 2 fl Interpret:/0/0 rc 0/0 job:'' [137740.477658] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting [137740.489550] Lustre: Skipped 9 previous similar messages [137770.755234] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process [137770.767489] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36738 previous similar messages [137890.442958] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1 [137890.457533] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages [138341.465763] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting [138341.477619] Lustre: Skipped 11 previous similar messages
have to run abort_recovery
[188290.202475] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages [188327.367445] LustreError: 37182:0:(mdt_handler.c:7419:mdt_iocontrol()) soaked-MDT0002: Aborting recovery for device [188327.379129] LustreError: 37182:0:(ldlm_lib.c:2876:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery [188327.390814] Lustre: 43762:0:(ldlm_lib.c:2283:target_recovery_overseer()) recovery is aborted, evict exports in recovery [188327.392289] LustreError: 43760:0:(lod_dev.c:424:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 get update log failed: rc = -11 [188327.418009] Lustre: 43762:0:(ldlm_lib.c:2273:target_recovery_overseer()) soaked-MDT0002 recovery is aborted by hard timeout [188327.486462] Lustre: soaked-MDT0002-osp-MDT0003: Connection restored to 192.168.1.111@o2ib (at 0@lo) [188327.487906] Lustre: soaked-MDT0002: Recovery over after 2739:44, of 22 clients 22 recovered and 0 were evicted. [188345.146528] Lustre: Failing over soaked-MDT0002 [188345.224943] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) soaked-MDT0002: cannot lock for rename: rc = -5 [188345.236924] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) Skipped 1 previous similar message [188345.238945] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.126@o2ib (stopping) [188345.238946] Lustre: Skipped 2 previous similar messages
Attachments
Issue Links
- is related to
-
LU-15937 lctl llog commands do not work for DNE recovery logs
- Open
-
LU-14495 (llog_osd.c:622:llog_osd_write_rec()) LBUG
- Resolved
-
LU-13974 Wrong behavior for out operations create+write especially for update log
- Resolved
-
LU-15645 gap in recovery llog should not be a fatal error
- Resolved
-
LU-15938 MDT recovery did not finish due to corrupt llog record
- Resolved
- is related to
-
LU-11418 hung threads on MDT and MDT won't umount
- Resolved
-
LU-15769 llog_osd_next_block()): invalid llog tail at log id ...
- Resolved
- mentioned in
-
Page Loading...