[LU-9973] MDT recovery status never completed Created: 12/Sep/17  Updated: 19/Sep/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: sebg-crd-pm (Inactive) Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

Hi,

When I re-mount the MDTs, each MDT tries to recover all of its clients, but the recovery status never completes (even after several hours). The recovery does not end even after reaching the hard limit (900). I can only abort it with "lctl --device MDTxxx abort_recovery".
1. Can I abort the recovery process with "mount.lustre -o abort_recov xx"? I cannot run the command successfully (it hangs).
2. Can I evict all stalled clients to end the recovery process?
3. Are there any side effects of aborting recovery with "lctl abort_recovery"?
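
For reference, the commands I am referring to look roughly like this (the device and mount-point names below are placeholders rather than our real ones, and the exact parameter paths may differ by Lustre version):

# check per-target recovery state on the MDS
lctl get_param mdt.hpcfs-MDT0000.recovery_status
# abort recovery on an already-mounted target
lctl --device hpcfs-MDT0000 abort_recovery
# abort recovery at mount time (option name is abort_recov)
mount -t lustre -o abort_recov /dev/mdt0 /mnt/mdt0
# evict one stalled client by NID (placeholder NID)
lctl set_param mdt.hpcfs-MDT0000.evict_client=<client_nid>@o2ib6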

Thanks.

[ 570.019086] LNet: 14185:0:(o2iblnd_cb.c:3209:kiblnd_check_conns()) Timed out tx for 192.168.5.202@o2ib6: 7 seconds
[ 622.166399] Lustre: 18740:0:(ldlm_lib.c:1784:extend_recovery_timer()) hpcfs-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1
[ 622.973445] Lustre: 26593:0:(ldlm_lib.c:1784:extend_recovery_timer()) hpcfs-MDT0001: extended recovery timer reaching hard limit: 900, extend: 1
[ 622.973448] Lustre: 26593:0:(ldlm_lib.c:1784:extend_recovery_timer()) Skipped 2 previous similar messages
[ 682.170260] Lustre: 18740:0:(ldlm_lib.c:1784:extend_recovery_timer()) hpcfs-MDT0000: extended recovery timer reaching hard limit: 900, extend: 1
[ 682.170264] Lustre: 18740:0:(ldlm_lib.c:1784:extend_recovery_timer()) Skipped 2 previous similar messages
[ 684.027392] LNet: 14185:0:(o2iblnd_cb.c:3209:kiblnd_check_conns()) Timed out tx for 192.168.7.202@o2ib8: 7 seconds
[ 684.027397] LNet: 14185:0:(o2iblnd_cb.c:3209:kiblnd_check_conns()) Skipped 3 previous similar messages
[ 693.713019] Lustre: 14236:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1505204590/real 0] req@ffff887c5585ad00 x1578320959384656/t0(0) o38->hpcfs-MDT0000-osp-MDT0001@192.168.8.202@o2ib9:24/4 lens 520/544 e 0 to 1 dl 1505204596 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[ 693.713024] Lustre: 14236:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[ 737.721925] LustreError: 11-0: hpcfs-OST0000-osc-MDT0001: operation ost_connect to node 192.168.17.212@o2ib18 failed: rc = -19



 Comments   
Comment by Peter Jones [ 12/Sep/17 ]

Lai

Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 12/Sep/17 ]

It looks like you may be having some sort of communication problem between MDT0000 and MDT0001? MDS recovery cannot complete with multiple MDTs until they are all available, and it looks like there are timeouts during connect (o38 = MDS_CONNECT):

Request sent has timed out for sent delay: [sent 1505204590/real 0] req@ffff887c5585ad00 x1578320959384656/t0(0) o38->hpcfs-MDT0000-osp-MDT0001@192.168.8.202@o2ib9

What is also a bit strange to me is that you appear to have a large number of different IB networks. Just in the messages posted here, I see o2ib6, o2ib8, o2ib9, and o2ib18 reported. Do you have an extremely large number of clients, several different filesystems, or are you configuring a separate LNet network for each host? That isn't necessarily a source of problems, but it is unusual and increases the chance of an incorrect configuration causing problems.
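
To sanity-check the inter-MDT path, something like the following on the MDS hosting MDT0001 should show whether LNet can reach MDT0000's NID (the NID below is taken from the timeout message above; lnetctl output format varies by version):

# list the LNet NIDs/networks configured on this node
lctl list_nids
lnetctl net show
# check reachability of the peer MDT NID seen in the connect timeout
lctl ping 192.168.8.202@o2ib9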

Comment by sebg-crd-pm (Inactive) [ 19/Sep/17 ]

Thank you for your advice. The MDT recovery status now reaches completed.
Having too many o2ibX networks meant the MDTs spent a long time trying to establish working connections.
There were also too many update logs, which would have taken more than 8 hours to process (the update log rate was too slow, so I eventually cleared the update logs).
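
For anyone hitting the same problem, this is roughly how I watched recovery progress (the parameter pattern and field names are as I remember them from the recovery_status output, so treat them as approximate):

# poll recovery state on all local MDTs until status goes from RECOVERING to COMPLETE
watch -n 10 "lctl get_param mdt.hpcfs-MDT*.recovery_status"
# the connected_clients / completed_clients / evicted_clients counters show
# whether recovery is making progress or just waiting for stale clients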
