[LU-9973] MDT recovery status never completed Created: 12/Sep/17 Updated: 19/Sep/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Minor |
| Reporter: | sebg-crd-pm (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Description |
|
Hi,

When I remount the MDTs, each MDT tries to recover all clients, but the recovery status never reaches completion, even after several hours. Recovery does not end even after it reaches the hard recovery time limit (900 seconds). The only way I can stop it is to abort recovery with "lctl --device MDTxxx abort_recovery".

Thanks.

[ 570.019086] LNet: 14185:0:(o2iblnd_cb.c:3209:kiblnd_check_conns()) Timed out tx for 192.168.5.202@o2ib6: 7 seconds |
| Comments |
| Comment by Peter Jones [ 12/Sep/17 ] |
|
Lai, could you please advise? Thanks, Peter |
| Comment by Andreas Dilger [ 12/Sep/17 ] |
|
It looks like you may be having some sort of communication problem between MDT0000 and MDT0001? With multiple MDTs, MDS recovery cannot complete until all of them are available, and it looks like there are timeouts during connect (o38 = MDS_CONNECT):

Request sent has timed out for sent delay: [sent 1505204590/real 0] req@ffff887c5585ad00 x1578320959384656/t0(0) o38->hpcfs-MDT0000-osp-MDT0001@192.168.8.202@o2ib9

What also seems a bit strange to me is that you appear to have a large number of different IB networks. Just in the messages posted here, I see o2ib6, o2ib8, o2ib9, and o2ib18 reported. Do you have an extremely large number of clients, several different filesystems, or are you configuring a separate LNet network for each host? That isn't necessarily a source of problems, but it is unusual and increases the chance that a misconfiguration is causing problems. |
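As a rough aid for the check described above, the following Python sketch collects the distinct LNet networks that appear in a block of log text. It is not part of Lustre or of the original comment: the function name `networks_in_log` and the regular expression are illustrative assumptions, and it only handles IPv4 NIDs written as `address@network`, as in the messages quoted in this ticket.

```python
import re
from collections import defaultdict

# Assumption: NIDs appear as <IPv4 address>@<network>, e.g. 192.168.5.202@o2ib6.
NID_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})@([a-z0-9]+)\b")


def networks_in_log(text):
    """Return a dict mapping each LNet network name to the set of addresses seen on it."""
    nets = defaultdict(set)
    for addr, net in NID_RE.findall(text):
        nets[net].add(addr)
    return dict(nets)


# The two log fragments quoted in this ticket.
sample = (
    "[ 570.019086] LNet: 14185:0:(o2iblnd_cb.c:3209:kiblnd_check_conns()) "
    "Timed out tx for 192.168.5.202@o2ib6: 7 seconds\n"
    "req@ffff887c5585ad00 x1578320959384656/t0(0) "
    "o38->hpcfs-MDT0000-osp-MDT0001@192.168.8.202@o2ib9\n"
)

if __name__ == "__main__":
    for net, addrs in sorted(networks_in_log(sample).items()):
        print(net, sorted(addrs))
```

Running it over a full console log gives a quick list of which networks and peers show up in the timeout messages, which can then be compared against the intended LNet configuration.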
| Comment by sebg-crd-pm (Inactive) [ 19/Sep/17 ] |
|
Thank you for your advice. I can now get the MDT recovery status to complete. |