Hongchao, is this patch only needed for the case when the remote MDT is not available at all, or is there some other problem like LU-13608 that is causing recovery to be stuck for a long time even when all of the MDTs are available?
Rather than timing out recovery for a remote MDT completely, it would probably be better to keep the recovery for that MDT pending until the MDT is available again, and then do the remote recovery when the MDT reconnects. That might make only a small (or no) difference in the normal case when all of the MDTs are available at mount, but I think it could be a very important improvement when some MDTs are unavailable.
The big improvement would be that, if MDT0000 and some other MDTs are available at restart time, recovery would complete quickly with all of those MDTs and would not block access to files/directories that are on the available MDTs. Most client access would work, and only remote/striped directories would be blocked and/or time out (client processes should still be interruptible with CTRL-C). This would be better for users who mostly use remote directories for subtrees of the filesystem, since only the subtrees on the missing MDTs would be inaccessible.
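To make the proposal concrete, here is a toy model of the idea, not Lustre code. It assumes recovery state is tracked per remote MDT rather than with one global timeout. The names `RecoveryTracker`, `start`, `on_reconnect`, and `can_access` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class RecoveryState(Enum):
    COMPLETE = "complete"  # cross-MDT recovery finished with this target
    PENDING = "pending"    # target unreachable; recovery deferred, not aborted


@dataclass
class RecoveryTracker:
    """Hypothetical model: one recovery state per MDT index."""
    states: dict = field(default_factory=dict)

    def start(self, mdt_indices, available):
        # Recover right away with reachable MDTs; park the others as PENDING
        # instead of timing out recovery for them.
        for idx in mdt_indices:
            if idx in available:
                self.states[idx] = RecoveryState.COMPLETE
            else:
                self.states[idx] = RecoveryState.PENDING

    def on_reconnect(self, idx):
        # A missing MDT came back: run the deferred recovery for it only.
        if self.states.get(idx) is RecoveryState.PENDING:
            self.states[idx] = RecoveryState.COMPLETE

    def can_access(self, stripe_mdts):
        # A directory is usable only if every MDT holding one of its
        # stripes has completed recovery.
        return all(self.states.get(i) is RecoveryState.COMPLETE
                   for i in stripe_mdts)


tracker = RecoveryTracker()
tracker.start(range(4), available={0, 1, 3})  # MDT0002 is down at restart
```

In this model a directory that lives only on MDT0000 is usable immediately, while a directory striped over MDT0000 and MDT0002 stays blocked until `on_reconnect(2)` is called. How the real recovery code would represent the pending state is left open here.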
Eventually, having mirrored entries for ROOT/ on several/all MDTs could allow the filesystem to be accessible even if MDT0000 is unavailable, but that is definitely a separate project.
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41424
Subject: LU-14318 ldlm: don't wait other MDT forever
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4c2b2b26bd29b1a8f522fbf705a3d145cb06eb9f