Details

Type: Bug
Resolution: Unresolved
Priority: Medium
Fix Version/s: None
Affects Version/s: None
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Even with ~~LU-17379~~, mounts with multiple MGS nodes can still take a long time.

The problem is that the MGS_CONNECT requests are still sent out one at a time to the MGS nodes, with a delay of roughly at_min between them. The initial assumption in ~~LU-17379~~ about getting error responses back is also wrong:

"It should be fairly quick to determine if the MGS is not responding on a particular NID, because the client will get a rapid error response (e.g. -ENODEV or -ENOTCONN or -EHOSTUNREACH with a short RPC timeout) so in that case it should try all of the NIDs once quickly. If it gets ETIMEDOUT that might mean the node is unavailable or overloaded, or it might mean the MGS is not running yet, so the client should back off and retry the NIDs with a longer timeout after the initial burst."

No response is sent back from a node that doesn't have the MGS mounted, regardless of whether it has MDTs/OSTs mounted (verified with tcpdump on a client that uses TCP to connect to the servers).

This is easy to reproduce: set up two server nodes using TCP, one with the MGS and the other with an MDT/OST. Set up a client using TCP, start tcpdump, and mount the file system with the MGS listed as the second mgsnode in the mount arguments. Then try again with a higher at_min.
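To make the scaling concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the behaviour described above: connect requests go out strictly one at a time, nodes without the MGS never answer, and the gap between attempts is approximately at_min seconds. The function name and the example at_min values are hypothetical and chosen only for illustration; this is a simplified model, not the client's actual timeout logic.

```python
def delay_before_mgs_contacted(mgs_position: int, at_min: float) -> float:
    """Approximate seconds until the connect request reaches the real MGS.

    mgs_position: 1-based position of the node running the MGS among the
                  mgsnode entries in the mount arguments.
    at_min:       assumed gap in seconds between successive connect attempts.
    """
    if mgs_position < 1:
        raise ValueError("mgs_position is 1-based and must be >= 1")
    if at_min < 0:
        raise ValueError("at_min must be non-negative")
    # Every silent node listed before the MGS costs one full wait.
    return (mgs_position - 1) * at_min


if __name__ == "__main__":
    # Illustrative values only: MGS listed second, then a higher at_min.
    for at_min in (5, 20):
        delay = delay_before_mgs_contacted(mgs_position=2, at_min=at_min)
        print(f"at_min={at_min}s, MGS listed second: ~{delay}s before it is tried")
```

Under this model the wait grows linearly with both at_min and the number of silent nodes listed ahead of the MGS, which matches the reproduction above where raising at_min lengthens the mount.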

Issue Links

is related to

LU-19515 Client mount failed over a network beyond first 32

Open

People

Assignee: WC Triage

Reporter: Åke Sandgren

Votes: 0

Watchers: 1

Dates

Created: 28/Oct/25 6:26 PM

Updated: 2 days ago 5:21 AM