[LU-8045] MDT fails to allow client mounts if one MDT is not connected Created: 19/Apr/16  Updated: 07/Jun/16

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: llnl
Environment:

TOSS 2 (RHEL 6.7 based)
kernel 2.6.32-573.22.1.1chaos.ch5.4.x86_64
Lustre 2.8.0+patches 2.8-llnl-preview1
zfs-0.6.5.4-1.ch5.4.x86_64
1 MGS - separate server
40 MDTs - each on separate server
10 OSTs - each on separate server


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

See LU-8044
With many MDTs, if MDT0000 cannot connect to one of the other MDTs (perhaps only on initial startup; I don't know), MDT0000 appears to ignore connection requests from clients.

It seems as if MDT0000 ought to be able to allow mounts, and the filesystem should simply function without the apparently broken MDT.



 Comments   
Comment by Olaf Faaland [ 19/Apr/16 ]

All the clients are unable to connect to the MDTs; the imports on the client show repeated connection attempts, even though all but one MDT seems to have started normally.

Here is one example:

==> ./mdc/lustre-MDT0001-mdc-ffff880fc4ec5400/state <==
current_state: DISCONN
state_history:
 - [ 1461090634, CONNECTING ]
 - [ 1461090634, DISCONN ]
 - [ 1461090659, CONNECTING ]
 - [ 1461090659, DISCONN ]
 - [ 1461090684, CONNECTING ]
 - [ 1461090684, DISCONN ]
 - [ 1461090709, CONNECTING ]
 - [ 1461090709, DISCONN ]
 - [ 1461090734, CONNECTING ]
 - [ 1461090734, DISCONN ]
 - [ 1461090759, CONNECTING ]
 - [ 1461090759, DISCONN ]
 - [ 1461090784, CONNECTING ]
 - [ 1461090784, DISCONN ]
 - [ 1461090809, CONNECTING ]
 - [ 1461090809, DISCONN ]
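
The repeating CONNECTING/DISCONN pairs at ~25-second intervals are the signature of a client import stuck in a reconnect loop. As a minimal sketch (not part of Lustre; the parser and the `stuck_reconnecting` heuristic are illustrative assumptions), the `state` file format above can be checked mechanically:

```python
import re

# Sample trimmed from the ticket's mdc state output; the real file lives
# under the client's /proc or /sys lustre tree, e.g.
# .../mdc/<fsname>-MDTxxxx-mdc-*/state
SAMPLE = """\
current_state: DISCONN
state_history:
 - [ 1461090634, CONNECTING ]
 - [ 1461090634, DISCONN ]
 - [ 1461090659, CONNECTING ]
 - [ 1461090659, DISCONN ]
 - [ 1461090684, CONNECTING ]
 - [ 1461090684, DISCONN ]
"""

def parse_history(text):
    """Return (current_state, [(unix_time, state), ...])."""
    current = None
    hist = []
    for line in text.splitlines():
        m = re.match(r"current_state:\s*(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"\s*-\s*\[\s*(\d+),\s*(\w+)\s*\]", line)
        if m:
            hist.append((int(m.group(1)), m.group(2)))
    return current, hist

def stuck_reconnecting(current, hist, min_attempts=3):
    """Heuristic: DISCONN now, repeated CONNECTING attempts, never FULL."""
    attempts = sum(1 for _, s in hist if s == "CONNECTING")
    never_full = all(s != "FULL" for _, s in hist)
    return current == "DISCONN" and never_full and attempts >= min_attempts

current, hist = parse_history(SAMPLE)
print(stuck_reconnecting(current, hist))  # prints True
```

A healthy import would instead show a transition to FULL in its history, which the heuristic above treats as "not stuck".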
Comment by Olaf Faaland [ 19/Apr/16 ]

The issue summary I wrote is wrong; it seems to me like it's any MDT, not just MDT0000. I don't have the ability to change ticket summaries, so if one of you Intel folks could fix it, that would be great.

Comment by Peter Jones [ 19/Apr/16 ]

That ok Olaf?

Comment by Olaf Faaland [ 19/Apr/16 ]

Yes, thank you Peter.

Comment by Di Wang [ 20/Apr/16 ]

Well, in the current implementation, the target is only allowed to accept connections (obd_no_conn is set to 0) once prepare succeeds, at the end of server_start_targets(). I am guessing that with disconnected MDTs, the prepare or configuration process is blocked (see server_start_targets()), so clients cannot connect to the MDT. I am not sure how easy this would be to fix. Is this an important issue?
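
The gating described above can be modeled in a few lines. This is a toy sketch, not Lustre source: the class, function names, and return values are illustrative, standing in for the real obd_no_conn check in the connect handler and the flag clear at the end of server_start_targets().

```python
import errno

class ObdDevice:
    """Toy stand-in for a target device; only models the obd_no_conn flag."""
    def __init__(self):
        self.obd_no_conn = True   # refuse connects until startup completes

def target_handle_connect(obd):
    """Return 0 for an accepted connect, -EBUSY while the target is not
    ready; the client keeps retrying, producing the CONNECTING/DISCONN
    loop seen in the state history above."""
    if obd.obd_no_conn:
        return -errno.EBUSY
    return 0

def server_start_targets_done(obd):
    """Models the end of server_start_targets(): only reached once the
    target's prepare/configuration has succeeded."""
    obd.obd_no_conn = False

mdt = ObdDevice()
print(target_handle_connect(mdt))   # prints -16 (still starting)
server_start_targets_done(mdt)
print(target_handle_connect(mdt))   # prints 0 (accepts clients)
```

The point of the model: if anything blocks before the flag is cleared (here, the blocked prepare step), every client connect attempt is refused indefinitely, which matches the observed behavior.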

Comment by Olaf Faaland [ 28/Apr/16 ]

Di,

It looks to me like the code requires that all MDTs successfully connect with each other before any of them will accept connections from clients. Not just the first time they are started, but any time.

If I am correct, then I would say that yes, it is an important issue. Suppose that there is a power outage and all the MDSs go down, and when power is restored one does not come up (not counting MDT0000 which is of course special). Why not accept connections on the MDTs that are up? Depending on how the namespace is distributed across MDTs, it may be possible to do work.

But maybe I'm mistaken about some of that. If so, let me know.

thanks,
Olaf

Comment by Di Wang [ 28/Apr/16 ]
It looks to me like the code requires that all MDTs successfully connect with each other before any of them will accept connections from clients. Not just the first time they are started, but any time.

Actually, it does not require all MDTs to be connected, but it does require that an MDT's config log has been executed before that MDT can accept connection requests. Sorry, I did not make that clear in my last comment.

Suppose that there is a power outage and all the MDSs go down, and when power is restored one does not come up (not counting MDT0000 which is of course special). Why not accept connections on the MDTs that are up? Depending on how the namespace is distributed across MDTs, it may be possible to do work.

Yes, this example does make sense. But if the user knows that one or more MDTs cannot come back, they need to manually deactivate those MDTs on the clients and on the other MDTs (which probably caused the failure in this ticket):

lctl --device xxx-mdc-xxxx deactivate

Then the recovery efforts involving these MDTs will be stopped, the recovering MDTs will be able to accept connections from clients, and of course clients will only be able to access files on the restored MDTs. Sorry again, my last comment may have been unclear.

There are even test cases for this in conf-sanity.sh (70c and 70d); please check. Thanks.
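
Since the deactivation has to be repeated for every dead MDT on every affected node, it is easy to script. A minimal sketch follows; note the device-name pattern is an assumption here (client-side mdc device names also carry a per-mount suffix, e.g. lustre-MDT0001-mdc-ffff880fc4ec5400 in the state output above, and server-side names differ), so the exact names should always be confirmed with lctl dl on each node:

```python
def mdc_deactivate_cmds(fsname, dead_indices):
    """Build 'lctl --device <dev> deactivate' commands for each dead
    MDT index. The '<fsname>-MDT<xxxx>-mdc' base name is an assumed
    pattern; verify actual device names with 'lctl dl' before running."""
    cmds = []
    for idx in sorted(dead_indices):
        dev = "%s-MDT%04x-mdc" % (fsname, idx)   # MDT indices are hex
        cmds.append("lctl --device %s deactivate" % dev)
    return cmds

for cmd in mdc_deactivate_cmds("lustre", {1}):
    print(cmd)   # prints: lctl --device lustre-MDT0001-mdc deactivate
```

This only generates the command strings; running them (and deciding which nodes to run them on) is left to the administrator.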

Generated at Sat Feb 10 02:14:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.