[LU-8045] MDT fails to allow client mounts if one MDT is not connected Created: 19/Apr/16 Updated: 07/Jun/16 |
|
| Status: | In Progress |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
TOSS 2 (RHEL 6.7 based) |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
See Seems as if MDT0000 ought to be able to allow mounts, and the filesystem should simply function without the apparently broken MDT. |
| Comments |
| Comment by Olaf Faaland [ 19/Apr/16 ] |
|
All the clients are unable to connect to the MDTs; the imports on the client show repeated connection attempts, even though all but one MDT seem to have started normally. Here is one example:
==> ./mdc/lustre-MDT0001-mdc-ffff880fc4ec5400/state <==
current_state: DISCONN
state_history:
 - [ 1461090634, CONNECTING ]
 - [ 1461090634, DISCONN ]
 - [ 1461090659, CONNECTING ]
 - [ 1461090659, DISCONN ]
 - [ 1461090684, CONNECTING ]
 - [ 1461090684, DISCONN ]
 - [ 1461090709, CONNECTING ]
 - [ 1461090709, DISCONN ]
 - [ 1461090734, CONNECTING ]
 - [ 1461090734, DISCONN ]
 - [ 1461090759, CONNECTING ]
 - [ 1461090759, DISCONN ]
 - [ 1461090784, CONNECTING ]
 - [ 1461090784, DISCONN ]
 - [ 1461090809, CONNECTING ]
 - [ 1461090809, DISCONN ] |
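For anyone wanting to check the retry cadence in a dump like the one above, here is a minimal Python sketch. It assumes the state text has been copied into a string (only the first six entries are reproduced); the function names `parse_state_history` and `connect_intervals` are made up for illustration and are not part of any Lustre tool.

```python
import re

# First entries of the import state dump quoted in this comment.
STATE_TEXT = """current_state: DISCONN
state_history:
 - [ 1461090634, CONNECTING ]
 - [ 1461090634, DISCONN ]
 - [ 1461090659, CONNECTING ]
 - [ 1461090659, DISCONN ]
 - [ 1461090684, CONNECTING ]
 - [ 1461090684, DISCONN ]
"""

def parse_state_history(text):
    """Return (timestamp, state) tuples for every '[ ts, STATE ]' entry."""
    return [(int(ts), state)
            for ts, state in re.findall(r"\[\s*(\d+),\s*(\w+)\s*\]", text)]

def connect_intervals(history):
    """Return the seconds between consecutive CONNECTING entries."""
    attempts = [ts for ts, state in history if state == "CONNECTING"]
    return [later - earlier for earlier, later in zip(attempts, attempts[1:])]

history = parse_state_history(STATE_TEXT)
print(connect_intervals(history))  # [25, 25]
```

In this excerpt each connection attempt is followed by a disconnect within the same second, and a new attempt follows 25 seconds later.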
| Comment by Olaf Faaland [ 19/Apr/16 ] |
|
The issue summary I wrote is wrong; it seems to me like it's any MDT, not just MDT0000. I don't have the ability to change ticket summaries, so if one of you Intel folk could fix it, that would be great. |
| Comment by Peter Jones [ 19/Apr/16 ] |
|
That ok Olaf? |
| Comment by Olaf Faaland [ 19/Apr/16 ] |
|
Yes, thank you Peter. |
| Comment by Di Wang [ 20/Apr/16 ] |
|
Well, in the current implementation, a target is allowed to accept connections (obd_no_conn is set to 0) only after prepare succeeds at the end of server_start_targets(). My guess is that the disconnected MDTs block the prepare or configuration process (see server_start_targets()), so clients cannot connect to the MDT. I am not sure how easy this is to fix. Is this an important issue? |
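The gating described in this comment can be pictured with a toy Python model. This is not Lustre code: `ToyTarget`, `start`, and `connect` are hypothetical names, and only the `no_conn` attribute is meant to mirror the obd_no_conn flag mentioned above. A prepare step that never succeeds is modelled here as simply returning False.

```python
class ToyTarget:
    """Toy model of a target that refuses clients until startup completes."""

    def __init__(self, name):
        self.name = name
        self.no_conn = True  # stands in for obd_no_conn: refuse connections

    def start(self, prepare):
        """Run the given prepare step; accept connections only if it succeeds."""
        if prepare():
            self.no_conn = False
        return not self.no_conn

    def connect(self, client):
        """Return a message describing whether the client got in."""
        if self.no_conn:
            return f"{client}: refused by {self.name} (startup not finished)"
        return f"{client}: connected to {self.name}"


mdt = ToyTarget("MDT0000")
mdt.start(lambda: False)       # prepare does not succeed
print(mdt.connect("client1"))  # refused
mdt.start(lambda: True)        # prepare succeeds
print(mdt.connect("client1"))  # connected
```

As long as the prepare step does not complete, every client connection is refused, which would match the repeated CONNECTING/DISCONN cycle reported earlier in the ticket if the guess in this comment is right.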
| Comment by Olaf Faaland [ 28/Apr/16 ] |
|
Di, It looks to me like the code requires that all MDTs successfully connect with each other before any of them will accept connections from clients. Not just the first time they are started, but any time. If I am correct, then I would say that yes, it is an important issue. Suppose that there is a power outage and all the MDSs go down, and when power is restored one does not come up (not counting MDT0000 which is of course special). Why not accept connections on the MDTs that are up? Depending on how the namespace is distributed across MDTs, it may be possible to do work. But maybe I'm mistaken about some of that. If so, let me know. thanks, |
| Comment by Di Wang [ 28/Apr/16 ] |
"It looks to me like the code requires that all MDTs successfully connect with each other before any of them will accept connections from clients. Not just the first time they are started, but any time."

Actually, it does not require all MDTs to be connected, but it does require that the config log of an MDT be executed before that MDT can accept connection requests. Sorry, I did not make that clear in my last comment.

"Suppose that there is a power outage and all the MDSs go down, and when power is restored one does not come up (not counting MDT0000 which is of course special). Why not accept connections on the MDTs that are up? Depending on how the namespace is distributed across MDTs, it may be possible to do work."

Yes, this example does make sense. But if the user knows that one or more MDTs cannot come back, the user needs to manually deactivate those MDTs on the clients and on the other MDTs (not doing so is probably what causes the failure in this ticket):

lctl --device xxx-mdc-xxxx deactivate

Then the recovery efforts on those MDTs will be stopped, and the MDTs that are recovering will be able to accept connections from clients. Of course, clients will only be able to access files on the restored MDTs. Sorry again, I might have given unclear information in my last comment. There are even test cases for this in conf-sanity.sh, 70c and 70d; please check. Thanks. |
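When several devices need to be deactivated, the command lines can be generated rather than typed by hand. The Python sketch below only builds the `lctl --device <name> deactivate` strings quoted in this comment and does not run anything. The helper names are hypothetical, and the device name is the client-side one from the state dump earlier in the ticket; device names on the other MDTs may differ, so check them with `lctl dl` first.

```python
import shlex

def deactivate_commands(device_names):
    """Build one 'lctl --device <name> deactivate' argument list per device."""
    return [["lctl", "--device", name, "deactivate"] for name in device_names]

def as_shell(command):
    """Render an argument list as a shell-safe command line."""
    return " ".join(shlex.quote(arg) for arg in command)

# Assumption: this is the import for the MDT that will not come back.
devices = ["lustre-MDT0001-mdc-ffff880fc4ec5400"]

for command in deactivate_commands(devices):
    print(as_shell(command))
```

Printing the commands first leaves room to review them before running them on each client and server.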