[LU-15514] Do not wait for clients to start recovery if there are no clients. Created: 02/Feb/22 Updated: 03/Feb/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
With the idle-disconnect code, the entire cluster can sit idle for some time, and if all the servers restart during that period, recovery on the OSTs does not start because there are no client connections. The MDT connections to the OSTs are rejected because they are considered to be new connections. We need to either accept new MDT connections, similar to what we do when the MDT and OST are colocated on the same node, or start the recovery timer on the first such connection and then proceed with eviction when the timeout expires, so that the MDTs can rejoin as the new clients they are. Failing to do this delays the entire cluster: when the idle-disconnected clients become active again, they have to wait for recovery to finish first, even if the server restart happened long ago. |
| Comments |
| Comment by Andreas Dilger [ 02/Feb/22 ] |
|
IIRC, the MDT->OST connections used to use a fixed UUID like "$fsname-MDT0000_UUID" or similar, so that they could always connect to the OSTs, even during recovery. It's possible that this was changed at some point because of LWP/OUT connections or something else, but this should be investigated. The MDT should not be blocked from connecting during recovery. Whether the MDT connections should trigger the recovery timer is a separate issue. I think they should not; otherwise, if the storage cluster is disconnected from the compute nodes because of a brief switch problem, all clients would be evicted. However, as Oleg noted, if there are no other connections besides the MDT(s), then recovery could finish immediately. |
| Comment by Oleg Drokin [ 03/Feb/22 ] |
|
Yes, an MDT connection alone should not trigger the start of recovery, except when MDT connections are the only type of record in last_rcvd.
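
The rule discussed in the comments above can be sketched in C as follows. This is only an illustration of the proposed policy, not Lustre code: the struct rcvd_summary type, its fields, and the mdt_connect_starts_recovery() function are hypothetical names invented for this sketch, and treating an empty last_rcvd as "nothing to recover" is an assumption.

```c
#include <stdbool.h>

/*
 * Hypothetical summary of the records a restarted OST finds in last_rcvd.
 * Neither the type nor the field names come from the Lustre source tree.
 */
struct rcvd_summary {
	unsigned int client_records;	/* records left by regular clients */
	unsigned int mdt_records;	/* records left by MDT connections */
};

/*
 * Decide whether an incoming MDT connection should start recovery.
 *
 * Policy from the discussion: an MDT connection alone must not start
 * recovery, unless MDT records are the only kind present in last_rcvd.
 * Assumption: with no records at all there is nothing to recover, so
 * the function returns false in that case as well.
 */
bool mdt_connect_starts_recovery(const struct rcvd_summary *s)
{
	return s->mdt_records > 0 && s->client_records == 0;
}
```

With this rule, an OST whose last_rcvd holds only MDT records would begin recovery as soon as an MDT connects, while an OST that still has regular client records would keep waiting for a client connection before starting the timer.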