[LU-4966] handle server registration errors gracefully Created: 28/Apr/14 Updated: 05/Dec/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Niu Yawei (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl |
| Issue Links: |
|
| Rank (Obsolete): | 13736 |
| Description |
|
If a server registers successfully on the MGS but receives an error in the registration reply (for example, because the MGS timed out revoking config locks, or because of other networking problems), the server will always get -EADDRINUSE when it tries to register again, because its server index was already taken on the MGS by the first registration. The current workaround is to use the writeconf option to force registration. We need to improve this so that the MGS handles the situation gracefully. |
| Comments |
| Comment by Niu Yawei (Inactive) [ 28/Apr/14 ] |
|
I think if the MGS saves the server UUID along with the server index, then it can tell whether a registration (a request for an occupied index) comes from the same server. |
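A minimal sketch of that idea, written as a toy Python model rather than actual MGS code. The `IndexTable` class, its `register` method and the sample UUIDs are hypothetical names used only for illustration. An empty UUID is treated as "unknown" here, so a retry without a UUID still fails with -EADDRINUSE.

```python
import errno


class IndexTable:
    """Toy model of an MGS index table (hypothetical, not Lustre code)."""

    def __init__(self):
        # Maps a server index to the UUID of the server that registered it.
        self._owner = {}

    def register(self, index, server_uuid):
        """Return 0 on success or a negative errno value on failure."""
        if index not in self._owner:
            # First registration: remember who took this index.
            self._owner[index] = server_uuid
            return 0
        if server_uuid and self._owner[index] == server_uuid:
            # Same server retrying after a lost or failed reply: accept it.
            return 0
        # Index is held by another server, or no UUID is available to compare.
        return -errno.EADDRINUSE


mgs = IndexTable()
first = mgs.register(0, "a53bc5ba-687b-4091-fb0b-61489785f247")
retry = mgs.register(0, "a53bc5ba-687b-4091-fb0b-61489785f247")
other = mgs.register(0, "00000000-0000-4000-8000-000000000001")
```

In this model the retry from the same UUID succeeds, while a different server asking for the occupied index is still rejected, so forcing registration with writeconf would not be needed for the retry case.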
| Comment by Niu Yawei (Inactive) [ 28/Apr/14 ] |
|
Andreas/Alex, any suggestions? Thanks. |
| Comment by Alex Zhuravlev [ 28/Apr/14 ] |
|
this issue was mentioned by Chris in |
| Comment by Andreas Dilger [ 28/Apr/14 ] |
|
Alex, I think Niu was asking for ideas on how this might best be fixed. |
| Comment by Li Wei (Inactive) [ 25/Nov/14 ] |
|
Niu's idea makes sense to me. I spent some time experimenting with it. A real UUID (e.g., "a53bc5ba-687b-4091-fb0b-61489785f247") could easily be generated by back-end-independent mkfs.lustre code and stored in a back-end-specific way (e.g., in ldiskfs "mountdata" or as a ZFS dataset property). (ZFS has pool IDs, but those are pool properties and are only 64-bit.) Current master code always passes empty strings in mti_uuid. It would be nice if real UUIDs could be packed into that field. However, experiments showed:
A possible solution is:
|
| Comment by Andreas Dilger [ 25/Nov/14 ] |
|
There is space in the last_rcvd file to store the UUID, but that has the potential problem that this file may be deleted if there are problems with recovery. As for the OST detection in the connection code, it would be possible to store the target type in the last byte of the UUID or similar (e.g. the ASCII "O" or "M") and still make the rest of the UUID random. |
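The type-tagging suggestion could look roughly like the following sketch. It assumes the UUID is handled in its 36-character string form and that the "last byte" is the final character of that string; `make_target_uuid` and `target_type_of` are hypothetical helpers, not existing Lustre or mkfs.lustre functions. Note that "O" and "M" are not hexadecimal digits, so the result is no longer a strictly valid RFC 4122 UUID string.

```python
import uuid


def make_target_uuid(target_type):
    """Build a random UUID string whose last character encodes the target type."""
    if target_type not in ("O", "M"):
        raise ValueError("target_type must be 'O' (OST) or 'M' (MDT)")
    random_part = str(uuid.uuid4())  # 36 characters, e.g. 8-4-4-4-12 hex groups
    # Overwrite only the final character; the rest stays random.
    return random_part[:-1] + target_type


def target_type_of(target_uuid):
    """Recover the target type from the tagged UUID string."""
    return target_uuid[-1]


ost_uuid = make_target_uuid("O")
mdt_uuid = make_target_uuid("M")
```

Replacing a single character keeps nearly all of the UUID's randomness while letting the connection code tell an OST from an MDT without an extra lookup.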
| Comment by Sebastien Piechurski [ 03/Nov/16 ] |
|
Was this issue somehow addressed in the latest versions? |