[LU-13834] HSM requests lost after retries (NoRetryAction disabled) Created: 30/Jul/20 Updated: 30/Jul/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Dominique Martinet (Inactive) | Assignee: | Dominique Martinet (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | CEA | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
A bit of setup first: our HPSS lhsm agents have a configuration that tells them to accept up to x restores, y archives, and so on. The number of requests that can be sent by the server, on the other hand, is greater than either of these limits (why make restores wait if the servers are only busy archiving?). There is no knob to set a maximum per type of operation at the coordinator level, so the maximum number of concurrent requests is bigger than what the agents can handle for any single type of operation.
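To make the mismatch concrete, here is a toy model in Python. It is only a sketch: `dispatch`, the action names, and all numbers are hypothetical and chosen for illustration; this is not how the coordinator or the agents are actually implemented.

```python
from collections import Counter


def dispatch(requests, max_requests, agent_limits):
    """Toy model: split a batch of requests into accepted and refused.

    requests      -- action names in the order the coordinator sends them
    max_requests  -- coordinator-wide cap on concurrent requests (all types)
    agent_limits  -- per-action cap enforced on the agent side (hypothetical)
    """
    running = Counter()
    accepted, refused = [], []
    for action in requests:
        if sum(running.values()) >= max_requests:
            break  # the coordinator holds back everything past its global cap
        if running[action] < agent_limits.get(action, 0):
            running[action] += 1
            accepted.append(action)
        else:
            # The global cap still has room, but the agents are full for
            # this action type, so they refuse the request.
            refused.append(action)
    return accepted, refused


# Hypothetical numbers: global cap of 8, agents take at most 4 per type.
accepted, refused = dispatch(
    ["RESTORE"] * 6,
    max_requests=8,
    agent_limits={"RESTORE": 4, "ARCHIVE": 4},
)
print(len(accepted), "accepted,", len(refused), "refused")  # 4 accepted, 2 refused
```

With a single global cap and no per-type cap on the coordinator side, a burst of one request type can exceed what the agents accept for that type even though the global cap is not reached.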
Back in 2.10, we noticed that disabling NoRetryAction was buggy: the request was dropped when it was told to try again, but the coordinator kept its lock on the file, which was pretty horrible. So we kept the setting at its default, and when the coordinator sends a request beyond what the agents accept, the agents refuse it and the request is simply dropped. In 2.10 that was fine: restore requests (e.g. a client read) would just keep retrying, our own HSM helpers also retried, and archives would simply be retried later.
Upon upgrading to 2.12, users frequently complained of seeing "no data available" when reading released files. We noticed that on 2.12, if all servers are busy and a request is refused, the client behaviour apparently changed from a transparent retry to returning the error to userspace, and user codes aren't ready to handle that (despite our efforts to tell them to use our helper...). This led us to re-enable the retry (that is, to disable NoRetryAction): after an audit we were convinced that the problem we had in 2.10 is no longer there in 2.12 (that's why we never opened a ticket for it back then; that issue IS fixed).
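For illustration, a userspace retry wrapper of the kind mentioned above could look roughly like the sketch below. It is not our actual helper: `read_with_retry` and its parameters are hypothetical, and it assumes the "no data available" error reaches the application as an `OSError` with errno `ENODATA`.

```python
import errno
import time


def read_with_retry(read_fn, attempts=5, delay=1.0, sleep=time.sleep):
    """Call read_fn(), retrying with exponential backoff on ENODATA.

    read_fn  -- zero-argument callable performing the read (hypothetical)
    attempts -- total number of tries before giving up
    delay    -- initial wait in seconds, doubled after each failed try
    sleep    -- injectable sleep function, handy for testing
    """
    for attempt in range(1, attempts + 1):
        try:
            return read_fn()
        except OSError as exc:
            # Assumption: a refused restore shows up as ENODATA.
            # Any other error, or the last attempt, is re-raised as-is.
            if exc.errno != errno.ENODATA or attempt == attempts:
                raise
            sleep(delay)
            delay *= 2


def read_file(path):
    """Read a whole file, retrying while the restore is being refused."""
    def _read():
        with open(path, "rb") as f:
            return f.read()
    return read_with_retry(_read)
```

A wrapper like this only helps codes that actually call it, which is the limitation described above: unmodified user codes still see the error directly.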
Now I am seeing some retries happening, but we are still experiencing some trouble:
I've just enabled HSM debug logs on the MDS and will provide more info if I find something.
|
| Comments |
| Comment by Peter Jones [ 30/Jul/20 ] |
|
Dominique, is this something that you plan to investigate? Thanks, Peter |
| Comment by Dominique Martinet (Inactive) [ 30/Jul/20 ] |
|
Hi Peter, I will at least look at the dk log tomorrow and report on that, but I'm not sure I will have time to look further. FYI I am leaving CEA next week (!!), so don't expect too much! Dominique |
| Comment by Peter Jones [ 30/Jul/20 ] |
|
Ok. All the best in your future endeavours! |