HSM - Coordinator - Land to Master
(LU-2061)
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Technical task | Priority: | Blocker |
| Reporter: | Andreas Dilger | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB | ||
| Issue Links: |
| Rank (Obsolete): | 6608 | ||||||||
| Description |
|
The client-side HSM coordinator patches in http://review.whamcloud.com/5029 and http://review.whamcloud.com/5030 were landed, but Oleg realized that there are no client-side limits on the number of concurrent RPCs that can be sent. This could potentially overwhelm the MDS service threads and block all other requests if those threads become blocked handling HSM requests, or if the requests are not processed quickly. Please institute a client-side RPC limit, like cl_max_rpcs_in_flight, but for HSM requests, that imposes some reasonable limit. The ticket is assigned to Jinshan, but only because we cannot currently assign it to someone external. |
| Comments |
| Comment by jacques-charles lafoucriere [ 11/Feb/13 ] |
|
I will work on a patch |
| Comment by John Hammond [ 05/Mar/13 ] |
|
Can the client-side maximum include 0 as a possible value (or even as the default, unless root)? Otherwise, a malicious or accident-prone user can simply issue HSM RPCs from multiple clients: "Hmm, login1 seems wedged. I think I'll kill my ssh session and try this again on login2." Would an MDT-side limit be better? |
| Comment by jacques-charles lafoucriere [ 05/Mar/13 ] |
|
As I understand it, the limit comes from the MDT's capacity to receive RPC requests, so an MDT-side limit would be better. But if the MDT has to count the requests, it will already have received them, so it is too late. The client side is a simple way to limit the load. Can you confirm that you are working on a patch (so that I do not prepare one)? |
| Comment by John Hammond [ 05/Mar/13 ] |
|
I was proposing that the MDT keep a semaphore (as with cl_max_rpcs_in_flight) but that it do a non-blocking down. If the semaphore would block, then the MDT returns -EAGAIN to the client, and the client must wait and retry. I understood that processing some HSM requests would put the MDT thread to sleep until the coordinator responded. Is that correct? I have only seen the stubbed-out version of mdt_hsm.c. Will any of these handlers ever have to wait for tape? In either case (waiting on the coordinator or waiting on tape) I think it must be handled as an unbounded wait by Lustre. I confirm that I will work on a patch. |
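For illustration only, the non-blocking "down" proposed above could look roughly like the sketch below. It is not Lustre code: `hsm_slots`, `hsm_slot_trydown`, `hsm_slot_up`, and `hsm_send_with_retry` are hypothetical names, the counter is single-threaded (a real MDT would need an atomic or locked counter), and the retry loop omits the sleep or backoff a real client would need.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for an MDT-side HSM slot counter (not Lustre code).
 * Single-threaded for clarity; a real server needs atomics or a lock. */
struct hsm_slots {
    int free;   /* slots currently available */
    int max;    /* configured upper bound */
};

static void hsm_slots_init(struct hsm_slots *s, int max)
{
    s->free = max;
    s->max = max;
}

/* Non-blocking "down": take a slot, or fail at once with -EAGAIN. */
static int hsm_slot_trydown(struct hsm_slots *s)
{
    if (s->free == 0)
        return -EAGAIN;
    s->free--;
    return 0;
}

/* "up": give the slot back once the request has been handled. */
static void hsm_slot_up(struct hsm_slots *s)
{
    assert(s->free < s->max);
    s->free++;
}

/* Client side: retry a bounded number of times while the server is full.
 * Returns 0 once a slot was obtained, -EAGAIN if every attempt was refused. */
static int hsm_send_with_retry(struct hsm_slots *server, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        int rc = hsm_slot_trydown(server);

        if (rc != -EAGAIN)
            return rc;
        /* a real client would wait or back off here before retrying */
    }
    return -EAGAIN;
}
```

With two slots, the third concurrent request is refused with -EAGAIN and succeeds only after an earlier request calls `hsm_slot_up`. Note that the server has still received the refused request, which is the objection raised in the previous comment.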
| Comment by jacques-charles lafoucriere [ 05/Mar/13 ] |
|
HSM requests are not blocking; they just record something to do on the MDT, and the restore/archive is done asynchronously by the coordinator. With the use of EAGAIN, the only risk is that slow clients are never served because fast ones are always taking the slots. We need a way to be sure all the clients are making progress through their call lists. |
| Comment by Andreas Dilger [ 06/Mar/13 ] |
|
John, the current RPC throttling mechanism for OSC and MDC RPCs is on the client. While this is not ideal, the problem is indeed that once the server has seen the request, it is too late to throttle it. At this stage, we're just looking for an equivalent to max_rpcs_in_flight for the HSM requests, so that they do not overwhelm the server. |
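As a rough sketch of what a client-side equivalent of max_rpcs_in_flight for HSM requests might do, the model below holds excess requests on the client until an earlier one completes. It is an assumption-laden illustration, not the landed patch: `hsm_throttle` and its functions are hypothetical names, requests are reduced to counters, and there is no locking.

```c
#include <assert.h>

/* Hypothetical client-side throttle (not Lustre code). */
struct hsm_throttle {
    int max_in_flight;  /* hypothetical tunable, analogous to max_rpcs_in_flight */
    int in_flight;      /* requests sent and not yet completed */
    int queued;         /* requests held back on the client */
    int sent_total;     /* requests put on the wire so far */
};

static void hsm_throttle_init(struct hsm_throttle *t, int max)
{
    t->max_in_flight = max;
    t->in_flight = 0;
    t->queued = 0;
    t->sent_total = 0;
}

/* Send queued requests for as long as free slots remain. */
static void hsm_throttle_kick(struct hsm_throttle *t)
{
    while (t->queued > 0 && t->in_flight < t->max_in_flight) {
        t->queued--;
        t->in_flight++;
        t->sent_total++;  /* a real client would send the RPC here */
    }
}

/* Caller wants to issue one HSM request: queue it, send it if allowed. */
static void hsm_throttle_submit(struct hsm_throttle *t)
{
    t->queued++;
    hsm_throttle_kick(t);
}

/* A reply arrived: free the slot and let the next queued request go. */
static void hsm_throttle_complete(struct hsm_throttle *t)
{
    assert(t->in_flight > 0);
    t->in_flight--;
    hsm_throttle_kick(t);
}
```

With a limit of 2, five submissions leave two requests in flight and three queued locally, so the server never sees more than two at once from this client. The limit applies per client only, so it does not address requests issued from several clients at the same time.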
| Comment by John Hammond [ 06/Mar/13 ] |
|
OK, thanks for the clarification. Please see http://review.whamcloud.com/5616. |
| Comment by Peter Jones [ 13/Mar/13 ] |
|
Landed for 2.4 |