HSM - Coordinator - Land to Master (LU-2061)

[LU-2713] limit HSM RPC count from client Created: 30/Jan/13  Updated: 13/Mar/13  Resolved: 13/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Technical task Priority: Blocker
Reporter: Andreas Dilger Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: MB

Issue Links:
Related
is related to LU-2949 ensure MDC RPCs are controlled by max... Closed
Rank (Obsolete): 6608

 Description   

The client-side HSM coordinator patches in http://review.whamcloud.com/5029 and http://review.whamcloud.com/5030 were landed, but Oleg realized that there are no client-side limits on the number of concurrent RPCs that can be sent.

This could potentially overwhelm the MDS service threads and block all other requests if they become blocked handling HSM requests, or if they are not being processed very quickly.

Please institute a client-side limit on concurrent HSM RPCs, similar to cl_max_rpcs_in_flight, set to some reasonable value.

The ticket is assigned to Jinshan, but only because we cannot currently assign it to someone external.



 Comments   
Comment by jacques-charles lafoucriere [ 11/Feb/13 ]

I will work on a patch

Comment by John Hammond [ 05/Mar/13 ]

Can the client side maximum include 0 as a possible value (or even a default, unless root)? Otherwise, a malicious/accident-prone user can simply issue HSM RPCs from multiple clients: "Hmm, login1 seems wedged. I think I'll kill my ssh session and try this again on login2."

Would an MDT side limit be better?

Comment by jacques-charles lafoucriere [ 05/Mar/13 ]

As I understand it, the limit comes from the MDT's capacity to receive RPC requests, so an MDT-side limit would be better. But if the MDT has to count the requests, it will already have received them, so it is too late. The client side is a simple way to limit the load.

Can you confirm that you are working on a patch (so that I will not prepare one)?

Comment by John Hammond [ 05/Mar/13 ]

I was proposing that the MDT keep a semaphore (as with cl_max_rpcs_in_flight) but that it do a non blocking down. If the semaphore would block then it returns -EAGAIN to the client. Then the client must wait and retry.
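The proposal above can be sketched roughly as follows. This is only an illustration in portable C11, not Lustre code: the names hsm_limiter, hsm_limiter_try_enter and hsm_limiter_exit are hypothetical, and an atomic counter stands in for the kernel semaphore. The point it shows is the non-blocking "down": a handler either takes a slot or returns -EAGAIN immediately, and never sleeps waiting for one.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical server-side limiter; illustrative names, not Lustre code. */
struct hsm_limiter {
    atomic_int free_slots;   /* slots currently available */
};

static void hsm_limiter_init(struct hsm_limiter *lim, int max_in_flight)
{
    /* A limit of 0 is allowed and rejects every request. */
    assert(max_in_flight >= 0);
    atomic_init(&lim->free_slots, max_in_flight);
}

/* Non-blocking "down": take a slot or return -EAGAIN, never sleep. */
static int hsm_limiter_try_enter(struct hsm_limiter *lim)
{
    int cur = atomic_load(&lim->free_slots);

    while (cur > 0) {
        /* On failure, cur is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&lim->free_slots, &cur, cur - 1))
            return 0;
    }
    return -EAGAIN;   /* caller replies to the client, which retries later */
}

/* "Up": release the slot once the request has been handled. */
static void hsm_limiter_exit(struct hsm_limiter *lim)
{
    atomic_fetch_add(&lim->free_slots, 1);
}
```

A handler would call hsm_limiter_try_enter() first, return the error to the client if it fails, and call hsm_limiter_exit() when it is done.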

I understood that processing some HSM requests would put the MDT thread to sleep until the coordinator responded. Is that correct? I have only seen the stubbed out version of mdt_hsm.c. Will any of these handlers ever have to wait for tape?

In either case (waiting on the coordinator or waiting on tape) I think it must be handled as an unbounded wait by Lustre.

I confirm that I will work on a patch.

Comment by jacques-charles lafoucriere [ 05/Mar/13 ]

HSM requests are not blocking; they just record something to do on the MDT, and the restore/archive is done asynchronously by the coordinator. With the use of EAGAIN, the only risk is that slow clients are never served because fast ones always take the slots. We need a way to be sure all clients make progress through their call lists.

Comment by Andreas Dilger [ 06/Mar/13 ]

John, the current RPC throttling mechanism for OSC and MDC RPCs is on the client. While this is not ideal, the problem is indeed that once the server has seen the request, it is too late to throttle it.

At this stage, we're just looking for an equivalent to max_rpcs_in_flight for the HSM requests, so they do not overwhelm the server.
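For illustration, a client-side limit of this kind can be sketched in userspace C with a mutex and a condition variable. This is a minimal sketch under stated assumptions, not the patch that was landed: the names rpc_throttle, rpc_throttle_get and rpc_throttle_put are hypothetical, and pthreads stand in for the kernel primitives Lustre actually uses. A sender blocks before the request goes on the wire, so the server never sees more than max_in_flight requests from this client.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical client-side throttle; illustrative names, not Lustre code. */
struct rpc_throttle {
    pthread_mutex_t lock;
    pthread_cond_t  slot_freed;     /* signalled when a request completes */
    int             max_in_flight;  /* configured limit */
    int             in_flight;      /* requests currently outstanding */
};

static void rpc_throttle_init(struct rpc_throttle *t, int max_in_flight)
{
    assert(max_in_flight > 0);
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->slot_freed, NULL);
    t->max_in_flight = max_in_flight;
    t->in_flight = 0;
}

/* Call before sending: blocks until a slot is free, then takes it. */
static void rpc_throttle_get(struct rpc_throttle *t)
{
    pthread_mutex_lock(&t->lock);
    while (t->in_flight >= t->max_in_flight)
        pthread_cond_wait(&t->slot_freed, &t->lock);
    t->in_flight++;
    pthread_mutex_unlock(&t->lock);
}

/* Call when the reply arrives (or the request fails): frees the slot. */
static void rpc_throttle_put(struct rpc_throttle *t)
{
    pthread_mutex_lock(&t->lock);
    assert(t->in_flight > 0);
    t->in_flight--;
    pthread_cond_signal(&t->slot_freed);
    pthread_mutex_unlock(&t->lock);
}
```

Each HSM request would be bracketed by rpc_throttle_get() and rpc_throttle_put(). Because the wait happens on the client, the limit applies per client only; it does not bound the total load from many clients.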

Comment by John Hammond [ 06/Mar/13 ]

OK, thanks for the clarification.

Please see http://review.whamcloud.com/5616.

Comment by Peter Jones [ 13/Mar/13 ]

Landed for 2.4

Generated at Sat Feb 10 01:27:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.