Client eviction on lock callback timeout
(LU-874)
|
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.2.0 |
| Fix Version/s: | None |
| Type: | Technical task | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | llnl | ||
| Rank (Obsolete): | 10219 |
| Description |
|
In For example, to reproduce this workload with 4 clients (A, B, C, D) against 2 OSTs (1, 2): Client ABCDABCDABCD... While this IOR test is running, other tests are also running on different clients to create a very heavy IO load on the OSTs. It may be that DLM locks on the OST are not getting any IO requests sent to refresh the DLM locks:
Some solutions that might help this (individually, or in combination): |
| Comments |
| Comment by Nathan Rutman [ 09/Mar/12 ] |
|
"the high priority request queue on the OST may not be enough to help this if several locks on the client for one OST are canceled at the same time" You mean the HP thread can't handle multiple cancel callbacks before some time out? I was wondering why we don't reserve more threads for HP reqs, or, alternately, limit the number of threads doing any 1 op (i.e. no more than 75% of threads can be doing ldlm ops, and no more than 75% of threads can be doing io ops), so that we "balance" the load a little better and don't get stuck in these corner cases. |
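The per-class thread cap suggested in the comment above could be sketched roughly as follows. This is only a toy admission check, not Lustre's ptlrpc service code: the class `CappedThreadPool`, its methods, and the op-class labels `"io"` and `"ldlm"` are hypothetical names chosen for illustration, and the 75% figure is taken from the comment.

```python
from collections import Counter


class CappedThreadPool:
    """Toy model (hypothetical, not Lustre code): no single request
    class may occupy more than max_share of the service threads."""

    def __init__(self, total_threads, max_share=0.75):
        self.total = total_threads
        # Per-class limit, e.g. 6 of 8 threads at a 75% share.
        self.cap = int(total_threads * max_share)
        self.busy = Counter()  # op class -> threads currently busy

    def try_start(self, op_class):
        """Return True if a thread may start handling this op class."""
        if sum(self.busy.values()) >= self.total:
            return False  # every thread is already busy
        if self.busy[op_class] >= self.cap:
            return False  # this class has reached its share
        self.busy[op_class] += 1
        return True

    def finish(self, op_class):
        """Release a thread that was handling this op class."""
        if self.busy[op_class] > 0:
            self.busy[op_class] -= 1


pool = CappedThreadPool(total_threads=8)
io_admitted = [pool.try_start("io") for _ in range(7)]
# The 7th IO request is refused, leaving two threads free for ldlm work.
ldlm_admitted = [pool.try_start("ldlm") for _ in range(2)]
```

With this cap, a burst of IO requests cannot occupy every thread, so some capacity remains for lock cancel processing (and the reverse).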
| Comment by Christopher Morrone [ 09/Mar/12 ] |
|
Nathan, the issue is that the client is only allowed a fixed number of outstanding RPCs to the OST. Let's call that N. Now let's assume that the OST is processing RPCs very slowly (minutes each), but otherwise operating normally. If the OST revokes N+1 locks from the client now, the client stands a real risk of being evicted. In order to avoid eviction the client must constantly have RPCs enqueued on the server for EACH of the revoked locks. (We fixed some things in This ticket is looking at ways to alleviate the problem from the client side. I do worry that these client-side solutions increase the load on a server that is already heavily loaded. Ultimately, we need to look at making the OST smarter whether or not we decide that client-side changes have value. The OST really needs to assume that if the client is making progress on other revoked locks, then it should extend all lock timers for that client in good faith. |
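The N+1 scenario described above can be illustrated with a small back-of-the-envelope model. It is a sketch under simplifying assumptions, not a description of the real LDLM timers: every RPC takes the same `service_time`, the client fills its in-flight slots in batches, a lock's timer runs until an RPC for that lock is queued on the server, and "good faith" extension is read as the server refreshing all of the client's lock timers each time one of its RPCs completes. The function and parameter names are hypothetical.

```python
def locks_timing_out(num_revoked, max_rpcs_in_flight, service_time,
                     lock_timeout, extend_on_progress=False):
    """Return the indices of revoked locks whose timer would expire
    before the client can get an RPC for them onto the server.

    Toy model (assumptions, not Lustre behaviour): RPCs are sent in
    batches of max_rpcs_in_flight and each batch takes service_time.
    A non-empty result means the client would be evicted.
    """
    timed_out = []
    for lock in range(num_revoked):
        batches_ahead = lock // max_rpcs_in_flight
        # Time this lock waits before one of the client's slots frees up.
        wait = batches_ahead * service_time
        if extend_on_progress:
            # Assumed policy: each completed batch refreshes every timer,
            # so a lock never goes longer than one service_time unrefreshed.
            wait = min(wait, service_time)
        if wait > lock_timeout:
            timed_out.append(lock)
    return timed_out


# N = 8 slots, N+1 = 9 revoked locks, 120 s per RPC, 100 s lock timeout:
# the ninth lock cannot get a slot in time.
n_plus_one = locks_timing_out(9, 8, service_time=120, lock_timeout=100)

# 24 revoked locks, 60 s per RPC: the third batch waits 120 s and times
# out unless the server extends timers when earlier RPCs complete.
no_extension = locks_timing_out(24, 8, service_time=60, lock_timeout=100)
with_extension = locks_timing_out(24, 8, service_time=60, lock_timeout=100,
                                  extend_on_progress=True)
```

In this model, refreshing timers on completion only helps when a single RPC finishes within the lock timeout. If each RPC takes longer than the timeout, the server would need a different signal of progress, such as the client having RPCs queued at all.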
| Comment by Nathan Rutman [ 09/Mar/12 ] |
|
There are a few different issues here; I agree the rpcs_in_flight scenario seems to be one problem, but I was more interested in the limited-server-thread problem (even if it's not causing |
| Comment by Christopher Morrone [ 09/Mar/12 ] |
|
Why wait until you are done? I'd certainly like to be made aware of the problem and progress as you go along in a new ticket. |
| Comment by Nathan Rutman [ 12/Mar/12 ] |
|
It's difficult to track progress in two different places; our primary tracker is our own internal Jira. |
| Comment by Christopher Morrone [ 12/Mar/12 ] |
|
Nathan, it really does the community a disservice to keep your issues secret. Telling us an internal Xyratex ticket number is of no use to us. I can only imagine that working in secret like this would make it more difficult to get patches landed as well. If outside developers aren't tapped into the discussion about the issue all along, it just increases the burden on you to present a complete and detailed explanation of both the problem and the solution. Should there be a disagreement about approach, you may find that you've wasted your time. LLNL has the same issues of dealing with multiple trackers. It is just one that needs to be accepted, I think. We use our internal tracker to discuss and track issues with admins and users, but keep most of the technical discussion in Jira where the world can see it. |
| Comment by Nathan Rutman [ 13/Mar/12 ] |
|
Chris, I appreciate your concerns here. There are good reasons why we must keep our bug tracking system internal: the privacy of our customers; our time tracking and billing systems; our requirement to track non-Lustre bugs as well. |
| Comment by Nathan Rutman [ 22/Mar/12 ] |
|
Xyratex MRP-455 posted in |
| Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ] |
|
close old tickets |