Client eviction on lock callback timeout (LU-874)

[LU-917] shared single file client IO submission with many cores starves OST DLM locks due to max_rpcs_in_flight Created: 13/Dec/11  Updated: 08/Feb/18  Resolved: 08/Feb/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.2.0
Fix Version/s: None

Type: Technical task Priority: Minor
Reporter: Andreas Dilger Assignee: Jinshan Xiong (Inactive)
Resolution: Won't Fix Votes: 0
Labels: llnl

Rank (Obsolete): 10219

 Description   

In LU-874, shared single-file IOR testing was run with 512 threads on 32 clients (16 cores per client), writing 128MB chunks to a file striped over 2 OSTs. This showed clients timing out on DLM locks. The threads on a single client write to disjoint parts of the file (i.e. each thread has its own DLM extent that is not adjacent to the extents written by other threads on that client).

For example, to reproduce this workload with 4 clients (A, B, C, D) against 2 OSTs (1, 2):

Client: A B C D A B C D A B C D ...
OST:    1 2 1 2 1 2 1 2 1 2 1 2 ...

While this IOR test is running, other tests are also running on different clients to create a very heavy IO load on the OSTs.

It may be that some DLM locks on the OST never see the IO requests that would refresh their timeouts (see the sketch after this list):

  • due to the number of active DLM locks on the client for a single OST being larger than the number of RPCs in flight, some locks may be starved of the BRW RPCs needed to refresh their timeout on the OST
  • due to the ordering of BRW requests on the client, all of the pages for a lower-offset extent may be sent to the OST before any pages for a higher-offset extent are sent
  • the high priority request queue on the OST may not be enough to help this if several locks on the client for one OST are canceled at the same time
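
To make the starvation argument concrete, the following stand-alone C sketch (not Lustre code; the lock count, max_rpcs_in_flight, RPCs per extent, OST service time and callback timeout are all hypothetical numbers) estimates how long the highest-offset lock waits for its first BRW RPC when the queue drains in offset order:

/*
 * Illustrative stand-alone sketch (not Lustre code): with more dirty
 * extent locks than max_rpcs_in_flight, and BRW RPCs issued in ascending
 * file-offset order, the highest-offset lock waits for every lower lock's
 * pages to drain first.  All numbers are hypothetical.
 */
#include <stdio.h>

int main(void)
{
	int locks              = 16;    /* dirty extent locks on one OST (one per core) */
	int max_rpcs_in_flight = 8;     /* client RPC limit toward that OST */
	int rpcs_per_lock      = 4;     /* BRW RPCs needed to flush one extent */
	double rpc_secs        = 60.0;  /* assumed OST service time under heavy load */
	double ldlm_timeout    = 100.0; /* assumed lock callback timeout */

	/* RPCs that must complete before the last lock sends its first BRW RPC. */
	int rpcs_ahead = (locks - 1) * rpcs_per_lock;
	/* The client completes max_rpcs_in_flight RPCs every rpc_secs. */
	double wait = (double)rpcs_ahead / max_rpcs_in_flight * rpc_secs;

	printf("first BRW RPC for the last lock is sent after ~%.0f s "
	       "(lock callback timeout %.0f s)%s\n",
	       wait, ldlm_timeout,
	       wait > ldlm_timeout ? " -> lock times out, client is evicted" : "");
	return 0;
}

With these illustrative numbers the last lock waits roughly 450 seconds before its first RPC is even sent, far past the assumed callback timeout, which is consistent with the lock timeouts described above.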

Some solutions that might help this, individually or in combination (a client-side sketch of (2) and (3) follows the list):
1. increase max_rpcs_in_flight to match the core count, but I think this is bad in the long run since it can dramatically increase the number of RPCs that each OST needs to handle at one time
2. always allow at least one BRW RPC in flight for each lock that is being canceled
3. prioritize ALL BRW RPCs for a blocked lock ahead of non-blocked BRW requests (e.g. a high-priority request queue on the client)
4. both (2) and (3) may be needed to avoid starvation as the client core count increases
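
As a rough illustration of proposals (2) and (3), here is a minimal client-side sketch. The types and fields (brw_req, osc_client, lock_blocked and so on) are hypothetical stand-ins rather than the real osc/ptlrpc structures; the only point is the selection policy.

/*
 * Minimal client-side sketch of proposals (2) and (3) above, using
 * hypothetical types and fields; this is not the real osc/ptlrpc code.
 * BRW requests whose covering lock has a blocking callback pending jump
 * the queue, and each such lock may send one RPC past the normal
 * max_rpcs_in_flight limit so it can keep refreshing its timeout.
 */
#include <stdbool.h>
#include <stddef.h>

struct brw_req {                      /* hypothetical queued BRW request */
	struct brw_req *next;
	bool lock_blocked;            /* covering DLM lock has a blocking AST pending */
	bool lock_has_rpc_inflight;   /* that lock already has an RPC in flight */
};

struct osc_client {                   /* hypothetical per-OST client state */
	struct brw_req *queue;        /* pending BRW requests in offset order */
	int rpcs_in_flight;
	int max_rpcs_in_flight;
};

/* Pick the next BRW request to send, or NULL to wait. */
struct brw_req *brw_select_next(struct osc_client *cli)
{
	struct brw_req *req;

	/* (3): requests under a blocked lock go first, like a client-side
	 * high-priority queue.  (2): a blocked lock with no RPC in flight
	 * may send even past max_rpcs_in_flight, so each canceled lock
	 * always keeps at least one RPC moving. */
	for (req = cli->queue; req != NULL; req = req->next) {
		if (!req->lock_blocked)
			continue;
		if (cli->rpcs_in_flight < cli->max_rpcs_in_flight ||
		    !req->lock_has_rpc_inflight)
			return req;
	}

	/* No blocked-lock work eligible; fall back to normal ordering. */
	if (cli->rpcs_in_flight < cli->max_rpcs_in_flight)
		return cli->queue;

	return NULL;
}

A real implementation would also have to update lock_has_rpc_inflight on RPC completion and decide on ordering among multiple blocked locks; the sketch ignores both.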



 Comments   
Comment by Nathan Rutman [ 09/Mar/12 ]

"the high priority request queue on the OST may not be enough to help this if several locks on the client for one OST are canceled at the same time"

You mean the HP thread can't handle multiple cancel callbacks before some time out? I was wondering why we don't reserve more threads for HP reqs, or, alternatively, limit the number of threads doing any one op (i.e. no more than 75% of threads can be doing ldlm ops, and no more than 75% of threads can be doing IO ops), so that we "balance" the load a little better and don't get stuck in these corner cases.
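
For illustration only, a per-op cap along those lines might look like the sketch below; the types are hypothetical (this is not the real ptlrpc service code), and the 75% figure is just the number suggested above.

/*
 * Sketch of a per-op-class thread cap, using hypothetical types; this is
 * not the real ptlrpc service code.  Before a service thread starts on a
 * request, check that its op class would not exceed 75% of the pool, so no
 * single class of request can occupy every thread.
 */
#include <stdbool.h>

enum op_class { OP_LDLM, OP_IO, OP_OTHER, OP_CLASSES };

struct service_pool {                 /* hypothetical service-wide counters */
	int total_threads;
	int active[OP_CLASSES];       /* threads currently busy, per class */
};

/* Return true if a thread may start handling a request of this class. */
bool pool_admit(struct service_pool *svc, enum op_class cls)
{
	/* Other request types (e.g. PING, CONNECT) are not capped in this sketch. */
	if (cls == OP_OTHER)
		return true;

	/* Cap any single class at 75% of the pool to keep some wiggle room. */
	return svc->active[cls] + 1 <= svc->total_threads * 3 / 4;
}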

Comment by Christopher Morrone [ 09/Mar/12 ]

Nathan, the issue is that the client is only allowed a fixed number of outstanding RPCs to the OST. Let's call that N. Now let's assume that the OST is processing RPCs very slowly (minutes each), but otherwise operating normally.

If the OST revokes N+1 locks from the client now, the client stands a real risk of being evicted. In order to avoid eviction the client must constantly have rpcs enqueued on the server for EACH of the revoked locks. (We fixed some things in LU-874 to help make even that work.) Otherwise one of the locks will time out, and the client will be evicted.

This ticket is looking at ways to alleviate the problem from the client side. I do worry that these client side solutions increase the load on a server that is already heavily loaded.

Ultimately, we need to look at making the OST smarter whether or not we decide that client-side changes have value. The OST really needs to assume that if the client is making progress on other revoked locks, then it should extend all lock timers for that client in good faith.
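
As a rough sketch of that "good faith" idea, a server-side prolong pass might look like the following; the types (revoked_lock, client_export) and the helper are hypothetical, not the actual ldlm code.

/*
 * Server-side sketch of the "good faith" idea above, using hypothetical
 * types; this is not the real ldlm code.  Whenever a BRW RPC completes
 * under any lock the OST has revoked from a client, refresh the timers of
 * all revoked locks held by that client, on the theory that the client is
 * clearly still alive and making progress.
 */
#include <time.h>

struct revoked_lock {                 /* hypothetical revoked-lock record */
	struct revoked_lock *next;
	time_t deadline;              /* when the lock callback times out */
};

struct client_export {                /* hypothetical per-client server state */
	struct revoked_lock *revoked; /* locks with blocking ASTs sent */
};

/* Called when a BRW RPC from this client makes progress under any revoked
 * lock: push every revoked-lock deadline out by 'grace' seconds. */
void export_prolong_revoked_locks(struct client_export *exp, int grace)
{
	time_t now = time(NULL);
	struct revoked_lock *lck;

	for (lck = exp->revoked; lck != NULL; lck = lck->next)
		if (lck->deadline < now + grace)
			lck->deadline = now + grace;
}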

Comment by Nathan Rutman [ 09/Mar/12 ]

There are a few different issues here; I agree the rpcs_in_flight scenario seems to be one problem, but I was more interested in the limited-server-thread problem (even if it's not causing LU-874) because it is causing other problems as well. For example, we're tracking a bug (MRP-455) where we experience cascading client evictions because all MDS threads are stuck pending ldlm enqueues, leaving no room for PING or CONNECT RPCs. (That one is a direct result of a mishandled HP queue, but it made me realize we have no "wiggle room" in the code today. As with all our bugs, we'll submit it upstream when we're done.)

Comment by Christopher Morrone [ 09/Mar/12 ]

Why wait until you are done? I'd certainly like to be made aware of the problem and progress as you go along in a new ticket.

Comment by Nathan Rutman [ 12/Mar/12 ]

It's difficult to track progress in two different places; our primary tracker is our own internal Jira.

Comment by Christopher Morrone [ 12/Mar/12 ]

Nathan, it really does the community a disservice to keep your issues secret. Telling us an internal Xyratex ticket number is of no use to us.

I can only imagine that working in secret like this would make it more difficult to get patches landed as well. If outside developers aren't tapped into the discussion about the issue all along, it just increases the burden on you to present a complete and detailed explanation of both the problem and the solution. Should there be a disagreement about approach, you may find that you've wasted your time.

LLNL has the same issue of dealing with multiple trackers. It is just something that needs to be accepted, I think. We use our internal tracker to discuss and track issues with admins and users, but keep most of the technical discussion in Jira where the world can see it.

Comment by Nathan Rutman [ 13/Mar/12 ]

Chris, I appreciate your concerns here. There are good reasons why we must keep our bug tracking system internal: the privacy of our customers; our time tracking and billing systems; our requirement to track non-Lustre bugs as well.
Perhaps something could be set up to automatically mirror Lustre bug comments out to Whamcloud's system. Please email me directly at nathan_rutman@xyratex.com for further discussion on this topic, and let's leave this poor bug alone.

Comment by Nathan Rutman [ 22/Mar/12 ]

Xyratex MRP-455 posted in LU-1239 with patch.

Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ]

close old tickets
