LU-917: shared single file client IO submission with many cores starves OST DLM locks due to max_rpcs_in_flight
(a technical task under LU-874: Client eviction on lock callback timeout)

Details

    • Type: Technical task
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.0, Lustre 2.2.0
    • 10219

    Description

      In LU-874, shared single-file IOR testing was run with 512 threads on 32 clients (16 cores per client) writing 128MB chunks to a file striped over 2 OSTs. This showed clients timing out on DLM locks. The threads on a single client are writing to disjoint parts of the file (i.e. each thread has its own DLM extent that is not adjacent to the extents written by other threads on that client).

      For example, to reproduce this workload with 4 clients (A, B, C, D) against 2 OSTs (1, 2):

      Client  A B C D A B C D A B C D ...
      OST     1 2 1 2 1 2 1 2 1 2 1 2 ...

      While this IOR test is running, other tests are also running on different clients to create a very heavy IO load on the OSTs.

      It may be that some of the DLM locks granted by the OST are not having any IO requests sent under them, so those locks are never refreshed (the toy simulation after this list illustrates the first two points):

      • due to the number of active DLM locks on the client for a single OST being larger than the number of RPCs in flight, some of the locks may be starved: no BRW RPCs are sent under them to the OST, so their lock timeouts are never refreshed
      • due to the IO ordering of the BRW requests on the client, it may be that all of the pages for the lower-offset extent are sent to the OST before the pages for a higher-offset extent are ever sent
      • the high priority request queue on the OST may not be enough to help this if several locks on the client for one OST are canceled at the same time
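
      A back-of-the-envelope check of the first bullet: 512 threads on 32 clients is 16 threads per client, so a client can easily hold 16 or more active extent locks for one OST, while the historical default of max_rpcs_in_flight is only 8. The toy user-space simulation below (not Lustre code; all of the constants are illustrative assumptions) shows how strictly offset-ordered dispatch within that window leaves the higher-offset locks with no BRW traffic long past a 100-second callback timeout:

      /* Toy simulation (not Lustre source) of offset-ordered BRW dispatch with a
       * fixed RPC window.  Assumed, illustrative numbers: 16 dirty extent locks
       * on one client for one OST, max_rpcs_in_flight = 8, 32 BRW RPCs needed
       * per extent, 10 seconds of (slow) server time per window round, and a
       * lock is at risk if no BRW under it completes within a 100s timeout. */
      #include <stdbool.h>
      #include <stdio.h>

      #define NLOCKS              16
      #define MAX_RPCS_IN_FLIGHT   8
      #define RPCS_PER_LOCK       32
      #define ROUND_SECONDS       10
      #define LOCK_TIMEOUT       100

      int main(void)
      {
              int remaining[NLOCKS], last_io[NLOCKS] = { 0 };
              bool warned[NLOCKS] = { false };
              int now = 0;

              for (int i = 0; i < NLOCKS; i++)
                      remaining[i] = RPCS_PER_LOCK;

              for (;;) {
                      int slots = MAX_RPCS_IN_FLIGHT;

                      /* Strict offset order: drain the lowest-offset extent first. */
                      for (int i = 0; i < NLOCKS && slots > 0; i++)
                              while (remaining[i] > 0 && slots > 0) {
                                      remaining[i]--;
                                      last_io[i] = now + ROUND_SECONDS;
                                      slots--;
                              }
                      if (slots == MAX_RPCS_IN_FLIGHT)
                              break;          /* nothing left to send */
                      now += ROUND_SECONDS;

                      for (int i = 0; i < NLOCKS; i++)
                              if (remaining[i] > 0 && !warned[i] &&
                                  now - last_io[i] > LOCK_TIMEOUT) {
                                      warned[i] = true;
                                      printf("t=%3ds: lock %2d starved past the "
                                             "%ds timeout\n", now, i, LOCK_TIMEOUT);
                              }
              }
              return 0;
      }

      Under these assumptions every lock from roughly the third onward trips the timeout before its first BRW is ever sent, which is exactly the starvation pattern described above.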

      Some solutions that might help this, individually or in combination (a client-side dispatch sketch after this list illustrates how (2) and (3) could fit together):
      1. increase max_rpcs_in_flight to match the core count, but I think this is bad in the long run since it can dramatically increase the number of RPCs that each OST needs to handle at one time
      2. always allow at least one BRW RPC in flight for each lock that is being canceled
      3. prioritize ALL BRW RPCs for a blocked lock ahead of non-blocked BRW requests (e.g. like a high-priority request queue on the client)
      4. both (2) and (3) may be needed in order to avoid starvation as the client core count increases
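
      A sketch of how (2) and (3) might combine into a client-side dispatch policy. This is toy user-space code, not the osc layer; the struct, the blocked flag, and the constants are hypothetical stand-ins for the client's per-lock dirty state and blocking-callback state:

      /* Toy dispatch policy (not Lustre source) combining ideas (2) and (3):
       * with a fixed RPC window, first give every lock that is being canceled
       * ("blocked") at least one in-flight BRW, then spend whatever is left on
       * the remaining dirty pages of blocked locks, and only then on unblocked
       * locks in offset order. */
      #include <stdbool.h>
      #include <stdio.h>

      #define NLOCKS              16
      #define MAX_RPCS_IN_FLIGHT   8

      struct dirty_lock {
              int  id;
              int  dirty_rpcs;    /* BRW RPCs still needed under this lock */
              int  in_flight;
              bool blocked;       /* server has sent a blocking callback */
      };

      static int fill_window(struct dirty_lock *l, int n)
      {
              int slots = MAX_RPCS_IN_FLIGHT;

              /* Pass 1 (idea 2): one guaranteed RPC per blocked lock. */
              for (int i = 0; i < n && slots > 0; i++)
                      if (l[i].blocked && l[i].dirty_rpcs > 0 && l[i].in_flight == 0) {
                              l[i].in_flight++;
                              l[i].dirty_rpcs--;
                              slots--;
                      }

              /* Pass 2 (idea 3): remaining slots go to blocked locks first... */
              for (int i = 0; i < n && slots > 0; i++)
                      while (l[i].blocked && l[i].dirty_rpcs > 0 && slots > 0) {
                              l[i].in_flight++;
                              l[i].dirty_rpcs--;
                              slots--;
                      }

              /* ...and only then to unblocked locks in offset order. */
              for (int i = 0; i < n && slots > 0; i++)
                      while (!l[i].blocked && l[i].dirty_rpcs > 0 && slots > 0) {
                              l[i].in_flight++;
                              l[i].dirty_rpcs--;
                              slots--;
                      }

              return MAX_RPCS_IN_FLIGHT - slots;
      }

      int main(void)
      {
              struct dirty_lock locks[NLOCKS];
              int sent;

              for (int i = 0; i < NLOCKS; i++)
                      locks[i] = (struct dirty_lock){ .id = i, .dirty_rpcs = 32,
                                                      .blocked = (i >= 10) };

              sent = fill_window(locks, NLOCKS);
              printf("window filled with %d RPCs:\n", sent);
              for (int i = 0; i < NLOCKS; i++)
                      if (locks[i].in_flight > 0)
                              printf("  lock %2d%s: %d in flight\n", locks[i].id,
                                     locks[i].blocked ? " (blocked)" : "",
                                     locks[i].in_flight);
              return 0;
      }

      With 6 of the 16 locks under cancellation and an 8-RPC window, every blocked lock keeps at least one BRW in flight and the leftover slots still go to blocked locks first; unblocked writeback only proceeds once nothing is being canceled.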

      Attachments

        Activity


          jay Jinshan Xiong (Inactive) added a comment - close old tickets

          nrutman Nathan Rutman added a comment - Xyratex MRP-455 posted in LU-1239 with patch.

          nrutman Nathan Rutman added a comment:

          Chris, I appreciate your concerns here. There are good reasons why we must keep our bug tracking system internal: the privacy of our customers; our time tracking and billing systems; our requirement to track non-Lustre bugs as well.

          Perhaps something could be set up to automatically mirror Lustre bug comments out to Whamcloud's system. Please email me directly (nathan_rutman@xyratex.com) for further discussion on this topic, and let's leave this poor bug alone.

          morrone Christopher Morrone (Inactive) added a comment:

          Nathan, it really does the community a disservice to keep your issues secret. Telling us an internal Xyratex ticket number is of no use to us.

          I can only imagine that working in secret like this would make it more difficult to get patches landed as well. If outside developers aren't tapped into the discussion about the issue all along, it just increases the burden on you to present a complete and detailed explanation of both the problem and the solution. Should there be a disagreement about approach, you may find that you've wasted your time.

          LLNL has the same issue of dealing with multiple trackers. It is just one that needs to be accepted, I think. We use our internal tracker to discuss and track issues with admins and users, but keep most of the technical discussion in Jira where the world can see it.

          nrutman Nathan Rutman added a comment - It's difficult to track progress in two different places; our primary tracker is our own internal Jira.

          morrone Christopher Morrone (Inactive) added a comment - Why wait until you are done? I'd certainly like to be made aware of the problem and progress as you go along in a new ticket.

          nrutman Nathan Rutman added a comment:

          There are a few different issues here; I agree the rpcs_in_flight scenario seems to be one problem, but I was more interested in the limited-server-thread problem (even if it's not causing LU-874), because it is causing other problems as well. For example, we're tracking a bug (MRP-455) where we experience cascading client evictions because all MDS threads are stuck pending ldlm enqueues, leaving no room for PING or CONNECT RPCs. (That one is a direct result of a mishandled HP queue, but it made me realize we have no "wiggle room" in the code today. As with all our bugs, we'll submit it upstream when we're done.)

          morrone Christopher Morrone (Inactive) added a comment:

          Nathan, the issue is that the client is only allowed a fixed number of outstanding RPCs to the OST. Let's call that N. Now let's assume that the OST is processing RPCs very slowly (minutes each), but otherwise operating normally.

          If the OST revokes N+1 locks from the client now, the client stands a real risk of being evicted. In order to avoid eviction the client must constantly have RPCs enqueued on the server for EACH of the revoked locks. (We fixed some things in LU-874 to help make even that work.) Otherwise one of the locks will time out, and the client will be evicted.

          This ticket is looking at ways to alleviate the problem from the client side. I do worry that these client-side solutions increase the load on a server that is already heavily loaded.

          Ultimately, we need to look at making the OST smarter whether or not we decide that the client-side changes have value. The OST really needs to assume that if the client is making progress on other revoked locks, then it should extend all lock timers for that client in good faith.
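
          A minimal sketch of the "good faith" idea above, written as user-space toy code rather than anything from the OST's ldlm code; the structures and the 100-second timeout are assumptions. The point is only the policy: when a BRW under any revoked lock from a client completes, push out the deadlines of all of that client's revoked locks rather than just the one that saw IO:

          /* Toy illustration (not Lustre source): per-client "good faith"
           * prolongation.  Assumption: the server tracks every lock it has
           * revoked from a client and, on any sign of IO progress from that
           * client, extends every timer instead of only the timer of the lock
           * the IO happened to refresh. */
          #include <stdio.h>
          #include <time.h>

          #define NLOCKS        4
          #define LOCK_TIMEOUT  100   /* hypothetical callback timeout, seconds */

          struct revoked_lock {
                  int    id;
                  time_t deadline;    /* when the client would be evicted */
          };

          /* Called when a BRW under *any* revoked lock completes for this client. */
          static void prolong_all(struct revoked_lock *locks, int n, time_t now)
          {
                  for (int i = 0; i < n; i++)
                          locks[i].deadline = now + LOCK_TIMEOUT;
          }

          int main(void)
          {
                  struct revoked_lock locks[NLOCKS];
                  time_t now = time(NULL);

                  for (int i = 0; i < NLOCKS; i++)
                          locks[i] = (struct revoked_lock){ .id = i,
                                                            .deadline = now + LOCK_TIMEOUT };

                  /* Progress observed on lock 0 only; with per-lock refresh,
                   * locks 1..3 would keep their old deadlines and could still
                   * expire.  With the per-client policy every deadline moves. */
                  prolong_all(locks, NLOCKS, now + 90);

                  for (int i = 0; i < NLOCKS; i++)
                          printf("lock %d deadline extended to +%lds\n",
                                 locks[i].id, (long)(locks[i].deadline - now));
                  return 0;
          }

          That way a client that is demonstrably making progress, but is limited by its RPC window, is not evicted for the locks it simply has not reached yet.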

          "the high priority request queue on the OST may not be enough to help this if several locks on the client for one OST are canceled at the same time"

          You mean the HP thread can't handle multiple cancel callbacks before some time out? I was wondering why we don't reserve more threads for HP reqs, or, alternately, limit the number of threads doing any 1 op (i.e. no more than 75% of threads can be doing ldlm ops, and no more than 75% of threads can be doing io ops), so that we "balance" the load a little better and don't get stuck in these corner cases.

          nrutman Nathan Rutman added a comment - "the high priority request queue on the OST may not be enough to help this if several locks on the client for one OST are canceled at the same time" You mean the HP thread can't handle multiple cancel callbacks before some time out? I was wondering why we don't reserve more threads for HP reqs, or, alternately, limit the number of threads doing any 1 op (i.e. no more than 75% of threads can be doing ldlm ops, and no more than 75% of threads can be doing io ops), so that we "balance" the load a little better and don't get stuck in these corner cases.
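
          A toy admission check for the per-operation cap suggested above (hypothetical names and numbers; the real accounting would live in the ptlrpc service code): once one request class would occupy more than 75% of the service threads, further requests of that class wait, while lightweight requests such as PING/CONNECT are never capped:

          /* Toy admission check (not Lustre source) for the "no more than 75%
           * of threads per operation type" idea.  OP_LDLM / OP_IO and the pool
           * sizes are hypothetical stand-ins for a real service's request
           * classes and thread pool. */
          #include <stdbool.h>
          #include <stdio.h>

          enum op_type { OP_LDLM, OP_IO, OP_OTHER, OP_MAX };

          struct svc_pool {
                  int total_threads;
                  int busy[OP_MAX];   /* threads currently serving each class */
          };

          /* Return true if a thread may start serving a request of type @op. */
          static bool may_start(struct svc_pool *p, enum op_type op)
          {
                  int cap = p->total_threads * 3 / 4;   /* 75% per-class cap */

                  if (op == OP_OTHER)         /* PING/CONNECT etc. never capped */
                          return true;
                  return p->busy[op] < cap;
          }

          int main(void)
          {
                  struct svc_pool pool = { .total_threads = 16,
                                           .busy = { [OP_LDLM] = 12 } };

                  /* 12 of 16 threads already in ldlm ops: a 13th ldlm request
                   * waits, but IO and PING/CONNECT requests still find a thread. */
                  printf("ldlm admitted: %d\n", may_start(&pool, OP_LDLM));
                  printf("io   admitted: %d\n", may_start(&pool, OP_IO));
                  return 0;
          }

          With 12 of 16 threads already busy in ldlm ops, a 13th ldlm request is held back while an IO request or a PING still finds a thread, which is the "wiggle room" being asked for.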

          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: adilger Andreas Dilger
            Votes: 0
            Watchers: 3