Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.3.0
    • Affects Version: None
    • Components: None
    • Severity: 3
    • Rank: 4565

    Description

      Recently I found the following scenario, which may lead to cascading client reconnects, lock timeouts, evictions, etc.:

      1. The MDS is overloaded with enqueues; they consume all the threads on the MDS_REQUEST portal.
      2. Some RPC times out on one client, which leads to its reconnection. This client still holds some locks to cancel, and the MDS is waiting for those cancels.
      3. The client sends MDS_CONNECT, but there is no free thread to handle it.
      4. Other clients are waiting for their enqueue completions and ping the MDS to check whether it is still alive. But PING is also sent to the MDS_REQUEST portal and, despite being a high-priority RPC, the service has no special handler for it (srv_hpreq_handler == NULL), so a 2nd thread is not reserved for high-priority RPCs on such services:

      static int ptlrpc_server_allow_normal(struct ptlrpc_service *svc, int force)
      {
      #ifndef __KERNEL__
              if (1) /* always allow to handle normal request for liblustre */
                      return 1;
      #endif
              /* plenty of idle threads: always admit a normal request */
              if (force ||
                  svc->srv_n_active_reqs < svc->srv_threads_running - 2)
                      return 1;

              /* only the last thread is left: keep it for high-priority RPCs */
              if (svc->srv_n_active_reqs >= svc->srv_threads_running - 1)
                      return 0;

              /* second-to-last thread: grant it to a normal request only if a
               * high-priority one is already being served, or if the service
               * never classifies high-priority requests at all
               * (srv_hpreq_handler == NULL) */
              return svc->srv_n_active_hpreq > 0 || svc->srv_hpreq_handler == NULL;
      }
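
      To illustrate, here is a minimal, standalone trace of the predicate above (the thread counts are made up for illustration and struct svc is a stand-in, not the real ptlrpc_service):

      #include <stdio.h>

      /* stand-in for the few ptlrpc_service fields the predicate reads */
      struct svc {
              int   n_active_reqs;    /* requests currently being handled */
              int   threads_running;  /* service threads started */
              int   n_active_hpreq;   /* high-priority requests in handling */
              void *hpreq_handler;    /* NULL on ping-only services */
      };

      static int allow_normal(const struct svc *svc, int force)
      {
              if (force || svc->n_active_reqs < svc->threads_running - 2)
                      return 1;
              if (svc->n_active_reqs >= svc->threads_running - 1)
                      return 0;
              return svc->n_active_hpreq > 0 || svc->hpreq_handler == NULL;
      }

      int main(void)
      {
              /* 126 of 128 threads already busy with enqueues, no HP active */
              struct svc no_handler = { 126, 128, 0, NULL };
              struct svc hp_handler = { 126, 128, 0, (void *)1 };

              /* prints 1: the second-to-last thread goes to yet another
               * normal enqueue, leaving no reserve for incoming pings */
              printf("no handler:   %d\n", allow_normal(&no_handler, 0));
              /* prints 0: with a handler the thread is held in reserve */
              printf("with handler: %d\n", allow_normal(&hp_handler, 0));
              return 0;
      }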
      

      5. With no thread left to handle pings, the other clients' RPCs time out as well.
      6. Once one LDLM lock times out, an enqueue completes and an MDS_CONNECT may be taken into handling. However, this client is likely to still have an enqueue RPC in processing on the MDS, so it gets -EBUSY (sketched below) and will retry only after some delay, whereas other clients try to reconnect and consume the MDS threads with enqueues again. This is being discussed in LU-7, but it is not the main issue here.
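
      For reference, the server-side refusal in item 6 amounts to something like the following (a sketch; treating exp_rpc_count as the per-export count of requests still in processing is an assumption here, not a quote of the connect handler):

      /* sketch: refuse a reconnect while the client's previous requests
       * are still being processed; the client backs off and retries */
      static int reconnect_check(struct obd_export *exp)
      {
              if (atomic_read(&exp->exp_rpc_count) > 0)
                      return -EBUSY;
              return 0;
      }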

      Fixes:
      1) reserve an extra thread on services which expect PINGs to come;
      2) make CONNECTs high-priority RPCs (see the sketch below);
      3) LU-7 to address item 6.
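
      A minimal sketch of what fix 2 amounts to (condensed for illustration: the real srv_hpreq_handler attaches per-request callbacks rather than returning a verdict directly, and the landed change is http://review.whamcloud.com/2355, not this):

      /* hypothetical classifier: treat connects and pings as high
       * priority so the reserved threads can pick them up */
      static int hpreq_check_connect(struct ptlrpc_request *req)
      {
              switch (lustre_msg_get_opc(req->rq_reqmsg)) {
              case MDS_CONNECT:
              case OST_CONNECT:
              case MGS_CONNECT:
              case OBD_PING:
                      return 1;  /* high-priority queue */
              default:
                      return 0;  /* normal queue */
              }
      }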

      Attachments

      Issue Links

      Activity

      [LU-1239] cascading client evictions

            nrutman Nathan Rutman added a comment - Xyratex MRP-455
            spitzcor Cory Spitz added a comment - Chris, yes the patch is suitable for 2.1. Cray initially found this bug on 2.1 and Vitaly developed the fix for Xyratex's 2.1+patches: https://github.com/Xyratex/lustre-stable/commit/afcf3cf1091c67d076ef36dc0d73cd649f84421e

            morrone Christopher Morrone (Inactive) added a comment - Is this patch suitable for 2.1? I think we're seeing the same cascading client evictions there.
            pjones Peter Jones added a comment - Landed for 2.3

            nrutman Nathan Rutman added a comment - sorry, one month. I didn't realize it had gone through some revisions.

            nrutman Nathan Rutman added a comment - This patch has been sitting here for two months with no review - what should we do with it?

            nrutman Nathan Rutman added a comment -

            Lustre 2.1.0 server, Lustre 1.8.6 clients. 4000 nodes simultaneously trying to mkdir -p the same directory in the lustre root dir and create a (distinct) file in that dir.

            > The MDS_CONNECT and OST_CONNECT RPCs should only be high priority if they are reconnects, not if they are initial connects. Please add a check for MSG_CONNECT_RECONNECT instead of just making all CONNECT requests high priority.

            Why can't they all be high priority? They should take 0 time to process on the MDS relative to anything else, and we want a responsive client mount command. And it's simpler code.
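
            For reference, the check asked for above might look like this (an illustrative sketch, not the landed change; lustre_msg_get_op_flags() and MSG_CONNECT_RECONNECT are the existing wire-protocol accessor and flag):

            /* a reconnect carries MSG_CONNECT_RECONNECT in its op flags,
             * an initial connect does not */
            static int connect_is_reconnect(struct ptlrpc_request *req)
            {
                    return !!(lustre_msg_get_op_flags(req->rq_reqmsg) &
                              MSG_CONNECT_RECONNECT);
            }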

            adilger Andreas Dilger added a comment - What version of Lustre hit this problem, and what kind of workload blocked all of the MDS threads?
            vitaly_fertman Vitaly Fertman added a comment - http://review.whamcloud.com/2355

            People

              Assignee: wc-triage WC Triage
              Reporter: vitaly_fertman Vitaly Fertman
              Votes: 0
              Watchers: 6
