[LU-1239] cascading client evictions Created: 20/Mar/12  Updated: 17/Mar/15  Resolved: 25/Jul/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.3.0

Type: Bug Priority: Minor
Reporter: Vitaly Fertman Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7 Reconnect server->client connection Resolved
Severity: 3
Rank (Obsolete): 4565

 Description   

I recently found the following scenario, which may lead to cascading client reconnects, lock timeouts, evictions, etc.

1. The MDS is overloaded with enqueues; they consume all the threads on the MDS_REQUEST portal.
2. An RPC times out on one client, which leads to its reconnection. This client has some locks to cancel, and the MDS is waiting for them.
3. The client sends MDS_CONNECT, but there is no free thread to handle it.
4. Other clients are waiting for their enqueue completions and ping the MDS to check whether it is still alive. PING is also sent to the MDS_REQUEST portal, and despite being a high-priority RPC it has no special handler (srv_hpreq_handler == NULL), so a 2nd thread is not reserved for high-priority RPCs on such services:

static int ptlrpc_server_allow_normal(struct ptlrpc_service *svc, int force)
{
#ifndef __KERNEL__
        if (1) /* always allow to handle normal request for liblustre */
                return 1;
#endif
        /* plenty of spare threads: always admit a normal request */
        if (force ||
            svc->srv_n_active_reqs < svc->srv_threads_running - 2)
                return 1;

        /* down to the last thread: never admit a normal request */
        if (svc->srv_n_active_reqs >= svc->srv_threads_running - 1)
                return 0;

        /* exactly one spare thread left: hand it to a normal request
         * unless a high-priority handler is registered on this service */
        return svc->srv_n_active_hpreq > 0 || svc->srv_hpreq_handler == NULL;
}

5. With no thread to handle pings, other clients' RPCs time out as well.
6. Once one LDLM lock times out, the enqueue completes and an MDS_CONNECT may be taken for handling. However, this client is likely to still have an enqueue RPC in processing on the MDS, so it gets -EBUSY and will retry only after some delay, while other clients try to reconnect and again consume MDS threads with enqueues. This is being discussed in LU-7, but it is not the main issue here.

fixes:
1) reserve an extra thread on services which expect PINGs to come.
2) make CONNECTs high-priority RPCs.
3) LU-7 to address (6).



 Comments   
Comment by Vitaly Fertman [ 20/Mar/12 ]

http://review.whamcloud.com/2355

Comment by Andreas Dilger [ 05/Apr/12 ]

What version of Lustre hit this problem, and what kind of workload blocked all of the MDS threads?

Comment by Nathan Rutman [ 05/Apr/12 ]

Lustre 2.1.0 server, Lustre 1.8.6 clients.
4000 nodes simultaneously trying to mkdir -p the same directory in the lustre root dir and create a (distinct) file in that dir.

{quote}
The MDS_CONNECT and OST_CONNECT RPCs should only be high priority if they are reconnects, not if they are initial connects. Please add a check for MSG_CONNECT_RECONNECT instead of just making all CONNECT requests high priority.
{quote}

Why can't they all be high priority? They should take 0 time to process on the MDS relative to anything else, and we want a responsive client mount command. And it's simpler code.

Comment by Nathan Rutman [ 18/May/12 ]

This patch has been sitting here for two months with no review - what should we do with it?

Comment by Nathan Rutman [ 18/May/12 ]

sorry, one month. I didn't realize it had gone through some revisions.

Comment by Peter Jones [ 25/Jul/12 ]

Landed for 2.3

Comment by Christopher Morrone [ 02/Aug/12 ]

Is this patch suitable for 2.1? I think we're seeing the same cascading client evictions there.

Comment by Cory Spitz [ 03/Aug/12 ]

Chris, yes the patch is suitable for 2.1. Cray initially found this bug on 2.1 and Vitaly developed the fix for Xyratex's 2.1+patches: https://github.com/Xyratex/lustre-stable/commit/afcf3cf1091c67d076ef36dc0d73cd649f84421e.

Comment by Nathan Rutman [ 21/Nov/12 ]

Xyratex MRP-455

Generated at Sat Feb 10 01:14:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.