[LU-18072] Lock cancel resending overwhelms ldlm canceld thread - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.17.0, Lustre 2.15.7
Affects Version/s: Lustre 2.16.0
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

Ever since ~~LU-1565~~ landed it looks like another subtle problem was introduced.

Since LDLM_CANCEL requests don't have any throttling, they could be sent in huge numbers, this is normally not a big problem as processing is relatively lightweight.

But in case there's a resend for such a cancel, suddenly we take this path in ptlrpc_server_check_resend_in_progress() added in ~~LU-793~~:

ptlrpc_server_check_resend_in_progress(struct ptlrpc_request *req)
{
        struct ptlrpc_request *tmp = NULL;        if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                return NULL;        /*
         * This list should not be longer than max_requests in
         * flights on the client, so it is not all that long.
         * Also we only hit this codepath in case of a resent
         * request which makes it even more rarely hit
         */
        list_for_each_entry(tmp, &req->rq_export->exp_reg_rpcs,
                                rq_exp_list) {
                /* Found duplicate one */
                if (tmp->rq_xid == req->rq_xid)
                        goto found;
        }
        list_for_each_entry(tmp, &req->rq_export->exp_hp_rpcs,
                                rq_exp_list) {
                /* Found duplicate one */
                if (tmp->rq_xid == req->rq_xid)

a case was observed at a customer site where this was entered and the exp_hp_rpc lists for multiple client exports were in tens of thousands as mass cancellations (from lru timing out?) commenced which led to these list iterations under exp_rpc_lock taking really long time and now clogging the ldlm_cn threads so no processing wasreally possible, request delays from network acceptance to req_in handler could reach into tens of seconds as the result for all clients (same thread pool) with the expected disastrous results.

I imagine we need to address this from two directions:

server side we really need to avoid taking long time for duplicate request search
client side we already drop unused and actively cancelled ldlm locks, we need to also drop the cancel resends on replay somehow

Attachments

Issue Links

is related to

LU-18077 Do not resend cancel requests over replay boundary

Open

LU-18111 Don't drop expired cancel request

Resolved

LU-18881 MDT overwhelmed by lock cancel requests

Resolved

is related to

LU-793 Reconnections should not be refused when there is a request in progress from this client.

Resolved

LU-1565 lost LDLM_CANCEL RPCs

Resolved

LU-15190 ptlrpc_server_check_resend_in_progress() can miss duplicate RPC

Resolved

(1 is related to )

Activity

People

Assignee:: Oleg Drokin

Reporter:: Oleg Drokin

Votes:: 0 Vote for this issue

Watchers:: 27 Start watching this issue

Dates

Created:: 27/Jul/24 2:02 AM

Updated:: 25/Jul/25 6:08 PM

Resolved:: 25/Apr/25 3:18 PM