Details
- Type: Bug
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: Lustre 2.16.0
Description
Ever since LU-1565 landed, it looks like another subtle problem was introduced.
Since LDLM_CANCEL requests don't have any throttling, they can be sent in huge numbers. This is normally not a big problem, as their processing is relatively lightweight.
But when there is a resend for such a cancel, we suddenly take this path in ptlrpc_server_check_resend_in_progress(), added in LU-793:
ptlrpc_server_check_resend_in_progress(struct ptlrpc_request *req)
{
        struct ptlrpc_request *tmp = NULL;

        if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                return NULL;

        /*
         * This list should not be longer than max_requests in
         * flights on the client, so it is not all that long.
         * Also we only hit this codepath in case of a resent
         * request which makes it even more rarely hit
         */
        list_for_each_entry(tmp, &req->rq_export->exp_reg_rpcs, rq_exp_list) {
                /* Found duplicate one */
                if (tmp->rq_xid == req->rq_xid)
                        goto found;
        }
        list_for_each_entry(tmp, &req->rq_export->exp_hp_rpcs, rq_exp_list) {
                /* Found duplicate one */
                if (tmp->rq_xid == req->rq_xid)
                        goto found;
        }
        ...
A case was observed at a customer site where this path was entered while the exp_hp_rpcs lists for multiple client exports held tens of thousands of requests, as mass cancellations (from the LRU timing out?) commenced. These list iterations under exp_rpc_lock took a really long time and clogged the ldlm_cn threads, so no processing was really possible; as a result, request delays from network acceptance to the req_in handler could reach tens of seconds for all clients (same thread pool), with the expected disastrous results.
I imagine we need to address this from two directions:
- server side, we really need to avoid the duplicate-request search taking a long time (see the sketch after this list)
- client side, we already drop unused and actively cancelled LDLM locks; we also need to somehow drop the cancel resends on replay
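As a rough sketch of the server-side direction (not a tested patch, and assuming a resent cancel can simply be processed again rather than matched to an in-flight original), the duplicate scan could bail out early for LDLM_CANCEL requests, since a cancel has no expensive processing worth attaching the resend to. Everything except the early-exit check follows the excerpt above; the real code reaches a found: label that logs and returns the duplicate, which is condensed here:

        /*
         * Sketch only: skip the duplicate scan for cancels.  Assumption
         * (not verified here): LDLM_CANCEL handling is cheap enough that
         * re-executing a resent cancel is better than walking a possibly
         * huge exp_reg_rpcs/exp_hp_rpcs list under exp_rpc_lock.
         */
        static struct ptlrpc_request *
        ptlrpc_server_check_resend_in_progress(struct ptlrpc_request *req)
        {
                struct ptlrpc_request *tmp;

                if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                        return NULL;

                /* New early exit (sketch): cancels are not worth searching for */
                if (lustre_msg_get_opc(req->rq_reqmsg) == LDLM_CANCEL)
                        return NULL;

                /* existing exp_reg_rpcs / exp_hp_rpcs search as quoted above */
                list_for_each_entry(tmp, &req->rq_export->exp_reg_rpcs, rq_exp_list)
                        if (tmp->rq_xid == req->rq_xid)
                                return tmp;
                list_for_each_entry(tmp, &req->rq_export->exp_hp_rpcs, rq_exp_list)
                        if (tmp->rq_xid == req->rq_xid)
                                return tmp;
                return NULL;
        }

Whether a resent cancel is always safe to just re-execute would need checking (LU-18111, "Don't drop expired cancel request", suggests dropping cancels has its own pitfalls); the sketch only shows where such a filter would sit.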
Attachments
Issue Links
- is related to
  - LU-18077 Do not resend cancel requests over replay boundary (Open)
  - LU-18881 MDT overwhelmed by lock cancel requests (Open)
  - LU-18111 Don't drop expired cancel request (Resolved)
- is related to
  - LU-793 Reconnections should not be refused when there is a request in progress from this client. (Resolved)
  - LU-1565 lost LDLM_CANCEL RPCs (Resolved)
  - LU-15190 ptlrpc_server_check_resend_in_progress() can miss duplicate RPC (Resolved)
They are not processing fast because they have the resent flag set, so instead of being processed, all the ldlm_cn threads are hogging the CPU, fighting for the exp_rpc_lock spinlock for their chance to iterate the incoming-RPC lists searching for duplicates. Available core files show tens of thousands of requests on those lists from a dozen or so clients. This takes a lot of CPU. While the spinlock is separate per client, I imagine those RPCs arrive in batches right after a reconnect.
Dropping them on reconnect server-side certainly makes sense, because at least we quickly trim the list, but we still need sensible resent handling, especially for portals like the cancel portal where it does not help to reconnect resends to old requests.
Personally, I think optimizing the duplicate search is not going to buy us as much as not running it where it makes no sense at all.
The other thing is, considering the duplicate search really only makes sense for requests that are in processing, why don't we have a separate list for such requests? That one is sure to be super short (limited at most by the number of server threads).
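To make that concrete, here is a hypothetical sketch; exp_in_processing_rpcs and ptlrpc_server_mark_in_processing() do not exist in the current code and are invented purely for illustration. The export would keep one extra list holding only the requests a service thread is actively handling (bounded by the thread count), and the resend check would walk that list instead of the full incoming queues:

        /*
         * Hypothetical additions -- none of these names exist today.
         *
         * In struct obd_export, alongside exp_reg_rpcs/exp_hp_rpcs and
         * protected by the same exp_rpc_lock:
         *
         *      struct list_head exp_in_processing_rpcs;
         */

        /* Called when a service thread picks @req up for handling (sketch) */
        static void ptlrpc_server_mark_in_processing(struct ptlrpc_request *req)
        {
                struct obd_export *exp = req->rq_export;

                /* or spin_lock_bh, matching however exp_rpc_lock is taken elsewhere */
                spin_lock(&exp->exp_rpc_lock);
                /* rq_exp_list currently sits on exp_reg_rpcs or exp_hp_rpcs */
                list_move(&req->rq_exp_list, &exp->exp_in_processing_rpcs);
                spin_unlock(&exp->exp_rpc_lock);
        }

        /* Duplicate search now only walks in-processing requests (sketch) */
        static struct ptlrpc_request *
        ptlrpc_server_check_resend_in_progress(struct ptlrpc_request *req)
        {
                struct ptlrpc_request *tmp;

                if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                        return NULL;

                /* assumed to be called with exp_rpc_lock held, like the existing search */
                list_for_each_entry(tmp, &req->rq_export->exp_in_processing_rpcs,
                                    rq_exp_list)
                        if (tmp->rq_xid == req->rq_xid)
                                return tmp;
                return NULL;
        }

Because list_move() keeps rq_exp_list on exactly one list at a time, the existing removal when a request finishes should not need to change; the point of the sketch is only that the searched list is bounded by the number of service threads rather than by however many cancels a client has queued.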
If we go back to LU-793, the whole idea there was that for long-running requests we want to attach resends to the existing processing, so the response is sent as a reply to the resent request and is actually delivered, instead of us retrying the really expensive operation anyway. Doing this for requests that have not yet started processing is counterproductive for two reasons: