Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Affects Version/s: Lustre 2.16.0
- Severity: 3
Description
Ever since LU-1565 landed, it looks like another subtle problem was introduced.
LDLM_CANCEL requests have no throttling, so they can be sent in huge numbers. Normally this is not a big problem, as processing them is relatively lightweight.
But when such a cancel is resent, we suddenly take this path in ptlrpc_server_check_resend_in_progress(), added in LU-793:
    ptlrpc_server_check_resend_in_progress(struct ptlrpc_request *req)
    {
            struct ptlrpc_request *tmp = NULL;

            if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                    return NULL;

            /*
             * This list should not be longer than max_requests in
             * flights on the client, so it is not all that long.
             * Also we only hit this codepath in case of a resent
             * request which makes it even more rarely hit
             */
            list_for_each_entry(tmp, &req->rq_export->exp_reg_rpcs,
                                rq_exp_list) {
                    /* Found duplicate one */
                    if (tmp->rq_xid == req->rq_xid)
                            goto found;
            }
            list_for_each_entry(tmp, &req->rq_export->exp_hp_rpcs,
                                rq_exp_list) {
                    /* Found duplicate one */
                    if (tmp->rq_xid == req->rq_xid)
A case was observed at a customer site where this path was entered while the exp_hp_rpcs lists for multiple client exports held tens of thousands of entries, as mass cancellations (from the LRU timing out?) commenced. These list iterations under exp_rpc_lock took a really long time and clogged the ldlm_cn threads, so almost no processing was possible; as a result, request delays from network acceptance to the req_in handler could reach tens of seconds for all clients (they share the same thread pool), with the expected disastrous results.
I imagine we need to address this from two directions:
- server side, we really need to avoid spending a long time on the duplicate request search
- client side, we already drop unused and actively cancelled ldlm locks; we also need to somehow drop the cancel resends on replay
I tend to agree with Oleg here, but would look at it more broadly: why, in this case, search for dups at all for RPCs without a tag? If we skip those, it is no longer a matter of portals; moreover, the list then becomes really small, as it was originally supposed to be, and therefore the benefit of dropping the incoming RPC queue (Alex's proposal) is very limited and probably not worth bothering with.
IIUC, the main idea of this code was to search for dups so as not to execute a modification RPC twice, which could lead to corruption; that would still be in place. The second benefit was (theoretical, IIUC, not confirmed by tests) to avoid processing heavyweight non-modification RPCs twice. However, in this second case, if we have already sent a reply, we cannot detect that the request is a dup and will handle it twice anyway (relying only on other dropping mechanisms, if applicable); if we have not sent a reply yet, all the file data are (almost always) still in memory, and the second handling is supposed to be relatively fast.
Whereas the original idea (of still detecting such dups for all RPCs) looked reasonable, it resulted in very long lists and completely blocked MDSs; a relatively fast second handling looks much better here than searching for dups among the many other non-modification RPCs (if not all of them).