Details
Type: Bug
Priority: Critical
Resolution: Unresolved
Affects Version: Lustre 2.16.0
Description
Ever since LU-1565 landed, it looks like another subtle problem was introduced.
LDLM_CANCEL requests have no throttling, so they can be sent in huge numbers. Normally this is not a big problem, since processing them is relatively lightweight.
But when such a cancel is resent, we suddenly take this path in ptlrpc_server_check_resend_in_progress(), added in LU-793:
ptlrpc_server_check_resend_in_progress(struct ptlrpc_request *req)
{
        struct ptlrpc_request *tmp = NULL;

        if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                return NULL;

        /*
         * This list should not be longer than max_requests in
         * flights on the client, so it is not all that long.
         * Also we only hit this codepath in case of a resent
         * request which makes it even more rarely hit
         */
        list_for_each_entry(tmp, &req->rq_export->exp_reg_rpcs, rq_exp_list) {
                /* Found duplicate one */
                if (tmp->rq_xid == req->rq_xid)
                        goto found;
        }
        list_for_each_entry(tmp, &req->rq_export->exp_hp_rpcs, rq_exp_list) {
                /* Found duplicate one */
                if (tmp->rq_xid == req->rq_xid)
                        goto found;
        }
        ...
A case was observed at a customer site where this path was entered and the exp_hp_rpcs lists for multiple client exports had grown into the tens of thousands of entries as mass cancellations (from the LRU timing out?) commenced. These list iterations under exp_rpc_lock then took a really long time and clogged the ldlm_cn threads, so no real processing was possible. As a result, request delays from network acceptance to the req_in handler could reach tens of seconds for all clients (same thread pool), with the expected disastrous results.
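To get a feel for why this hurts (purely a back-of-envelope sketch, not taken from the ticket): every resent cancel walks the whole per-export list under exp_rpc_lock, so a burst of N resends against N in-flight requests costs on the order of N^2 serialized comparisons. The standalone C program below models just that access pattern; the list size and all names in it are made up for illustration.

/* Sketch only: N duplicate checks, each walking a linked list of N
 * in-flight requests, as the quoted code does per resend. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node {
        unsigned long xid;
        struct node *next;
};

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        const unsigned long n = 20000;  /* illustrative list length */
        struct node *head = NULL, *p;
        unsigned long hits = 0;
        double t0, t1;

        /* Build a list of n "in-flight" requests. */
        for (unsigned long i = 0; i < n; i++) {
                p = malloc(sizeof(*p));
                p->xid = i;
                p->next = head;
                head = p;
        }

        /* Each "resend" scans the list until its xid is found:
         * O(n) per check, O(n^2) in aggregate. */
        t0 = now_sec();
        for (unsigned long xid = 0; xid < n; xid++) {
                for (p = head; p != NULL; p = p->next) {
                        if (p->xid == xid) {
                                hits++;
                                break;
                        }
                }
        }
        t1 = now_sec();

        printf("%lu duplicate checks over a %lu-entry list: %.2f s, %lu hits\n",
               n, n, t1 - t0, hits);
        return 0;
}

The quadratic term is the real issue: each individual check still looks "lightweight", but the aggregate work, all serialized on one lock and one thread pool, grows with the square of the number of outstanding cancels.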
I imagine we need to address this from two directions:
- server side, we really need to avoid the duplicate-request search taking such a long time (a rough sketch of one option follows below)
- client side, we already drop unused and actively cancelled ldlm locks; we need to also somehow drop the cancel resends on replay
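For the server-side direction, one possible shape (a sketch only; nothing in this ticket prescribes it) is to key the duplicate lookup on rq_xid so a resend check costs one hash probe regardless of how many requests are in flight. The userspace C below illustrates the idea; export_rpcs, rpc_entry and NBUCKETS are hypothetical names, not Lustre's.

/* Sketch only: a per-export hash keyed by xid, so a resend check is O(1)
 * on average instead of walking every in-flight request. */
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 1024u

struct rpc_entry {
        uint64_t xid;            /* analogous to rq_xid */
        struct rpc_entry *next;  /* hash-bucket chain */
};

struct export_rpcs {
        struct rpc_entry *bucket[NBUCKETS];  /* stands in for exp_reg_rpcs/exp_hp_rpcs */
};

static unsigned int xid_hash(uint64_t xid)
{
        return (unsigned int)((xid * 0x9E3779B97F4A7C15ULL) >> 32) % NBUCKETS;
}

/* Register an in-flight request under its xid. */
static void rpcs_add(struct export_rpcs *e, struct rpc_entry *r)
{
        unsigned int h = xid_hash(r->xid);

        r->next = e->bucket[h];
        e->bucket[h] = r;
}

/* Duplicate check: cost is the bucket chain length, not the total
 * number of in-flight requests on the export. */
static struct rpc_entry *rpcs_find(struct export_rpcs *e, uint64_t xid)
{
        struct rpc_entry *r;

        for (r = e->bucket[xid_hash(xid)]; r != NULL; r = r->next)
                if (r->xid == xid)
                        return r;
        return NULL;
}

int main(void)
{
        struct export_rpcs e = { { NULL } };
        struct rpc_entry reqs[3] = { { .xid = 100 }, { .xid = 101 }, { .xid = 102 } };

        for (int i = 0; i < 3; i++)
                rpcs_add(&e, &reqs[i]);

        printf("xid 101 in flight: %s\n", rpcs_find(&e, 101) ? "yes" : "no");
        printf("xid 999 in flight: %s\n", rpcs_find(&e, 999) ? "yes" : "no");
        return 0;
}

In the kernel the same role could be played by an rhashtable or a per-export hlist hashed on rq_xid under the existing exp_rpc_lock; the point is simply that the resend check in ptlrpc_server_check_resend_in_progress() would no longer depend on how many cancels happen to be in flight.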