Lustre / LU-18072

Lock cancel resending overwhelms ldlm canceld thread

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: Lustre 2.16.0

    Description

      Ever since LU-1565 landed it looks like another subtle problem was introduced.

      Since LDLM_CANCEL requests don't have any throttling, they can be sent in huge numbers. This is normally not a big problem, as processing is relatively lightweight.

      But when there's a resend for such a cancel, we suddenly take this path in ptlrpc_server_check_resend_in_progress(), added in LU-793:

      ptlrpc_server_check_resend_in_progress(struct ptlrpc_request *req)
      {
              struct ptlrpc_request *tmp = NULL;

              if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                      return NULL;

              /*
               * This list should not be longer than max_requests in
               * flights on the client, so it is not all that long.
               * Also we only hit this codepath in case of a resent
               * request which makes it even more rarely hit
               */
              list_for_each_entry(tmp, &req->rq_export->exp_reg_rpcs,
                                      rq_exp_list) {
                      /* Found duplicate one */
                      if (tmp->rq_xid == req->rq_xid)
                              goto found;
              }
              list_for_each_entry(tmp, &req->rq_export->exp_hp_rpcs,
                                      rq_exp_list) {
                      /* Found duplicate one */
                      if (tmp->rq_xid == req->rq_xid)
                              goto found;
              }
              ...

      A case was observed at a customer site where this path was entered while the exp_hp_rpcs lists for multiple client exports held tens of thousands of entries, as mass cancellations (from the LRU timing out?) commenced. The list iterations under exp_rpc_lock then took a really long time and clogged the ldlm_cn threads so that no processing was really possible; as a result, request delays from network acceptance to the req_in handler could reach tens of seconds for all clients (same thread pool), with the expected disastrous results.

       

      I imagine we need to address this from two directions:

      1. server side, we really need to avoid taking a long time for the duplicate request search (see the sketch after this list)
      2. client side, we already drop unused and actively cancelled ldlm locks; we need to also drop the cancel resends on replay somehow
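
      One possible shape for the server-side direction (it matches Oleg's later suggestion in the comments of not applying this logic to some services like cancels) is to skip the duplicate scan entirely for requests arriving on the LDLM cancel portal. A minimal sketch, assuming the portal can be read through the request's service partition; the helper name is invented and this is not existing Lustre code:

      /* Sketch only: decide whether the resend-in-progress search is worth
       * doing at all for this request.  ptlrpc_server_check_resend_in_progress()
       * would return NULL early when this returns false. */
      static bool ptlrpc_resend_search_worthwhile(struct ptlrpc_request *req)
      {
              struct ptlrpc_service *svc =
                      req->rq_rqbd->rqbd_svcpt->scp_service;

              /* For LDLM cancels, matching a resend to an in-flight original
               * is of little value, so skip the potentially huge walk of the
               * exp_reg_rpcs/exp_hp_rpcs lists. */
              return svc->srv_req_portal != LDLM_CANCEL_REQUEST_PORTAL;
      }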

      Attachments

        Issue Links

          Activity

            [LU-18072] Lock cancel resending overwhelms ldlm canceld thread
            tappro Mikhail Pershin added a comment - - edited

            Then the question is why they are not being processed quickly but are sitting in the incoming queue. Is canceld overloaded and busy? Or is that the result of an NRS policy that doesn't prioritize cancels properly? Still, I think that if we can't control the flow on the client, then we need to control it on the server: why fill memory with tons of incoming cancels from all clients along with all their resent duplicates? Even if they can be dropped, an OOM on the server is still worse than an -EBUSY response for clients with too many requests waiting.

            Then I'd think about moving to an rhash to find duplicates; that way we would avoid duplicates, with less memory usage as a result, and the search would complete in reasonable time. Sort of a compromise. I still don't have the whole picture for this, only some ideas about possible approaches.
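
            For illustration only, a minimal sketch of what an rhashtable keyed by rq_xid might look like; exp_xid_hash and rq_xid_hnode are hypothetical new fields on the export and the request, not existing Lustre structures:

            #include <linux/rhashtable.h>

            /* hypothetical: parameters for a per-export hash of requests
             * currently being processed, keyed by their xid */
            static const struct rhashtable_params exp_xid_hash_params = {
                    .key_len     = sizeof(__u64),
                    .key_offset  = offsetof(struct ptlrpc_request, rq_xid),
                    .head_offset = offsetof(struct ptlrpc_request, rq_xid_hnode),
            };

            static struct ptlrpc_request *
            ptlrpc_server_check_resend_in_progress(struct ptlrpc_request *req)
            {
                    if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                            return NULL;

                    /* O(1) lookup instead of walking exp_reg_rpcs/exp_hp_rpcs */
                    return rhashtable_lookup_fast(&req->rq_export->exp_xid_hash,
                                                  &req->rq_xid,
                                                  exp_xid_hash_params);
            }

            Requests would be inserted with rhashtable_insert_fast() at the point where they are put on exp_reg_rpcs/exp_hp_rpcs today, and removed with rhashtable_remove_fast() when processing finishes.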

            green Oleg Drokin added a comment -

            plus we used to respect max-rpcs-in-flight on the clients in the past?

            ldlm cancels never had a max RIF; if you have to send a lock cancel and there are no slots, what are you going to do? Wait it out? That's the surest way there is to get evicted for unresponsiveness.

            That's why they go to their own separate portal, so that other requests don't affect cancellations and, the other way around, cancellations don't affect the other requests.


            bzzz Alex Zhuravlev added a comment -

            Server cannot stop the flood, we can stop accepting requests in some cases

            right, the server can reject RPCs. Plus, we used to respect max-rpcs-in-flight on the clients in the past?

            green Oleg Drokin added a comment -

            Server cannot stop the flood; we can stop accepting requests in some cases, though right now it looks like we don't even have a counter of how many requests sit in any particular unprocessed queue, so making that decision is hard.

            tappro Mikhail Pershin added a comment - - edited

            I'd say the server also needs its own protection from such things: not just the capability to serve in a timely manner, but also limits against such a DDoS. We can limit the size of the incoming queue and just answer any new resend with -EBUSY, without any duplication check, if that client export has too many waiting requests, i.e. 'external' flood control for the client on the server.
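
            A minimal sketch of that kind of 'external' flow control; exp_n_queued, the helper, and the limit are invented names, and where the counter gets maintained is an assumption:

            /* hypothetical cap on not-yet-processed requests per export */
            #define EXP_MAX_QUEUED_REQS 1024

            /* returns -EBUSY if this (resent) request should be rejected
             * outright instead of being queued and searched for duplicates */
            static int ptlrpc_export_queue_limit(struct ptlrpc_request *req)
            {
                    struct obd_export *exp = req->rq_export;

                    if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                            return 0;

                    /* exp_n_queued would be bumped when a request enters the
                     * incoming queue for this export and dropped when its
                     * handler finishes */
                    if (atomic_read(&exp->exp_n_queued) > EXP_MAX_QUEUED_REQS)
                            return -EBUSY;

                    return 0;
            }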


            bzzz Alex Zhuravlev added a comment -

            Well, enough clients send their huge RPC batches and it's just a matter of the number of clients before we hit timeout-reconnect-resend, right? Of course it's the server that must stop this flood.

            green Oleg Drokin added a comment -

            While this is indeed not ideal and we definitely need to make client-side improvements, servers should be robust in the face of somewhat adversarial network traffic anyway, so handling it sensibly on the server side is still important.


            bzzz Alex Zhuravlev added a comment -

            IMO, it is not quite right that we essentially let clients DDoS servers due to the lack of any flow control.

            green Oleg Drokin added a comment -

            yes, I agree. That's why I originally proposed not applying this logic to some services like cancels, but that clearly is not going to be enough on its own.

            The idea of dropping the entire queue on reconnect sounds promising here, in addition to that?

            Should we add a counter for such a "totally unprocessed" queue and print something if it gets high? That way we can see whether any other services are potentially affected.
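
            A sketch of such a counter; scp_nreqs_unhandled, the helper, and the threshold are made-up names, hung off the service partition since that is where the incoming queue lives:

            #define SVCPT_UNHANDLED_WARN 10000 /* arbitrary threshold */

            /* hypothetical accounting when a request is accepted from the
             * network but has not yet been through the req_in handler */
            static void ptlrpc_svcpt_count_unhandled(struct ptlrpc_service_part *svcpt)
            {
                    int queued = atomic_inc_return(&svcpt->scp_nreqs_unhandled);

                    if (queued > SVCPT_UNHANDLED_WARN)
                            CWARN("%s: %d incoming requests not yet handled\n",
                                  svcpt->scp_service->srv_name, queued);
            }

            Whatever drains the incoming queue would decrement the counter once the request has been handled or dropped.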


            bzzz Alex Zhuravlev added a comment -

            the original algo was not supposed to handle tens of thousands of requests in flight:

            	/*
            	 * This list should not be longer than max_requests in
            	 * flights on the client, so it is not all that long.
            	 * Also we only hit this codepath in case of a resent
            	 * request which makes it even more rarely hit
            	 */
            

            so we have to adapt it to the new requirements.

            green Oleg Drokin added a comment -

            tappro But does it really make sense to find duplicates in the not-yet-started-processing list anyway? All the duplicate search does is link the old request and the new request so that the client gets the result of serving the old request on the new request's matchbits. Alas, that's actually counterproductive for not-yet-started requests, because once we start processing them they are dropped since the generation is wrong?
            That is the "DROPPING req from old connection" message from ptlrpc_check_req(), and indeed, when we check client logs we see the same xid from the same request being rejected like this after the duplicate was found, so the new request is rejected as if it were old?

             

            I agree that dropping the entire incoming queue on reconnect sounds like a good idea; we should probably file a separate ticket for it?
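
            To make the drop-on-reconnect idea a bit more concrete, a rough sketch; exp_queued_reqs is an invented per-export list of queued-but-unstarted requests (today the incoming queue belongs to the service partition, not the export), rq_exp_list is reused here purely for illustration, and freeing the dropped requests is elided:

            /* hypothetical: on reconnect, throw away queued requests that came
             * in over the previous connection and were never started; once
             * started they would be dropped by ptlrpc_check_req() anyway */
            static void ptlrpc_export_drop_stale_queued(struct obd_export *exp,
                                                        __u32 new_conn_cnt)
            {
                    struct ptlrpc_request *req, *next;
                    LIST_HEAD(stale);

                    spin_lock(&exp->exp_rpc_lock);
                    list_for_each_entry_safe(req, next, &exp->exp_queued_reqs,
                                             rq_exp_list) {
                            if (lustre_msg_get_conn_cnt(req->rq_reqmsg) <
                                new_conn_cnt)
                                    list_move(&req->rq_exp_list, &stale);
                    }
                    spin_unlock(&exp->exp_rpc_lock);

                    /* finish/free everything on 'stale' outside the lock;
                     * elided in this sketch */
            }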


            People

              Assignee: green Oleg Drokin
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 23
