Lustre / LU-18072

Lock cancel resending overwhelms ldlm canceld thread

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.17.0
    • Affects Version: Lustre 2.16.0

    Description

      Ever since LU-1565 landed, it looks like another subtle problem was introduced.

      Since LDLM_CANCEL requests don't have any throttling, they can be sent in huge numbers. Normally this is not a big problem, as processing them is relatively lightweight.

      But when there is a resend for such a cancel, we suddenly take this path in ptlrpc_server_check_resend_in_progress(), added in LU-793:

      static struct ptlrpc_request *
      ptlrpc_server_check_resend_in_progress(struct ptlrpc_request *req)
      {
              struct ptlrpc_request *tmp = NULL;

              if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                      return NULL;

              /*
               * This list should not be longer than max_requests in
               * flights on the client, so it is not all that long.
               * Also we only hit this codepath in case of a resent
               * request which makes it even more rarely hit
               */
              list_for_each_entry(tmp, &req->rq_export->exp_reg_rpcs,
                                  rq_exp_list) {
                      /* Found duplicate one */
                      if (tmp->rq_xid == req->rq_xid)
                              goto found;
              }
              list_for_each_entry(tmp, &req->rq_export->exp_hp_rpcs,
                                  rq_exp_list) {
                      /* Found duplicate one */
                      if (tmp->rq_xid == req->rq_xid)
                              goto found;
              }


      A case was observed at a customer site where this path was entered while the exp_hp_rpcs lists for multiple client exports held tens of thousands of entries, as mass cancellations (from the LRU timing out?) commenced. These list iterations under exp_rpc_lock then took a really long time and clogged the ldlm_cn threads, so no processing was really possible; as a result, request delays from network acceptance to the req_in handler could reach tens of seconds for all clients (same thread pool), with the expected disastrous results.

       

      I imagine we need to address this from two directions:

      1. Server side, we really need to avoid spending a long time on the duplicate-request search.
      2. Client side, we already drop unused and actively cancelled LDLM locks; we also need to somehow drop the cancel resends on replay (see the sketch after this list).
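
      To make direction 2 concrete, here is a rough sketch of one possible client-side approach, modeled on the way ptlrpc_abort_inflight() fails out queued requests on an import. The helper name and its call site (somewhere in the reconnect/replay path) are hypothetical, not existing Lustre code:

      /* Hypothetical helper: instead of replaying queued LDLM_CANCEL
       * requests after a reconnect, fail them out locally - the lock is
       * being cancelled on the client anyway, so nothing is lost.
       * Locking and error handling follow ptlrpc_abort_inflight(). */
      static void ptlrpc_drop_cancel_resends(struct obd_import *imp)
      {
              struct ptlrpc_request *req, *next;

              spin_lock(&imp->imp_lock);
              list_for_each_entry_safe(req, next, &imp->imp_sending_list,
                                       rq_list) {
                      if (lustre_msg_get_opc(req->rq_reqmsg) != LDLM_CANCEL)
                              continue;
                      spin_lock(&req->rq_lock);
                      req->rq_err = 1;
                      req->rq_status = -EIO;
                      ptlrpc_client_wake_req(req);
                      spin_unlock(&req->rq_lock);
              }
              spin_unlock(&imp->imp_lock);
      }

      Whether failing the cancel locally is always safe, rather than re-issuing it once the connection is back, is exactly the open question here, so treat the above purely as an illustration of where such a hook could live.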


          Activity

            green Oleg Drokin added a comment -

            While this is indeed not ideal and we definitely need to make client-side improvements, servers should be robust enough in the face of somewhat adversarial network traffic anyway, so handling it sensibly on the server side is still important.


            bzzz Alex Zhuravlev added a comment -

            IMO, it is not quite right that we essentially let clients DDoS servers, due to the lack of any flow control.

            green Oleg Drokin added a comment -

            Yes, I agree. That is why I originally proposed not applying this logic to some services like cancels, but that clearly is not going to be enough on its own.

            The idea of dropping the entire queue on reconnect sounds promising here, in addition to that?

            Should we add a counter for such a "totally unprocessed" queue and print something if it gets high? That way we can see whether any other services are potentially affected.


            bzzz Alex Zhuravlev added a comment -

            The original algorithm was not supposed to handle tens of thousands of requests in flight:

            	/*
            	 * This list should not be longer than max_requests in
            	 * flights on the client, so it is not all that long.
            	 * Also we only hit this codepath in case of a resent
            	 * request which makes it even more rarely hit
            	 */
            

            so we have to adapt it to the new requirement.

            green Oleg Drokin added a comment -

            tappro But does it really make sense to find duplicates in the not-yet-started list anyway? All the duplicate search does is connect the old request and the new request, so the client will get the result of serving the old request delivered to the new request's matchbits. Alas, that is actually counterproductive for requests that have not started yet, because once we start processing them they are dropped since the generation is wrong ("DROPPING req from old connection" from ptlrpc_check_req()). And indeed, when we check client logs we see the same xid from the same request being rejected like this after the duplicate was found, so the new request is rejected as if it were old?

             

            I agree that dropping the entire incoming queue on reconnect sounds like a good idea; we should probably file a separate ticket for it?


            tappro Mikhail Pershin added a comment -

            Following Oleg's idea of not bothering about old-generation requests, I agree with Alex's comment above, but again - if we are going to check only requests in processing (not waiting) then these lists are not about that. Either we need another list of requests which are exactly in processing (coherent with the inc/dec of exp_rpc_count), or we check the request state/flag to make sure it is in processing.

            bzzz Alex Zhuravlev added a comment - - edited

            Probably we could drop a request from the lists in ptlrpc_server_check_resend_in_progress() if its generation is old and the request is not being processed at the moment? Or move it to another list where the "dropping" could be done locklessly.
            But ideally we should move the whole list upon a generation change.
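
            As an illustration, a fragment of the exp_hp_rpcs walk with such a generation check bolted on could look like the sketch below. The comparison mirrors the one ptlrpc_check_req() uses for "DROPPING req from old connection"; the "not being processed" test and the disposal of the unlinked request are left as comments because that is the part still being discussed:

            /* Sketch only, not a patch: drop queued old-generation
             * requests while scanning for a duplicate xid. */
            struct ptlrpc_request *tmp, *next;

            list_for_each_entry_safe(tmp, next, &req->rq_export->exp_hp_rpcs,
                                     rq_exp_list) {
                    if (lustre_msg_get_conn_cnt(tmp->rq_reqmsg) <
                        req->rq_export->exp_conn_cnt) {
                            /* ... skip if tmp is already being handled ... */
                            list_del_init(&tmp->rq_exp_list);
                            /* ... queue tmp for freeing outside exp_rpc_lock ... */
                            continue;
                    }
                    if (tmp->rq_xid == req->rq_xid)
                            goto found;
            }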


            tappro Mikhail Pershin added a comment -

            That is not quite a correct understanding of what LU-15190 is doing. Its purpose was to prevent the accumulation of resends with the same xid in these request lists checked by ptlrpc_server_check_resend_in_progress(). So while your statement 1) is true, it misses that if exp_rpc_count is zero we add each resend to the list because we don't check for it, so the list grows with duplicated resends and makes future checks more expensive, as there are more items to check. That is why statement 2) is not true - iterating each time makes things better, because it doesn't allow the list to grow with useless duplicates.

            As I was thinking about LU-15190 - it fixes LU-793, which used exp_rpc_count just as a fast way to check that both lists are empty, but that is not the same thing. Both lists may have many requests waiting for processing, while exp_rpc_count is not about that - it is incremented only when a request is taken from the list to be handled. So we might have 10 requests in the list and 1 being processed; once that one finishes, exp_rpc_count becomes 0 and the resend check is not done anymore even though the lists are not empty. So my understanding is that exp_rpc_count can't be used to decide whether we have requests in the lists, and we can just go through these lists directly - if they are empty, that is almost free. The main purpose of LU-15190 is to stop using exp_rpc_count as an emptiness check for the lists, as they are not coherent.

            Note also, about your statement 1) - it is not true that it finds a request being processed. It always finds a request in the list, which holds all incoming requests, so basically it just gets it by XID and never checks its state - whether it is waiting or being processed.

            The situation you have with the overly long check in ptlrpc_server_check_resend_in_progress() is a bit different, I think, and happens because there are many waiting requests in these lists, which is the result of request spamming from the client. Nevertheless, there was also an Alex patch in LU-15190 to use an rhash instead of the list walk, so if we consider such spamming to be allowed client behavior, that is the right way to go - just make the XID search faster.
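
            For reference, the rhash direction could look roughly like the sketch below, using the kernel rhashtable API. The exp_rpc_hash and rq_exp_hash fields are hypothetical additions for illustration (not taken from Alex's actual patch), and if duplicate xids must coexist in the table, rhltable would be needed instead of rhashtable:

            #include <linux/rhashtable.h>

            /* Hypothetical per-export hash of queued/in-flight requests
             * keyed by xid, so a resend lookup is O(1) instead of a list
             * walk under exp_rpc_lock. */
            static const struct rhashtable_params exp_rpc_hash_params = {
                    .key_len     = sizeof(__u64),
                    .key_offset  = offsetof(struct ptlrpc_request, rq_xid),
                    .head_offset = offsetof(struct ptlrpc_request, rq_exp_hash),
            };

            /* On intake, alongside the existing list insertion:
             *      rhashtable_insert_fast(&exp->exp_rpc_hash,
             *                             &req->rq_exp_hash,
             *                             exp_rpc_hash_params);
             * In ptlrpc_server_check_resend_in_progress():
             *      tmp = rhashtable_lookup_fast(&exp->exp_rpc_hash,
             *                                   &req->rq_xid,
             *                                   exp_rpc_hash_params);
             * On completion:
             *      rhashtable_remove_fast(&exp->exp_rpc_hash,
             *                             &req->rq_exp_hash,
             *                             exp_rpc_hash_params);
             */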

            green Oleg Drokin added a comment -

            Here's why I think LU-15190 is wrong and why the crashdump does not demonstrate a problem:

            1. The duplicate-xid-finding code was added as part of LU-793 to find a request that is already being processed and connect the new incoming resend to it, so the client will get a reply into the new buffer when the current processing ends. If zero requests are currently being processed, then there is nothing to reconnect, so looking for a duplicate is a waste of effort.

            2. Sure, if there is any RPC being processed and we get a resend, we will still iterate over the list, and it might be slow if the list is long, but iterating every time certainly does not make it better. I think we might want to add extra checks to exclude certain services like canceld from this check entirely, but that does not address item #1: if nothing is being processed, there is no need to match the xid.

             

            4. We cannot assume a well-behaved client, so we have to have safeguards on the server.


            bzzz Alex Zhuravlev added a comment -

            I disagree that LU-15190 is the root cause:
            1) LU-15190 is supposed to fix a real problem, as demonstrated in the crash dump.
            2) the original check of exp_rpc_count is an optimization which doesn't work when any RPC on a given export is being processed.
            3) you say that an enormous number of cancels was sent, so if we can't handle them instantly, all those cancels will be on the list anyway and we will have to scan for dups anyway.
            4) I think throttling on the client side is what we really need.

            green Oleg Drokin added a comment -

            I guess on the server side we can just introduce a counter in the list iteration and, once it reaches some max-RIF value, stop iterating.

            All the requests that would benefit from such iteration are bounded by max requests in flight, so there cannot be too many of them, and as such continuity won't matter if the count is exceeded. This is probably better than just hardcoding a particular request id.
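
            A minimal sketch of that capped scan, as a drop-in variant of the existing exp_hp_rpcs loop (the limit name is a placeholder, not an existing constant):

            /* Bound the duplicate-xid scan so a flooded export cannot
             * stall the service thread.  Requests that legitimately need
             * the resend reconnection are limited by the client's max
             * requests in flight, so giving up after that many entries
             * loses nothing important. */
            int scanned = 0;

            list_for_each_entry(tmp, &req->rq_export->exp_hp_rpcs,
                                rq_exp_list) {
                    if (tmp->rq_xid == req->rq_xid)
                            goto found;
                    if (++scanned > MAX_RIF_LIMIT)  /* placeholder cap */
                            break;
            }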

            Alternatively, we can just have a flag in the service portal that would instruct the intake code not to run this check where it does not matter (e.g. definitely for the canceld threads).

            Additionally, it looks like this problem was introduced by LU-15190? Alex even demonstrated the case of duplicate xids in that ticket with LDLM cancel requests (opc 103), which don't need any of this handling. This code is mostly for the case of "a long-running request causes a client timeout and a resend of the request, and we need to reply into the new request once we are finally done" (LU-793). If we have a duplicate xid but neither request has started processing - that's totally fine: ptlrpc_check_req will notice the old request from the old connection and will drop it on the floor, so no problem here. bzzz tappro - please verify my understanding here?

             

            Now, on the client side I don't see any easy, non-invasive way to drop certain resends across the recovery boundary.


            People

              Assignee: Oleg Drokin
              Reporter: Oleg Drokin
              Votes: 0
              Watchers: 27
