LU-18072: Lock cancel resending overwhelms ldlm canceld thread

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: Lustre 2.16.0
    • Severity: 3

    Description

      Ever since LU-1565 landed it looks like another subtle problem was introduced.

      Since LDLM_CANCEL requests don't have any throttling, they can be sent in huge numbers. This is normally not a big problem, as their processing is relatively lightweight.

      But when such a cancel is resent, we suddenly take this path in ptlrpc_server_check_resend_in_progress(), added in LU-793:

      static struct ptlrpc_request *
      ptlrpc_server_check_resend_in_progress(struct ptlrpc_request *req)
      {
              struct ptlrpc_request *tmp = NULL;

              if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                      return NULL;

              /*
               * This list should not be longer than max_requests in
               * flights on the client, so it is not all that long.
               * Also we only hit this codepath in case of a resent
               * request which makes it even more rarely hit
               */
              list_for_each_entry(tmp, &req->rq_export->exp_reg_rpcs,
                                      rq_exp_list) {
                      /* Found duplicate one */
                      if (tmp->rq_xid == req->rq_xid)
                              goto found;
              }
              list_for_each_entry(tmp, &req->rq_export->exp_hp_rpcs,
                                      rq_exp_list) {
                      /* Found duplicate one */
                      if (tmp->rq_xid == req->rq_xid)
                              goto found;
              }
              return NULL;

      found:
              DEBUG_REQ(D_HA, req, "Found duplicate req in processing");
              DEBUG_REQ(D_HA, tmp, "Request being processed");
              return tmp;
      }


      A case was observed at a customer site where this path was entered while the exp_hp_rpcs lists of multiple client exports held tens of thousands of requests, as mass cancellations (from the LRU timing out?) commenced. These list iterations under exp_rpc_lock took a really long time and clogged the ldlm_cn threads, so almost no processing was possible; request delays from network acceptance to the req_in handler could reach tens of seconds for all clients (same thread pool), with the expected disastrous results.

       

      I imagine we need to address this from two directions:

      1. on the server side we really need to avoid spending a long time on the duplicate request search (one possible shape of this is sketched right after this list)
      2. on the client side we already drop unused and actively cancelled ldlm locks; we also need to somehow drop the cancel resends on replay
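
      For illustration, a minimal sketch of the server-side direction, not the landed patch: a resent LDLM cancel is harmless to execute twice, so the duplicate search could simply be skipped for that opcode (the opcode 103 seen as "o103" in the logs below is LDLM_CANCEL). lustre_msg_get_opc() and LDLM_CANCEL are existing identifiers; the exact placement is an assumption.

              /* Sketch only: resent cancels are idempotent and cheap to handle
               * again, so don't pay for a walk over a potentially huge
               * exp_hp_rpcs list under exp_rpc_lock.  This would go right
               * after the MSG_RESENT check in
               * ptlrpc_server_check_resend_in_progress().
               */
              if (lustre_msg_get_opc(req->rq_reqmsg) == LDLM_CANCEL)
                      return NULL;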

      Attachments

        Issue Links

          Activity

            [LU-18072] Lock cancel resending overwhelms ldlm canceld thread

            "Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56843
            Subject: LU-18072 ptlrpc: do not search for duplicate cancel requests
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5d88c8d89c940366aebfa96b0a47c5677a24e071

            gerrit Gerrit Updater added a comment

            adilger Andreas Dilger added a comment -

            Does Andriy's patch https://review.whamcloud.com/55946 "LU-18111 ptlrpc: don't drop expired cancel request" help this issue at all?


            vitaly_fertman Vitaly Fertman added a comment -

            "I think the whole duplicate xid searching is a waste for the case of RPCs like cancel"

            I tend to agree with Oleg here, but would look at it more broadly - why, in this case, search for dups at all for RPCs without a tag? If we do it that way, it is not a matter of portals. Moreover, in this case the list becomes really small - as it was supposed to be originally - and therefore the benefit of dropping the incoming RPC queue (Alex's proposal) is very limited, and it is probably not worth bothering with.

            IIUC, the main idea of this code was to search for dups so as not to execute a modification RPC twice, which could lead to corruption; that would still be in place. The second benefit was (theoretical, IIUC, not confirmed by tests) to not let heavyweight non-modification RPCs be processed twice. However, in this second case, if we have already sent a reply we are not able to detect that the request is a dup and would handle it twice anyway (relying only on other dropping mechanisms, if applicable); if we have not sent a reply yet, all the file data is (almost always) still in memory, and the second handling should be relatively fast.

            While the original idea (of still detecting such dups for all RPCs) looked reasonable, it resulted in very long lists and completely blocked MDSes, and a relatively fast second handling looks much cheaper than searching for dups for (if not all RPCs, then still) many other non-modification RPCs.

            green Oleg Drokin added a comment -

            I branched out LU-18077 for the work on the client-side handling of cancel resends, so we can concentrate on the server side here.

            green Oleg Drokin added a comment -

            I think the whole duplicate xid searching is a waste for the case of RPCs like cancel; an rhashtable is an unnecessary complication too.

            basically we have two kinds of portals:

            • "normal" - those are guarded by max RIF (well, I guess evil clients might exist) - the number of requests in the queue there is small, and some requests might take a long time to process, so the duplicate search and reconnect make sense (LU-793)
            • "special" - like the cancel portal: unlimited requests, fast processing, no need for a duplicate xid search - so we should skip that code path altogether (a sketch of the idea follows below)
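
            For illustration only, one way this "normal"/"special" split could be expressed server-side - the helper below is invented, while LDLM_CANCEL_REQUEST_PORTAL is the existing portal constant:

            /* Sketch: classify portals once, so "special" ones never enter the
             * duplicate-XID search while "normal" ones keep the LU-793 logic.
             */
            static bool ptlrpc_portal_skips_resend_dedup(__u32 portal)
            {
                    switch (portal) {
                    case LDLM_CANCEL_REQUEST_PORTAL: /* unlimited, cheap, idempotent */
                            return true;
                    default:                         /* bounded by client max RIF */
                            return false;
                    }
            }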

            moreover, just like Alex proposed - every time we receive a reconnect from the client we should throw away the entire incoming queue at once, for two reasons:

            1. we won't service those requests anyway, so why let them hog memory and slow down processing?
            2. it speeds up the duplicate xid search for subsequent resends (well, of limited value I guess)

            The wrinkle is that we still need to keep requests that are being processed when the reconnect comes (rough sketch below).
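
            For illustration, a simplified sketch of "drop the incoming queue but keep what is in processing" - the struct and field names are made-up stand-ins, not the real ptlrpc service/request fields:

            #include <linux/list.h>
            #include <linux/types.h>

            /* Simplified illustration, not real ptlrpc code: on reconnect, drop
             * queued-but-unstarted requests (the client resends them under the
             * new connection count anyway) and keep the ones a handler thread
             * has already picked up.
             */
            struct sketch_req {
                    struct list_head        link;
                    __u64                   xid;
                    bool                    in_processing;  /* picked up by a handler thread */
            };

            static void drop_stale_incoming(struct list_head *incoming)
            {
                    struct sketch_req *req, *next;

                    list_for_each_entry_safe(req, next, incoming, link) {
                            if (req->in_processing)
                                    continue;       /* the wrinkle: keep these */
                            list_del(&req->link);
                            /* abort/free the dropped request here */
                    }
            }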

            This does not protect us against evil clients, so we still need some sort of counter of requests received and waiting for processing, and when that gets too big we'd return some error, I guess?


            adilger Andreas Dilger added a comment -

            Returning -EINPROGRESS to the client if the server is OOM or the incoming queue is too large will at least allow the client to resend this RPC without returning an error to the application. However, we should also print some kind of notice to the MDS console in this case (e.g. "server request queue overloaded by N RPCs, denying new requests" or similar), otherwise we may have an unresponsive server and clients "hanging" without any visible signs in the logs.
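
            A minimal sketch of that backoff, assuming a hypothetical queued-request counter and limit (the helper and its parameters are invented; real code would likely use the libcfs console macros rather than pr_warn):

            #include <linux/errno.h>
            #include <linux/printk.h>

            /* Sketch: refuse new requests once the incoming queue is too large.
             * -EINPROGRESS makes the client resend later instead of failing the
             * application, and the console message leaves a visible trace on
             * the server.
             */
            static int sketch_check_incoming_overload(unsigned long queued,
                                                      unsigned long limit,
                                                      const char *svc_name)
            {
                    if (queued <= limit)
                            return 0;

                    pr_warn("%s: request queue overloaded by %lu RPCs, denying new requests\n",
                            svc_name, queued - limit);
                    return -EINPROGRESS;
            }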

            Using an rhashtable for the client RPC XID (per export) would be reasonable. While struct rhashtable is not tiny (176 bytes), it is still relatively small compared to struct obd_export at 1024 bytes:

            struct rhashtable {
                    /* size: 176, cachelines: 3, members: 9 */
                    /* sum members: 169, holes: 1, sum holes: 7 */
                    /* last cacheline: 48 bytes */
            };
            
            struct obd_export {
                    /* size: 1024, cachelines: 16, members: 59 */
                    /* sum members: 1000, holes: 5, sum holes: 20 */
                    /* sum bitfield members: 16 bits, bit holes: 1, sum bit holes: 16 bits */
                    /* paddings: 2, sum paddings: 7 */
                    /* forced alignments: 1 */
            

            Ideally, client lock cancellation should always be fast, but if there is something slowing it down (for whatever reason) we don't want it to become exponentially worse as the queue size grows...

            It looks like we might also use this RPC XID hash functionality in a few other places, like mdt_steal_ack_locks() which is walking the exp_outstanding_replies list looking for XIDs.
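
            For reference, a minimal sketch of a per-export XID rhashtable using the stock kernel API (the entry struct and parameter names are invented; locking and lifetime details omitted):

            #include <linux/rhashtable.h>

            /* Hypothetical entry keyed by the 64-bit XID; it would be embedded
             * in (or point at) the ptlrpc_request tracked per export.
             */
            struct xid_entry {
                    __u64                   xe_xid;
                    struct rhash_head       xe_hash;
            };

            static const struct rhashtable_params xid_hash_params = {
                    .key_len                = sizeof(__u64),
                    .key_offset             = offsetof(struct xid_entry, xe_xid),
                    .head_offset            = offsetof(struct xid_entry, xe_hash),
                    .automatic_shrinking    = true,
            };

            /* rhashtable_init() once per export/list, then:
             *   rhashtable_insert_fast(&ht, &e->xe_hash, xid_hash_params);
             *   e = rhashtable_lookup_fast(&ht, &xid, xid_hash_params);
             *   rhashtable_remove_fast(&ht, &e->xe_hash, xid_hash_params);
             * gives O(1) duplicate detection regardless of queue length.
             */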

            The issue that Oleg pointed out about dropping duplicate RPC XIDs, but then skipping the RPC because it is old should be fixed separately. I think in that case it makes sense to update the RPC time to ensure it does not get dropped. However, we also want to ensure that the logs/stats show the original time that the RPC arrived and how long it was waiting, and not the most recent time that the RPC was resent...

            green Oleg Drokin added a comment -

            just as a case in point on that second item, here are some logs demonstrating the issue:

            00000100:00080000:9.1:1721844446.001290:0:42485:0:(service.c:1665:ptlrpc_server_check_resend_in_progress()) @@@ Found duplicate req in processing  req@00000000e093c0c0 x1797962421855424/t0(0) o103->c5c2b0cd-65c2-40b4-94f3-c83728866801@xxx.27@o2ib1:505/0 lens 328/0 e 0 to 0 dl 1721845200 ref 1 fl New:/2/ffffffff rc 0/-1 job:''
            00000100:00080000:9.1:1721844446.001294:0:42485:0:(service.c:1666:ptlrpc_server_check_resend_in_progress()) @@@ Request being processed  req@00000000d5607036 x1797962421855424/t0(0) o103->c5c2b0cd-65c2-40b4-94f3-c83728866801@xxx.27@o2ib1:387/0 lens 328/0 e 0 to 0 dl 1721845082 ref 1 fl New:/2/ffffffff rc 0/-1 job:''
            00000100:00080000:3.1:1721844530.369062:0:42483:0:(service.c:1665:ptlrpc_server_check_resend_in_progress()) @@@ Found duplicate req in processing  req@00000000b8794633 x1797962421855424/t0(0) o103->c5c2b0cd-65c2-40b4-94f3-c83728866801@xxx.27@o2ib1:590/0 lens 328/0 e 0 to 0 dl 1721845285 ref 1 fl New:/2/ffffffff rc 0/-1 job:''
            00000100:00080000:3.1:1721844530.369065:0:42483:0:(service.c:1666:ptlrpc_server_check_resend_in_progress()) @@@ Request being processed  req@00000000d5607036 x1797962421855424/t0(0) o103->c5c2b0cd-65c2-40b4-94f3-c83728866801@xxx.27@o2ib1:505/0 lens 328/0 e 0 to 0 dl 1721845200 ref 1 fl New:/2/ffffffff rc 0/-1 job:''
            

            As you can see, we keep getting those resends, but when we actually get to process and throw them away, the "request being processed" is always req@00000000d5607036, which I guess just happens to be the first one, so we keep reconnecting resends to it until it is picked up and rejected.

            Here's another example:

            00000100:00100000:3.0:1721844550.258114:0:12200:0:(service.c:2130:ptlrpc_server_handle_req_in()) got req x1795981875388544
            00000100:00080000:3.1:1721844550.258152:0:12200:0:(service.c:1665:ptlrpc_server_check_resend_in_progress()) @@@ Found duplicate req in processing  req@000000006f044e5c x1795981875388544/t0(0) o103->0386f931-246b-438d-aff6-a6097abf4761@xxx.22@tcp:455/0 lens 328/0 e 0 to 0 dl 1721845150 ref 1 fl New:/2/ffffffff rc 0/-1 job:''
            00000100:00080000:3.1:1721844550.258158:0:12200:0:(service.c:1666:ptlrpc_server_check_resend_in_progress()) @@@ Request being processed  req@00000000904139a0 x1795981875388544/t0(0) o103->0386f931-246b-438d-aff6-a6097abf4761@xxx.22@tcp:31/0 lens 328/0 e 0 to 0 dl 1721844726 ref 1 fl New:/2/ffffffff rc 0/-1 job:''
            00000100:00100000:8.0:1721844551.378734:0:12208:0:(service.c:2130:ptlrpc_server_handle_req_in()) got req x1795981875388544
            00000100:00100000:8.0:1721844551.378736:0:12208:0:(service.c:1210:ptlrpc_check_req()) @@@ DROPPING req from old connection 6796 < 6886  req@00000000c54961ad x1795981875388544/t0(0) o103->0386f931-246b-438d-aff6-a6097abf4761@xxx.22@tcp:0/0 lens 328/0 e 0 to 0 dl 0 ref 1 fl New:/2/ffffffff rc 0/-1 job:''

            So as you can see, here we are picking super old requests (90 reconnects back!) from the queue and dropping them, even though for unprocessed ones we really want to pick the newest.

            green Oleg Drokin added a comment -

            They are not processed quickly because they have the resent flag set, so instead of being processed, all the cn threads hog the CPU fighting to get the exp_rpc_lock spinlock for their chance to iterate the incoming RPC list searching for duplicates. Available core files show that there are tens of thousands of requests on those lists from a dozen or so clients. This takes a lot of CPU. While the spinlock is separate per client, I imagine those RPCs arrive in batches right after a reconnect.

            Dropping them on reconnect server-side certainly makes sense because at least we quickly trim the list, but we still need sensible resend handling, especially for portals where it does not help to reconnect resends to old requests, like the cancel portal.

            Personally I think optimizing the duplicates search is not going to buy us as much as not running it where it makes no sense at all.

            The other thing is, considering the duplicates search really only makes sense for requests that are in processing, why don't we have a separate list for such requests? That one is surely going to be super short (limited by the number of server threads at most) - see the sketch below.
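
            For illustration, a sketch of that "search only what is in processing" idea - the list parameter is a hypothetical per-export in-processing list, not an existing export field, while rq_xid/rq_exp_list are the real request fields:

            /* Sketch: when a handler thread picks a request up, it would be
             * moved from the (potentially huge) incoming list onto a short
             * per-export "in processing" list, and the resend search would walk
             * only that list, bounded by the number of service threads.
             */
            static struct ptlrpc_request *
            sketch_find_dup_in_processing(struct list_head *in_processing, __u64 xid)
            {
                    struct ptlrpc_request *tmp;

                    list_for_each_entry(tmp, in_processing, rq_exp_list)
                            if (tmp->rq_xid == xid)
                                    return tmp;     /* attach the resend here */
                    return NULL;
            }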

             

            If we go back to LU-793 - the whole idea there was that for long-running requests we want to attach resends to the existing processing, so the response is sent as a reply to the resent request and is actually delivered, instead of us retrying the really expensive operation anyway.

            Doing this for requests that have not yet started processing is counterproductive for two reasons:

            1. the not-yet-started request is not going to be processed anyway, because due to the reconnect it will be dropped by the ptlrpc_check_req() filtering
            2. by reconnecting a new resend into the old request like that, all we do is ensure that even the resent request is not going to be processed, because we just tell the client "hey, this is an old-generation request" even though it really was not!
            tappro Mikhail Pershin added a comment - edited

            Then the question is why they are not processed quickly but stay in the incoming queue. Is canceld overloaded and busy? Or is that the result of an NRS policy which doesn't prioritize cancels properly? Still, I think if we can't control the client flow, then we need to control it on the server - why fill memory with tons of incoming cancels from all clients along with all their resent duplicates? Even if they can be dropped, an OOM on the server is still worse than an -EBUSY response for clients with too many requests waiting.

            Then I'd think about moving to an rhashtable to find duplicates - that way we would avoid duplicates, so less memory usage as a result, and the search would take a reasonable time - sort of a compromise. I still don't have the whole picture for this, only some ideas about possible approaches.

            green Oleg Drokin added a comment -

            "plus we used to respect max-rpcs-in-flight on the clients in the past?"

            ldlm cancels never had a max RIF; if you have to send a lock cancel and there are no slots, what are you going to do? Wait it out? That's the surest way there is to get evicted for unresponsiveness.

            That's why they go to their own separate portal - so other requests don't affect cancellations and, the other way around, cancellations don't affect the other requests.


            People

              Assignee: green Oleg Drokin
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 23
