LU-13600: limit number of RPCs in flight during recovery

Details

    Description

      It seems that if there are many uncommitted RPCs on the clients when the server fails, they may end up sending a very large number of RPCs to the server during recovery replay/resend. This can cause the MDS/OSS to run out of memory because the number of RPCs in the incoming request queue grows too large, as seen in LU-9372. This can happen with very fast MDS/OSS nodes with large journals that can process a large number of requests before the journal has committed.

      The patch https://review.whamcloud.com/31622 "LU-9372 ptlrpc: fix req_buffers_max and req_history_max setting" added the req_buffers_max parameter to limit the number of RPCs in the incoming request queue (excess RPCs will be dropped by the server until some of the existing RPCs are processed).
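
      For illustration, here is a minimal sketch of the behaviour that tunable gives the server (hypothetical code, not the actual ptlrpc implementation; the function and parameter names are made up for this example):

          /* Hypothetical helper: decide whether the server may queue another
           * incoming request.  A req_buffers_max value of 0 means "unlimited",
           * matching the default described below; excess requests are dropped
           * and later resent by the clients. */
          static bool can_queue_incoming_req(unsigned int nreqs_queued,
                                             unsigned int req_buffers_max)
          {
                  if (req_buffers_max == 0)
                          return true;

                  return nreqs_queued < req_buffers_max;
          }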

      However, that parameter is off/unlimited by default, since it isn't obvious how to set it on a particular system (it depends on the number of clients, their max_rpcs_in_flight, and the server RAM size). Also, if a subset of clients consumes all of the slots in the request queue during recovery, it is possible that other clients with uncommitted RPCs cannot get any of their RPCs into the queue, which may cause recovery to fail due to missing sequence numbers.

      Instead, it makes sense for clients to limit the number of RPCs that they send to the server during recovery, so that the MDS/OSS doesn't get overwhelmed by unprocessed RPCs. As long as each client has at least one RPC in flight to the target, this will ensure that recovery can complete properly. This may slightly slow down recovery, but it is much better than limiting the number of uncommitted RPCs on the server side during normal operations, since that could force extra journal commits and slow down RPC processing.

      My suggestion would be to limit clients to "min(max_rpcs_in_flight, 8)" RPCs in flight during recovery, which is enough to hide most of the RPC round-trip latency during recovery, but should not overwhelm the server (since it needs to handle this many RPCs in flight anyway). The analysis in LU-9372 showed up to 1M RPCs pending on the OSS during recovery of 5000 clients, about 200 RPCs/client, which is far too many even if there are multiple OSTs per OSS.
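
      As a rough sketch of that client-side cap (hypothetical helper, not an actual Lustre function; only the min(max_rpcs_in_flight, 8) rule comes from the suggestion above):

          /* Hypothetical helper: how many RPCs a client would keep in flight
           * while its import is recovering, per the suggestion above. */
          static unsigned int recovery_rpcs_in_flight(unsigned int max_rpcs_in_flight)
          {
                  const unsigned int recovery_cap = 8;    /* assumed constant */

                  return max_rpcs_in_flight < recovery_cap ?
                         max_rpcs_in_flight : recovery_cap;
          }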

      Even with this in place, it also makes sense for the OSS to protect itself from being overwhelmed by clients during recovery. There should be a separate patch to default req_buffers_max to a limit based on the OSS RAM size, so that the server doesn't OOM if there are older clients that do not limit their RPCs during recovery, or too many clients for some reason, even if this means recovery may not finish correctly (though this is very unlikely). A reasonable default limit would be something like (cfs_totalram_pages() / 1048576). For the reported cases, this would easily be large enough to allow recovery (at most 60k or 90k RPCs for 60GB or 90GB of RAM, with 2000 and 5000 clients respectively), without overwhelming the OSS (1 RPC per 1MB of RAM).
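
      A sketch of how such a RAM-based default might be computed (hypothetical; it assumes the intent is roughly one queued RPC per 1MB of RAM, as the 60GB -> ~60k example implies, so the page count from cfs_totalram_pages() is converted to megabytes):

          /* Hypothetical default for req_buffers_max: about one queued RPC per
           * 1MB of server RAM.  cfs_totalram_pages() is the libcfs wrapper for
           * the total RAM page count; shifting by (20 - PAGE_SHIFT) converts
           * pages to megabytes (e.g. ~61440 for 60GB of RAM). */
          static unsigned long req_buffers_max_default(void)
          {
                  return cfs_totalram_pages() >> (20 - PAGE_SHIFT);
          }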

          Activity

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39111/
            Subject: LU-13600 ptlrpc: limit rate of lock replays
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 6b6d9c0911e45a9f38c1fdedfbb91293bd21cfb5

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39140/
            Subject: LU-13600 ptlrpc: re-enterable signal_completed_replay()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 24451f379050373cb05ad1df7dd19134f21abba7

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39140
            Subject: LU-13600 ptlrpc: re-enterable signal_completed_replay()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fedadd783f9b1b113ec27c40d1c6f87e0ebce9aa

            pjones Peter Jones added a comment -

            Landed for 2.14

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39111
            Subject: LU-13600 ptlrpc: limit rate of lock replays
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 115c8b69d3ff2dfb3b3843c21a7752e5e0034c91

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38920/
            Subject: LU-13600 ptlrpc: limit rate of lock replays
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3b613a442b8698596096b23ce82e157c158a5874

            jpeyrard Johann Peyrard (Inactive) added a comment -

            Hi Mikhail,

            We should be able to test that patch; at the same time, we have also advised setting ldlm.namespaces.*.lru_size=1000 on the client nodes.

            tappro Mikhail Pershin added a comment -

            Johann, yes, that is the most rational explanation at the moment, and the supplied patch should decrease that pressure. I would appreciate it if you could check how the patch changes that pattern.

            jpeyrard Johann Peyrard (Inactive) added a comment - edited

            Hi Mikhail and all,

            Having worked on multiple cases like this one in the past, I have seen a common pattern between them.

            If we mount one OST at a time, we see memory usage increase when recovery comes into play, and that memory is freed when recovery is finished; the increase is constant and monotonic.

            For example, you can have 10G of memory used on the OSS, then enter recovery, which increases memory usage by roughly 1G per second or so until it reaches 20G of used memory. Then, at the end of recovery, you go back to 10G of used memory.

            This is just an example, but it is the pattern I have always seen, and I have seen an OST recovery take around 50G of memory on the OSS.

            It seems strange to me that the "used" memory on the OSS (as reported by "free -g") increases during recovery and is freed at the end of it.

            So I would think this OOM-during-recovery issue may be related to how memory is managed during recovery, and perhaps to the number of locks involved in the recovery.

            On every cluster where I have seen this issue, there were at least 1500 clients mounting the filesystem.

            tappro Mikhail Pershin added a comment -

            It could also be that an unstable network causes replays to be resent; in that case the resends will also stay in the recovery queue on the server until processed. That means each replayed lock may have not just a single request but several requests waiting in the queue for processing. This could explain why the issue was seen recently at a couple of sites - in both cases there were network errors, and the OOM issue disappeared once the network was stabilised.

            People

              tappro Mikhail Pershin
              adilger Andreas Dilger