[LU-13600] limit number of RPCs in flight during recovery Created: 25/May/20  Updated: 09/Dec/20  Resolved: 19/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: LTS12

Issue Links:
Duplicate
Related
is related to LU-9372 OOM happens on OSS during Lustre reco... Resolved
is related to LU-14027 Client recovery statemachine hangs in... Resolved

 Description   

It seems that if there are many uncommitted RPCs on the clients when the server fails, the clients may end up sending a very large number of RPCs to the server during recovery replay/resend. This can cause the MDS/OSS to run out of memory because the incoming request queue grows too large, as seen in LU-9372. This can happen with very fast MDS/OSS nodes with large journals that can process a large number of requests before the journal has committed.

The patch https://review.whamcloud.com/31622 "LU-9372 ptlrpc: fix req_buffers_max and req_history_max setting" added the req_buffers_max parameter to limit the number of RPCs in the incoming request queue (excess RPCs will be dropped by the server until some of the existing RPCs are processed).

However, that parameter is off/unlimited by default, since it isn't obvious how to set it on a particular system (it depends on the number of clients, their max_rpcs_in_flight, and the server RAM size). Also, if a subset of clients consumes all of the spots in the request queue during recovery, then other clients with uncommitted RPCs may be unable to get any of their RPCs into the queue, which may cause recovery to fail due to missing sequence numbers.

Instead, it makes sense for clients to limit the number of RPCs that they send to the server during recovery, so that the MDS/OSS doesn't get overwhelmed by unprocessed RPCs. As long as each client has at least one RPC in flight to the target, this will ensure that recovery can complete properly. This may slightly slow down recovery, but is much better than limiting the number of uncommitted RPCs at the server side during normal operations, since that could force extra journal commits and slow down RPC processing.

My suggestion would be to limit clients to "min(max_rpcs_in_flight, 8)" RPCs in flight during recovery, which is enough to avoid most of the RPC round-trip latency during recovery, but should not overwhelm the server (since it needs to handle this many RPCs in flight during normal operation anyway). The analysis in LU-9372 showed up to 1M RPCs pending on the OSS during recovery of 5000 clients, about 200 RPCs/client, which is far too many even if there are multiple OSTs per OSS.

Even with this in place, it also makes sense for the OSS to protect itself against clients overwhelming it during recovery. There should be a separate patch to default req_buffers_max to be limited by the OSS RAM size, so that the server doesn't OOM if there are older clients that do not limit their RPCs during recovery, or too many clients for some reason, even if this means recovery may not finish correctly (though this is very unlikely). A reasonable default limit would be something like (cfs_totalram_pages() >> (20 - PAGE_SHIFT)), i.e. one RPC per 1MB of RAM. For the reported cases, this would be easily large enough to allow recovery (max 60k or 90k RPCs for 60GB or 90GB RAM, for 2000 and 5000 clients respectively), without overwhelming the OSS.
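
To illustrate the proposed arithmetic, here is a minimal user-space sketch (not actual Lustre code; default_req_buffers_max() is a hypothetical name and a 4KB page size is assumed; in the kernel the inputs would come from cfs_totalram_pages() and PAGE_SHIFT):

#include <stdio.h>

/*
 * Hypothetical sketch of the proposed default: one request buffer per
 * 1MB of RAM.  totalram_pages and page_shift stand in for what
 * cfs_totalram_pages() and PAGE_SHIFT would provide in the kernel.
 */
static unsigned long default_req_buffers_max(unsigned long totalram_pages,
                                             unsigned int page_shift)
{
        /* pages -> MB: (pages << page_shift) bytes / 2^20 bytes per MB */
        return totalram_pages >> (20 - page_shift);
}

int main(void)
{
        /* 60GB and 90GB of RAM with 4KB (2^12 byte) pages */
        unsigned long pages_60g = 60UL << (30 - 12);
        unsigned long pages_90g = 90UL << (30 - 12);

        printf("60GB RAM -> req_buffers_max ~ %lu\n",
               default_req_buffers_max(pages_60g, 12));  /* ~61440 */
        printf("90GB RAM -> req_buffers_max ~ %lu\n",
               default_req_buffers_max(pages_90g, 12));  /* ~92160 */
        return 0;
}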



 Comments   
Comment by Andreas Dilger [ 25/May/20 ]

Mike, could you please take a look at this?

I think limiting the server-side req_buffers_max should be easily done in one patch. Since this is only the default, and could be changed at runtime, it doesn't have to be perfect. IMHO, this would still be a lot better than the current OOM problem during recovery. It has to be large enough to avoid problems during normal processing, but small enough to avoid OOM.

The client-side limit may or may not be easy; I haven't looked into the details. In theory the client should only have 1-2 RPCs in flight during recovery, or maybe the recovery code doesn't check max_rpcs_in_flight? It might be that the 1-2 RPCs in flight during recovery is how many the server has pulled from the request queue so that it can get a contiguous sequence of RPCs to process, while the clients eagerly try to send all of their uncommitted requests so that they are available to the server for processing.

Comment by Mikhail Pershin [ 26/May/20 ]

Andreas, yes, considering the symptoms, it looks like the server accepts many more requests than it can process, and all of them are waiting in the processing queue consuming memory. We know for sure that the server processes recovery requests one-by-one, with no concurrent execution; at the same time, ptlrpc_replay_next has no in-flight request control as far as I can see. It accounts outgoing replays in imp_replay_inflight, but only to know when it becomes empty. That could be a good place to add per-client control.

Comment by Andreas Dilger [ 08/Jun/20 ]

Mike, any chance you could make a patch to try and limit the client-side outgoing RPCs in flight? A reasonable limit would be min(8, max_rpcs_in_flight).

Comment by Mikhail Pershin [ 08/Jun/20 ]

yes, sure

Comment by Mikhail Pershin [ 09/Jun/20 ]

Andreas, on closer inspection it seems the client still sends replays one-by-one: ptlrpc_import_recovery_state_machine() chooses the request to replay and is called again from the replay interpreter, so the next replay is sent only when the reply for the previous one is received. Therefore either this is broken somehow, or the server hit OOM even with a single replay per client. Could that be just due to OST write replay specifics and several OSTs per node?

Comment by Mikhail Pershin [ 09/Jun/20 ]

Well, I think that while the previous comment is true, there is still one place where requests are sent without rate control: ldlm_replay_locks(). All locks are replayed at once, and this looks like the only real place where clients can overwhelm the server with a bunch of RPCs.
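
To illustrate the throttling idea (a minimal user-space model only, not the actual ldlm_replay_locks() change; REPLAY_LIMIT and the loop structure are hypothetical, with the limit of 8 taken from the min(max_rpcs_in_flight, 8) suggestion above):

#include <stdio.h>

#define REPLAY_LIMIT    8       /* hypothetical cap on lock replay RPCs in flight */
#define LOCKS_TO_REPLAY 40      /* pretend the client has 40 locks to replay */

int main(void)
{
        int sent = 0, completed = 0, inflight = 0;

        /*
         * Model of throttled lock replay: instead of sending every lock
         * replay RPC at once, keep at most REPLAY_LIMIT requests in
         * flight and send the next one only when a reply "completes".
         */
        while (completed < LOCKS_TO_REPLAY) {
                while (inflight < REPLAY_LIMIT && sent < LOCKS_TO_REPLAY) {
                        sent++;
                        inflight++;
                        printf("send lock replay %d (in flight %d)\n",
                               sent, inflight);
                }
                /* pretend one reply arrives, freeing a slot */
                completed++;
                inflight--;
                printf("reply for lock replay %d (in flight %d)\n",
                       completed, inflight);
        }
        return 0;
}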

Comment by Andreas Dilger [ 10/Jun/20 ]

Definitely there have been several reports with servers having millions of outstanding RPCs that cause OOM. I don't know if there are logs in one of these tickets that could show the type of RPC being sent.

Comment by Mikhail Pershin [ 10/Jun/20 ]

Speaking about locks, they don't look like the source of the problem; all of them were on the server before recovery as well. So that is interesting; maybe this is related to RESENT requests somehow. I will check the logs in the tickets.

Comment by Andreas Dilger [ 10/Jun/20 ]

I agree that it may not be the locks themselves, but rather the RPCs enqueued for replaying the locks that are causing problems. It may be that the RPC size is larger than the size of the lock itself.

Comment by Chris Hunter (Inactive) [ 11/Jun/20 ]

Would an NRS policy help with this issue?

Comment by Mikhail Pershin [ 11/Jun/20 ]

From the latest logs:

ldlm_lib.c:1639:abort_lock_replay_queue()) Skipped 4416881 previous similar messages

There are 1661 clients and 4M locks in the replay queue, so it seems that is the source of the problem, as we discussed.

Comment by Andreas Dilger [ 11/Jun/20 ]

Chris, no, an NRS policy will not help in this case, because NRS depends on processing all of the RPCs on the server, and the problem here is that too many RPCs are arriving.

Comment by Andreas Dilger [ 11/Jun/20 ]

There are 1661 clients and 4M locks in the replay queue, so it seems that is the source of the problem, as we discussed.

This also makes sense because clients have a limited number of outstanding RPCs to replay, but they may have thousands of locks each.

As a workaround to avoid this, the compute nodes could limit the DLM LRU size so that they don't have so many locks to replay. Something like:

lctl set_param ldlm.namespaces.\*.lru_size=1000

That would keep the maximum number of locks per MDT/OST to 1.6M for 1600 clients, which is roughly 1/3 of the current number.

Comment by Gerrit Updater [ 12/Jun/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38920
Subject: LU-13600 ptlrpc: limit rate of lock replays
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dc1bcf18d4d115d9b1bdbe0eacfa6d39f95a307a

Comment by Mikhail Pershin [ 12/Jun/20 ]

I've made a patch to limit the lock replay rate from the client. Meanwhile, I still wonder why we started seeing this effect only recently. It could just be the result of specific workloads on the clients, etc. But it is suspicious that several different sites reported the same problem at almost the same time. There could be other issues, e.g. something causing too many of the same locks on a client, or similar.

Comment by Mikhail Pershin [ 17/Jun/20 ]

It could also be that an unstable network causes replays to be resent; in that case they will also stay in the recovery queue on the server until processed. That means each replayed lock may have not a single request but several requests in the queue waiting for processing. That could explain why this issue was seen recently at a couple of sites - in both cases there were network errors, and the OOM issue disappeared when the network was stabilised.

Comment by Johann Peyrard (Inactive) [ 19/Jun/20 ]

Hi Mikhail and all,

Working on multiple cases like this one in the past, I have seen a common pattern between them.

I can say that if we mount one OST at a time, we see memory usage increase when the recovery comes into play, and this memory is freed when the recovery is finished; it is a constant and monotonic increase.

So for example, you can have 10G of memory used on the OSS, and when you enter recovery the usage increases by something like 1G per second or so, to reach 20G of used memory. Then at the end of the recovery, you go back to 10G of used memory.

This is just an example, but it is the pattern I always see.

And I have seen an OST take around 50G of memory on the OSS.

It seems weird to me that the "used" memory on the OSS (as shown by "free -g") increases and is then freed at the end of the recovery.

So I would think this OOM-during-recovery issue may be related to how memory is managed during recovery, and maybe to the number of locks involved in the recovery.

On every cluster where I have seen this issue, there were at least 1500 clients mounting the FS.

Comment by Mikhail Pershin [ 19/Jun/20 ]

Johann, yes, that is the most rational explanation at the moment, and the supplied patch should decrease that pressure. I'd appreciate it if you could check how the patch changes that pattern.

Comment by Johann Peyrard (Inactive) [ 19/Jun/20 ]

Hi Mikhail,

We should be able to test that patch, and at the same time we have also been advised to use ldlm.namespaces.*.lru_size=1000 on the client nodes.

 

Comment by Gerrit Updater [ 19/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38920/
Subject: LU-13600 ptlrpc: limit rate of lock replays
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3b613a442b8698596096b23ce82e157c158a5874

Comment by Gerrit Updater [ 19/Jun/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39111
Subject: LU-13600 ptlrpc: limit rate of lock replays
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 115c8b69d3ff2dfb3b3843c21a7752e5e0034c91

Comment by Peter Jones [ 19/Jun/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 22/Jun/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39140
Subject: LU-13600 ptlrpc: re-enterable signal_completed_replay()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fedadd783f9b1b113ec27c40d1c6f87e0ebce9aa

Comment by Gerrit Updater [ 03/Jul/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39140/
Subject: LU-13600 ptlrpc: re-enterable signal_completed_replay()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 24451f379050373cb05ad1df7dd19134f21abba7

Comment by Gerrit Updater [ 11/Jul/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39111/
Subject: LU-13600 ptlrpc: limit rate of lock replays
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 6b6d9c0911e45a9f38c1fdedfbb91293bd21cfb5
