[LU-13600] limit number of RPCs in flight during recovery Created: 25/May/20 Updated: 09/Dec/20 Resolved: 19/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.6 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LTS12 | ||
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
It seems that if there are many uncommitted RPCs on the clients when the server fails, they may end up sending a very large number of RPCs to the server during recovery replay/resend. This can cause the MDS/OSS to run out of memory because the incoming request queue grows too large, as seen in related tickets. The patch https://review.whamcloud.com/31622 added a server-side limit for this (the req_buffers_max parameter). However, that parameter is off/unlimited by default, as it isn't obvious how to set it on a particular system (it depends on the number of clients, their max_rpcs_in_flight, and the server RAM size). Also, if a subset of clients consumes all of the spots in the request queue during recovery, then other clients with uncommitted RPCs may not get any of their RPCs into the queue, and this can cause recovery to fail due to missing sequence numbers.

Instead, it makes sense for clients to limit the number of RPCs that they send to the server during recovery, so that the MDS/OSS doesn't get overwhelmed by unprocessed RPCs. As long as each client has at least one RPC in flight to the target, recovery can complete properly. This may slightly slow down recovery, but is much better than limiting the number of uncommitted RPCs on the server side during normal operations, since that could force extra journal commits and slow down RPC processing. My suggestion would be to limit clients to "min(max_rpcs_in_flight, 8)" RPCs in flight during recovery, which is enough to hide most of the RPC round-trip latency during recovery, but should not overwhelm the server (since it needs to handle this many RPCs in flight during normal operation anyway).

Even with this in place, it also makes sense for the OSS to protect itself against clients overwhelming it during recovery. There should be a separate patch to default req_buffers_max to a limit derived from the OSS RAM size, so that the server doesn't OOM if there are older clients that do not limit their RPCs during recovery, or too many clients for some reason, even if this means recovery may not finish correctly (though this is very unlikely). A reasonable default limit would be something like (cfs_totalram_pages() / 1048576). For the reported cases, this would be easily large enough to allow recovery (at most 60k or 90k RPCs for 60GB or 90GB of RAM, for 2000 and 5000 clients respectively), without overwhelming the OSS (1 RPC per 1MB of RAM). |
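For illustration only, a minimal sketch of the two proposed limits follows. The function names and the exact RAM formula below are assumptions for the sketch, not the actual Lustre symbols or the eventual implementation; it only mirrors the "min(max_rpcs_in_flight, 8)" client cap and the "1 RPC per 1MB of RAM" server rule of thumb from the description.

/* Sketch of the proposed defaults; all names here are illustrative. */
#include <stdio.h>

/* Client side: cap replay/resend RPCs in flight at min(max_rpcs_in_flight, 8). */
static unsigned int recovery_rpcs_in_flight(unsigned int max_rpcs_in_flight)
{
	return max_rpcs_in_flight < 8 ? max_rpcs_in_flight : 8;
}

/* Server side: default req_buffers_max to roughly 1 request buffer per 1MB
 * of RAM, per the rule of thumb above (assumption for this sketch). */
static unsigned long long default_req_buffers_max(unsigned long long total_ram_bytes)
{
	return total_ram_bytes >> 20;	/* 60GB of RAM -> ~60k request buffers */
}

int main(void)
{
	printf("recovery RPCs in flight: %u\n", recovery_rpcs_in_flight(32));	/* -> 8 */
	printf("default req_buffers_max: %llu\n",
	       default_req_buffers_max(60ULL << 30));				/* -> 61440 */
	return 0;
}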
| Comments |
| Comment by Andreas Dilger [ 25/May/20 ] |
|
Mike, could you please take a look at this? I think the patch to set a default for the server-side req_buffers_max limit should be easy to do in one patch. Since this is only the default, and can be changed at runtime, it doesn't have to be perfect; IMHO, it would still be a lot better than the current OOM problem during recovery. It has to be large enough to avoid problems during normal processing, but small enough to avoid OOM. The client-side limit may or may not be easy, but I haven't looked into the details. In theory the client should only have 1-2 RPCs in flight during recovery, or maybe the recovery code doesn't check max_rpcs_in_flight? It might be that the 1-2 RPCs in flight during recovery are just what the server has pulled from the request queue so that it can process a contiguous sequence of RPCs, while the clients eagerly send all of their uncommitted requests so that they are available to the server for processing. |
| Comment by Mikhail Pershin [ 26/May/20 ] |
|
Andreas, yes, considering the symptoms it looks like the server accepts many more requests than it can process, and all of them wait in the processing queue consuming memory. We know for sure that the server processes recovery requests one by one, with no concurrent execution. At the same time, ptlrpc_replay_next has no in-flight request control as far as I can see; it accounts for outgoing replays in imp_replay_inflight, but only to know when the queue becomes empty. That could be a good point to add per-client control. |
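As a minimal sketch of the kind of per-import in-flight control discussed here (userspace C for illustration; the structure and function names are hypothetical, only the idea of an imp_replay_inflight-style counter is taken from the comment above):

#include <stdatomic.h>
#include <stdbool.h>

#define REPLAY_MAX_INFLIGHT 8	/* i.e. min(max_rpcs_in_flight, 8) */

struct import_sketch {
	atomic_int replay_inflight;	/* stands in for imp_replay_inflight */
};

/* Gate a replay send on the current in-flight count instead of only using
 * the counter to detect when the replay queue has drained. */
static bool replay_send_allowed(struct import_sketch *imp)
{
	return atomic_load(&imp->replay_inflight) < REPLAY_MAX_INFLIGHT;
}

static void replay_sent(struct import_sketch *imp)
{
	atomic_fetch_add(&imp->replay_inflight, 1);
}

/* Called from the reply interpreter; the caller would then try to send the
 * next queued replay, keeping at most REPLAY_MAX_INFLIGHT outstanding. */
static void replay_completed(struct import_sketch *imp)
{
	atomic_fetch_sub(&imp->replay_inflight, 1);
}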
| Comment by Andreas Dilger [ 08/Jun/20 ] |
|
Mike, any chance you could make a patch to try and limit the client-side outgoing RPCs in flight? A reasonable limit would be min(8, max_rpcs_in_flight). |
| Comment by Mikhail Pershin [ 08/Jun/20 ] |
|
yes, sure |
| Comment by Mikhail Pershin [ 09/Jun/20 ] |
|
Andreas, on closer inspection it seems the client still sends replays one by one: ptlrpc_import_recovery_state_machine() chooses a request to replay and is called again from the replay interpreter, so the next replay is sent only when the reply for the previous one has been received. Therefore either this is broken somehow, or the server hit OOM even with a single replay per client. Could that be just due to OST write replay specifics and several OSTs per node? |
| Comment by Mikhail Pershin [ 09/Jun/20 ] |
|
Well, I think that while the previous comment is true, there is still one place where requests are sent without rate control: ldlm_replay_locks(). All locks are replayed at once, and this looks like the only real place where clients can overwhelm the server with a bunch of RPCs. |
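For illustration, a windowed lock replay could look roughly like the sketch below. The names are hypothetical and this is not the actual ldlm_replay_locks() code nor the eventual patch; it only shows the shape of the rate control being proposed: keep a small window of replay RPCs outstanding instead of firing one RPC per lock all at once.

#include <stddef.h>

#define LOCK_REPLAY_WINDOW 8	/* replay RPCs allowed in flight per client */

struct lock_replay_state {
	size_t next;		/* index of the next lock to replay */
	size_t nlocks;		/* total locks queued for replay */
	int    inflight;	/* replay RPCs currently outstanding */
};

/* Send replay RPCs only until the window is full; the reply callback would
 * decrement inflight and call this again, so the window refills as replies
 * come back rather than flooding the server's recovery queue. */
static void lock_replay_send_some(struct lock_replay_state *st)
{
	while (st->inflight < LOCK_REPLAY_WINDOW && st->next < st->nlocks) {
		/* build and send the replay RPC for lock st->next here,
		 * registering a reply callback that decrements st->inflight
		 * and calls lock_replay_send_some() again */
		st->next++;
		st->inflight++;
	}
}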
| Comment by Andreas Dilger [ 10/Jun/20 ] |
|
Definitely there have been several reports with servers having millions of outstanding RPCs that cause OOM. I don't know if there are logs in one of these tickets that could show the type of RPC being sent. |
| Comment by Mikhail Pershin [ 10/Jun/20 ] |
|
Speaking about locks, they don't look like the source of the problem; all of them were on the server before recovery as well. So that is interesting, maybe this is related to RESENT requests somehow. I will check the logs in the tickets. |
| Comment by Andreas Dilger [ 10/Jun/20 ] |
|
I agree that it may not be the locks themselves, but rather the RPCs enqueued for replaying the locks that are causing problems. It may be that the RPC size is larger than the size of the lock itself. |
| Comment by Chris Hunter (Inactive) [ 11/Jun/20 ] |
|
Would an NRS policy help with this issue? |
| Comment by Mikhail Pershin [ 11/Jun/20 ] |
|
From the latest logs: "ldlm_lib.c:1639:abort_lock_replay_queue()) Skipped 4416881 previous similar messages". There are 1661 clients and 4M locks in the replay queue, so it seems that is the source of the problem, as we discussed. |
| Comment by Andreas Dilger [ 11/Jun/20 ] |
|
Chris, no, an NRS policy will not help in this case, because NRS depends on processing all of the RPCs that reach the server, and the problem here is that too many RPCs are arriving in the first place. |
| Comment by Andreas Dilger [ 11/Jun/20 ] |
|
This also makes sense because clients have a limited number of outstanding RPCs to replay, but they may have thousands of locks each. As a workaround, the compute nodes could limit the DLM LRU size so that they don't have so many locks to replay. Something like:
lctl set_param ldlm.namespaces.\*.lru_size=1000
That would keep the maximum number of locks per MDT/OST at 1.6M for 1600 clients, which is about 1/3 of the current number. |
| Comment by Gerrit Updater [ 12/Jun/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38920 |
| Comment by Mikhail Pershin [ 12/Jun/20 ] |
|
I've made a patch to limit the lock replay rate from the client. Meanwhile, I still wonder why we started seeing this effect only recently. It could just be the result of specific workloads on the clients, etc., but it is suspicious that several different sites reported the same problem at almost the same time. There could be other issues, e.g. something causing too many identical locks on a client, or similar. |
| Comment by Mikhail Pershin [ 17/Jun/20 ] |
|
It could also be that an unstable network causes replays to be resent; in that case the resends will also stay in the recovery queue on the server until processed. That means each replayed lock may have not one but several requests in the queue waiting for processing. That could explain why this issue was seen recently at a couple of sites - in both cases there were network errors, and the OOM issue disappeared when the network was stabilised. |
| Comment by Johann Peyrard (Inactive) [ 19/Jun/20 ] |
|
Hi Mikhail and all, Having worked on multiple cases like this one in the past, I have seen a common pattern between them. If we mount one OST at a time, we see memory usage increase when recovery comes into play, and this memory is freed when recovery is finished; it is a constant and monotonic increase. So for example you can have 10G of memory used on the OSS, and when you enter recovery the usage increases by something like 1G per second or similar until it reaches 20G. Then at the end of recovery, you go back to 10G of used memory. This is just an example, but it is the pattern I always see, and I have seen an OST take around 50G of memory on the OSS. It seems strange to me that the "memory used" on the OSS (looking at "free -g") increases and is then freed at the end of recovery. So I would think this OOM-during-recovery issue may be related to how memory is managed during recovery, and maybe to the number of locks involved in the recovery. On every cluster where I have seen this issue, there were at least 1500 clients mounting the FS. |
| Comment by Mikhail Pershin [ 19/Jun/20 ] |
|
Johann, yes, that is the most rational explanation at the moment, and the supplied patch should decrease that pressure. I'd appreciate it if you could check how the patch changes that pattern. |
| Comment by Johann Peyrard (Inactive) [ 19/Jun/20 ] |
|
Hi Mikhail, We should be able to test that patch, and at the same time we have also advised using ldlm.namespaces.*.lru_size=1000 on the client nodes.
|
| Comment by Gerrit Updater [ 19/Jun/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38920/ |
| Comment by Gerrit Updater [ 19/Jun/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39111 |
| Comment by Peter Jones [ 19/Jun/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 22/Jun/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39140 |
| Comment by Gerrit Updater [ 03/Jul/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39140/ |
| Comment by Gerrit Updater [ 11/Jul/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39111/ |