It seems that if clients have many uncommitted RPCs when the server fails, they may end up sending a very large number of RPCs to the server during recovery replay/resend. This can cause the MDS/OSS to run out of memory because the incoming request queue grows too large, as seen in LU-9372. This can happen with very fast MDS/OSS nodes with large journals that can process a large number of requests before the journal has committed.
The patch https://review.whamcloud.com/31622 "LU-9372 ptlrpc: fix req_buffers_max and req_history_max setting" added the req_buffers_max parameter to limit the number of RPCs in the incoming request queue (excess RPCs will be dropped by the server until some of the existing RPCs are processed).
However, that parameter is off/unlimited by default, as it isn't obvious how to set it on a particular system (it depends on the number of clients, their max_rpcs_in_flight, and the server RAM size). Also, if a subset of clients consume all of the spots in the request queue during recovery, then it is possible that other clients with uncommitted RPCs cannot get any of their RPCs into the queue, and this may cause recovery to fail due to missing sequence numbers.
Instead, it makes sense for clients to limit the number of RPCs that they send to the server during recovery, so that the MDS/OSS doesn't get overwhelmed by unprocessed RPCs. As long as each client has at least one RPC in flight to the target, recovery can complete properly. This may slightly slow down recovery, but is much better than limiting the number of uncommitted RPCs at the server side during normal operations, since that could force extra journal commits and slow down RPC processing.
My suggestion would be to limit clients to "min(max_rpcs_in_flight, 8)" RPCs in flight during recovery, which is enough to avoid most of the RPC round-trip latency during recovery, but should not overwhelm the server (since it needs to handle this many RPCs in flight anyway). The analysis of LU-9372 showed up to 1M RPCs pending on the OSS during recovery of 5000 clients, about 200 RPCs/client, which is far too many even if there are multiple OSTs per OSS.
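To put numbers on the proposed client-side cap, here is a minimal Python sketch of the arithmetic only, not of the ptlrpc code. The function names and the RECOVERY_RPC_CAP constant are hypothetical (no such tunable exists today), and it assumes the cap applies per target and that every client reaches it at the same time:

```python
# Hypothetical model of the proposed cap; none of these names exist in Lustre.
RECOVERY_RPC_CAP = 8  # the "8" from min(max_rpcs_in_flight, 8)


def recovery_rpcs_in_flight(max_rpcs_in_flight, cap=RECOVERY_RPC_CAP):
    """RPCs one client keeps in flight to one target during recovery."""
    # Never drop below 1: each client needs at least one RPC in flight
    # for replay to make progress.
    return max(1, min(max_rpcs_in_flight, cap))


def worst_case_queue_depth(num_clients, max_rpcs_in_flight, targets_per_server=1):
    """Upper bound on RPCs queued on one server if every client hits the cap."""
    per_client = recovery_rpcs_in_flight(max_rpcs_in_flight)
    return num_clients * targets_per_server * per_client


# 5000 clients with max_rpcs_in_flight=256, one OST per OSS
print(worst_case_queue_depth(5000, 256))      # 40000
# same clients, four OSTs on the OSS
print(worst_case_queue_depth(5000, 256, 4))   # 160000
```

Under these assumptions the bound for 5000 clients is 40k RPCs per target, well below the 1M pending RPCs reported for LU-9372.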
Even with this in place, it also makes sense for the OSS to avoid clients overwhelming it during recovery. There should be a separate patch to default req_buffers_max to be limited by the OSS RAM size, so that the server doesn't OOM if there are older clients that do not limit their RPCs during recovery, or too many clients for some reason, even if this means recovery may not finish correctly (though this is very unlikely). A reasonable default limit would be something like (cfs_totalram_pages() >> (20 - PAGE_SHIFT)), i.e. the total RAM in bytes divided by 1048576. For the reported cases, this would be easily large enough to allow recovery (max 60k or 90k RPCs for 60GB or 90GB RAM, for 2000 and 5000 clients respectively), without overwhelming the OSS (1 RPC per 1MB of RAM).
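The RAM-based default can be checked with a short Python sketch. It only models the "1 RPC per 1MB of RAM" rule from the paragraph above; the function names are hypothetical, RAM is taken in bytes rather than pages, and "GB" is treated as GiB:

```python
# Hypothetical model of a RAM-derived req_buffers_max default (not Lustre code).
MIB = 1 << 20
GIB = 1 << 30


def default_req_buffers_max(total_ram_bytes):
    """One queued RPC allowed per MiB of server RAM."""
    return total_ram_bytes // MIB


def per_client_share(limit, num_clients):
    """Queue slots available per client if the limit is shared evenly."""
    return limit // num_clients


for ram_gib, clients in ((60, 2000), (90, 5000)):
    limit = default_req_buffers_max(ram_gib * GIB)
    print(ram_gib, clients, limit, per_client_share(limit, clients))
# 60 2000 61440 30
# 90 5000 92160 18
```

With an even split, each client would still get 18 to 30 queue slots in the two reported configurations, more than the 8 RPCs per client proposed for the client-side cap.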