It isn't clear what the benefit of a tunable to limit the number of RQBDs is if it is off by default, since most users will not even know it exists. Even for users who do know about the tunable, there probably isn't an easy way to pick a good value for the maximum number of RQBD buffers, since that depends greatly on the RAM size of the server, the number of clients, and the client load.
Looking at older comments here, there are several things that concern me:
- the total number of RQBDs allocated seems far more than could ever possibly be used, since clients should typically only have at most 8 RPCs in flight per OST
- during recovery, clients should normally only have a single RPC in flight per OST
This means there shouldn't be more than about 5000 clients * 20 OSTs/OSS = 100000 RPCs/OSS outstanding on the OSS (maybe 200000 RPCs if you have 40 OSTs/OSS in failover mode, or is 10 OSTs/OSS the normal config and 20 OSTs/OSS the failover config?).
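For scale, here is a rough upper bound on the buffer memory that would imply, assuming as a worst case one 32KB RQBD allocation per outstanding RPC (the 32KB-per-rqbd figure comes from the later comments in this ticket):

```c
/* Back-of-envelope only: worst case of one 32KB RQBD allocation per
 * outstanding RPC; real RQBD usage may be lower. */
#include <stdio.h>

int main(void)
{
	long long clients = 5000;
	long long osts_per_oss = 20;              /* 40 in failover mode */
	long long rpcs = clients * osts_per_oss;  /* 100000 RPCs/OSS */
	long long rqbd_alloc = 32 * 1024;         /* bytes per rqbd buffer */

	printf("%lld RPCs * %lld bytes = %.2f GB\n",
	       rpcs, rqbd_alloc, (double)(rpcs * rqbd_alloc) / 1e9);
	return 0;
}
```

That is on the order of 3.3 GB (double it for the failover case), which is already significant but still well below what has been reported, so the RQBD count itself looks like the problem.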
Are we sure that clients are not generating a flood of RPCs per OST during recovery (more than max_rpcs_in_flight)? I also recall there may be a race condition during RQBD allocation that can cause it to allocate more buffers than needed if many threads try to send an RPC at the same time and the buffers run out, as in the toy sketch below.
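To illustrate the class of race I mean (this is a toy userspace sketch, not the actual ptlrpc RQBD code): if the "are we out of buffers?" check and the pool growth are not done under one lock, several service threads can all see the pool as empty and each grow it, overshooting the target.

```c
/* Toy check-then-allocate race, not Lustre code: nfree is read without
 * the lock, so every thread can decide to grow the pool at once. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8

static int nfree;                       /* free buffers in the pool */
static int nallocated;                  /* total buffers ever allocated */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void *service_thread(void *arg)
{
	if (nfree == 0) {               /* unlocked check ... */
		pthread_mutex_lock(&pool_lock);
		nallocated += 64;       /* ... so each thread grows by 64 */
		nfree += 64;
		pthread_mutex_unlock(&pool_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, service_thread, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	/* With the race, this can print far more than the 64 needed. */
	printf("allocated %d buffers\n", nallocated);
	return 0;
}
```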
On a related note, having so many OSTs on a single OSS is not really a good configuration: it doesn't provide very good performance, it has a very high RAM requirement (as you are seeing), and it makes the OSS a larger point of failure if it goes down. In addition to the RAM consumed by the outstanding RPCs that clients may send, a significant amount of RAM is also used by the ldiskfs journals.
Also, if you are seeing messages like the following in your logs:
then there is something significantly wrong with your HA or STONITH configuration. MMP is meant as a backup sanity check that prevents a double mount/import and the resulting filesystem corruption (which it did successfully for 20 OSTs in this case), but it is not intended to be the primary HA exclusion method for the storage.
I was just talking about this problem, and I realized I had never clearly indicated in this ticket that the reason for the 32k allocation for each ptlrpc_rqbd (for a real size of 17k) is the patch for LU-4755 ("LU-4755 ptlrpc: enlarge OST_MAXREQSIZE for 4MB RPC"). Since the way this size was chosen looks a bit empirical, we may also want to try 15k (+ payload size, thus leading to 16k) in order to halve the real consumed size.
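If I understand the halving correctly, it comes from the allocator rounding the 17k request up to the next power-of-two kmalloc size class (32k), while anything at or below 16k stays in the 16k class. A quick userspace sketch of that rounding (the sizes are the ones from this comment; the helper is purely illustrative):

```c
/* Illustrative only: assumes the rqbd buffers come from the generic
 * power-of-two kmalloc size classes (..., 16k, 32k, ...). */
#include <stdio.h>

static unsigned long roundup_pow2(unsigned long x)
{
	unsigned long p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

int main(void)
{
	/* current request size: 17k rounds up to the 32k class */
	printf("17k request -> %luk class\n", roundup_pow2(17 * 1024) / 1024);
	/* proposed: 15k + payload = 16k fits the 16k class exactly */
	printf("16k request -> %luk class\n", roundup_pow2(16 * 1024) / 1024);
	return 0;
}
```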
To achieve almost the same size reduction, we could also try using a dedicated kmem_cache/slab for the 17k ptlrpc_rqbd buffers, keeping in mind that this may be made useless by the kernel's slab merging.
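Something along these lines, as a rough sketch only (the cache name and the RQBD_BUF_SIZE constant are placeholders, not names from the Lustre tree); the slab-merging caveat is that the kernel may fold this cache into a compatible generic one unless merging is disabled, e.g. with the slab_nomerge boot option:

```c
/* Sketch of a dedicated slab cache sized for the ~17k rqbd buffer, so
 * allocations are not rounded up to the 32k kmalloc class.  Names are
 * illustrative only. */
#include <linux/errno.h>
#include <linux/slab.h>

#define RQBD_BUF_SIZE	(17 * 1024)

static struct kmem_cache *rqbd_buf_cache;

static int rqbd_cache_init(void)
{
	rqbd_buf_cache = kmem_cache_create("rqbd_buf_cache",
					   RQBD_BUF_SIZE, 0, 0, NULL);
	return rqbd_buf_cache ? 0 : -ENOMEM;
}

static void *rqbd_buf_alloc(void)
{
	/* GFP_NOFS is typical for allocations on the server I/O path */
	return kmem_cache_alloc(rqbd_buf_cache, GFP_NOFS);
}

static void rqbd_buf_free(void *buf)
{
	kmem_cache_free(rqbd_buf_cache, buf);
}
```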