It isn't clear what the benefit of a tunable to limit the number of RQBDs is if it is off by default, since most users will not even know it exists. Even if users do know the tunable exists, there probably isn't an easy way to pick a good value for the maximum number of RQBD buffers, since that depends greatly on the server's RAM size, the number of clients, and the client load.
Looking at older comments here, there are several things that concern me:
- the total number of RQBDs allocated seems far larger than could ever plausibly be used, since clients should typically have at most 8 RPCs in flight per OST
- during recovery, clients should normally only have a single RPC in flight per OST
This means there shouldn't be more than about 5000 clients * 20 OSTs/OSS = 100,000 RPCs/OSS outstanding on the OSS (maybe 200,000 RPCs if you have 40 OSTs/OSS in failover mode, or is 10 OSTs/OSS the normal config and 20 OSTs/OSS the failover config?).
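To spell out the arithmetic (using the recovery-time limit of one RPC in flight per client per OST from the points above; the 5000-client count is the working figure here):

```latex
5000~\text{clients} \times 20~\text{OSTs/OSS} \times 1~\text{RPC in flight} = 100{,}000~\text{RPCs/OSS}
5000~\text{clients} \times 40~\text{OSTs/OSS} \times 1~\text{RPC in flight} = 200{,}000~\text{RPCs/OSS (failover)}
```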
Are we sure that clients are not generating a flood of RPCs per OST during recovery (more than max_rpcs_in_flight)? I also recall there may be a race condition during RQBD allocation, which could cause more buffers to be allocated than are needed when many threads try to send RPCs at the same time and the buffers run out.
On a related note, having so many OSTs on a single OSS is not really a good configuration: it doesn't deliver very good performance, it has (as you can see) a very high RAM requirement, and it creates a larger point of failure if the OSS goes down. On top of the outstanding RPCs that the clients may send, the ldiskfs journals also consume a significant amount of RAM.
Also, if you are seeing messages like the following in your logs:
then there is something significantly wrong with your HA or STONITH configuration. MMP is meant as a backup sanity check against double mounting, to prevent filesystem corruption (which it did successfully for the 20 OSTs in this case), but it is not intended to be the primary HA exclusion mechanism for the storage.
Andreas, I can't answer your concerns about the Lustre RPC and HA config/behavior, but I am sure Bruno will do so soon. You may be right, though, that something should be done to prevent a peer from trying to mount all targets upon restart/reboot.
Concerning the memory consumption, I can confirm that the huge number of size-1024 and size-32768 slab objects was very close to the current number of RQBDs (the sum of scp_nrqbds_total).
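As a rough cross-check (my assumption: each RQBD accounts for one size-1024 plus one size-32768 object, per the slab counts above), even the ~100,000 outstanding-RPC figure estimated earlier already amounts to several GiB:

```latex
(1024 + 32768)~\text{bytes} \approx 33~\text{KiB per RQBD}
100{,}000~\text{RQBDs} \times 33~\text{KiB} \approx 3.1~\text{GiB}
```

Any RQBDs allocated beyond what clients can actually keep in flight only add to that.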
About your comment concerning "a race condition during RQBD allocation, that may cause it to allocate more buffers than it needed ...", I presume you refer to the checks in the ptlrpc_check_rqbd_pool()/ptlrpc_grow_req_bufs() routines, where a lot of RQBDs+buffers may already be allocated and being handled when the pool is grown again. This is where my 2 patches try to limit the allocations.
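To make that race concrete, here is a minimal user-space sketch of the check-then-grow pattern (this is NOT the actual ptlrpc code; the names, sizes, and locking are hypothetical simplifications): every thread that observes the pool below its low-water mark triggers a grow, so N concurrent senders can allocate N batches where one would have sufficed.

```c
/* Hypothetical illustration of a check-then-grow race; NOT the real
 * ptlrpc_check_rqbd_pool()/ptlrpc_grow_req_bufs() code.
 * Build: cc -pthread race.c -o race */
#include <pthread.h>
#include <stdio.h>

#define GROW_BATCH 64        /* buffers added per grow call */
#define LOW_WATER  32        /* free-buffer threshold that triggers a grow */
#define NTHREADS   8

static int nrqbds_avail;     /* free buffers in the pool */
static int nrqbds_total;     /* total buffers ever allocated */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void grow_req_bufs(void)
{
	/* A real allocation can sleep, so it must happen outside any
	 * spinlock; only the accounting is done under the lock.  That
	 * gap is what lets several growers run back to back. */
	pthread_mutex_lock(&pool_lock);
	nrqbds_avail += GROW_BATCH;
	nrqbds_total += GROW_BATCH;
	pthread_mutex_unlock(&pool_lock);
}

static void *sender(void *arg)
{
	/* Check and grow are two separate critical sections, so all
	 * NTHREADS threads can sample "avail < LOW_WATER" before any
	 * of them has finished growing the pool. */
	pthread_mutex_lock(&pool_lock);
	int need_grow = (nrqbds_avail < LOW_WATER);
	pthread_mutex_unlock(&pool_lock);

	if (need_grow)
		grow_req_bufs();
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, sender, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	/* One batch of 64 would have sufficed; with the race, up to
	 * NTHREADS * GROW_BATCH = 512 buffers get allocated. */
	printf("total allocated: %d (one batch is %d)\n",
	       nrqbds_total, GROW_BATCH);
	return 0;
}
```

Re-checking the pool level under the lock after allocating, or a "pool is growing" flag so only one thread allocates at a time, would close that window; a hard cap on the total, as the patches propose, bounds the damage instead.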
Last, do you mean that I should add some auto-tuning code to my patch (based on memory size/load, but also the number of targets? only while failover/recovery is running? ...) on top of the current, manual-only way of setting a limit?
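For what it's worth, here is one possible shape for such auto-tuning (purely a sketch; the function name, the 1/8-of-RAM fraction, and the per-RQBD size derived from the slab objects above are all my assumptions, not an agreed design): derive a default cap from total RAM and the number of targets, and let the manual tunable override it.

```c
/* Hypothetical default-cap heuristic, not an agreed design. */
#include <stdio.h>

/* Assumed per-RQBD cost: one size-32768 + one size-1024 slab object. */
#define RQBD_BUF_SIZE (32768ULL + 1024ULL)

/* Per-target cap on RQBDs; a nonzero tunable overrides the heuristic. */
static unsigned long long
rqbd_max_default(unsigned long long totalram_bytes, unsigned int ntargets,
		 unsigned long long tunable)
{
	unsigned long long budget;

	if (tunable != 0)                 /* explicit setting wins */
		return tunable;

	budget = totalram_bytes / 8;      /* spend at most 1/8 of RAM on RQBDs */
	return budget / ntargets / RQBD_BUF_SIZE;
}

int main(void)
{
	/* Example: a 128 GiB OSS carrying 20 OSTs, tunable left at 0. */
	unsigned long long cap = rqbd_max_default(128ULL << 30, 20, 0);

	printf("per-target RQBD cap: %llu RQBDs\n", cap);
	return 0;
}
```

Whether the cap should also shrink under memory pressure or tighten only while failover/recovery is running is exactly the open question above.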