As I understand it, this ticket is related to the problems reported in ticket LU-7054 and others. My interpretation is this:
- a large number of clients are "aggressively" firing RPCs at the OSSs. By aggressive, I mean each client issues many RPCs in parallel, potentially consuming all of the peer credits available on the client side.
- the OSSs are struggling with the load, specifically in the area of TX buffer management/allocation.
- when memory is low or fragmented, the OSSs can freeze in the memory allocation calls for TX buffers.
- these freezes lead to client evictions.
Please correct me if my understanding is incorrect.
So, two things need to be tuned:
1- Resources on the OSSs, to make it easier to accommodate the load.
2- Traffic shaping on the clients, to create back pressure on the application should it get too aggressive in accessing the file system.
The traffic shaping is managed by "lowering" the peer_credits value on the clients. This reduces the number of outstanding messages from any given client to any given OSS. However, you can also change the max_rpcs_in_flight parameter (a Lustre parameter, not an LNet one) to manage how many operations can be outstanding at any given time. This is the better parameter to change because, unlike peer_credits, it does not have to be the same on the two peers communicating.
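To make that concrete, here is a sketch of lowering the RPC concurrency on a client; the value 4 is purely illustrative and should be tuned for your workload:

    # On each client, at runtime (not persistent across remounts):
    lctl set_param osc.*.max_rpcs_in_flight=4

    # On recent Lustre versions the same setting can be made
    # persistent by running this on the MGS instead:
    lctl set_param -P osc.*.max_rpcs_in_flight=4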
Lowering max_rpcs_in_flight keeps the door open for increasing peer_credits. You may want to increase peer_credits on the OSSs so they are not held back when sending out responses. But, as peer_credits currently needs to be the same on all nodes, you would need to increase it on the clients as well as the OSSs. max_rpcs_in_flight will make sure the clients don't make use of the higher peer_credits value, while the OSSs do.
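As a sketch, assuming the IB LND (ko2iblnd), raising peer_credits would be done through module options on every node, since the value must match across peers; the numbers below are illustrative only:

    # /etc/modprobe.d/ko2iblnd.conf, identical on clients and OSSs;
    # takes effect on the next LNet/module reload or reboot.
    # "credits" is the per-interface total and generally needs to be
    # raised along with the per-peer value.
    options ko2iblnd peer_credits=32 credits=1024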
FMR will probably not help much here. I looked at the FMR code and do not see it using any less memory than regular buffers; in fact, it may use a little more. FMR helps when using TrueScale IB cards and when dealing with high-latency networks (like a WAN). If you are using Mellanox over a LAN, you should not see much benefit from FMR.
With regards to the resources on the OSSs, memory seems to be the key one here. The TX pool allocation system returns pools back to the system after 300 seconds. To avoid this, it is good to allocate a very large initial TX pool by setting a high ntx value. This initial pool is never returned to the system, so having a large pool means we don't need to spend time in memory allocation/deallocation routines. Of course, having a large TX pool also means having a lot of physical memory in the OSSs so they can accommodate that many pre-allocated buffers.
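A sketch of what that pre-allocation could look like on the OSSs, again assuming ko2iblnd; 2048 is an illustrative value and must be sized against the physical memory actually available:

    # /etc/modprobe.d/ko2iblnd.conf on the OSSs:
    # ntx sets the size of the initial (never-freed) TX descriptor pool.
    options ko2iblnd ntx=2048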
So, in summary, I am recommending (see the verification sketch after this list):
- Increase the ntx value on the servers
- Increase peer_credits on all systems
- Reduce max_rpcs_in_flight on the clients to ensure they do not get "too aggressive"
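Once the changes are in place, something along these lines can confirm they took effect (the /sys paths assume ko2iblnd exposes its parameters read-only, which it normally does):

    # On a client: RPC concurrency per OST device
    lctl get_param osc.*.max_rpcs_in_flight

    # On any node running the IB LND: module parameters in effect
    cat /sys/module/ko2iblnd/parameters/peer_credits
    cat /sys/module/ko2iblnd/parameters/ntx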
Thanks Mahmoud.
~ jfc.