As I understand it, this ticket is related to the problems reported in ticket LU-7054 and others. My interpretation is this:
- a large number of clients are "aggressively" firing RPCs at the OSSs. By aggressive, I mean each client issues many RPCs in parallel, potentially consuming all of the peer credits available on the client side.
- the OSSs are struggling with the load, specifically in the area of TX buffer management/allocation.
- when memory is low or fragmented, the OSSs can freeze in the memory allocation calls for TX buffers.
- these freezes lead to client evictions.
Please correct me if my understanding is incorrect.
So, two things need to be tuned:
1- Resources on the OSSs, to make it easier to accommodate the load.
2- Traffic shaping on the clients, to create back pressure on the application should it get too aggressive in accessing the file system.
The traffic shaping is managed by "lowering" the peer_credits value on the clients. This reduces the number of outstanding messages from any given client to any given OSS. However, you can also change the max_rpcs_in_flight parameter (a Lustre parameter, not an LNet one) to manage how many operations can be outstanding at any given time. This is the better parameter to change because, unlike peer_credits, it does not have to be the same on the two peers communicating.
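To make that concrete, here is a sketch of lowering the RPC concurrency on a client; the value 4 is purely illustrative and should be tuned for your workload:

    # On each client, at runtime (not persistent across remounts):
    lctl set_param osc.*.max_rpcs_in_flight=4

    # On recent Lustre versions the same setting can be made
    # persistent by running this on the MGS instead:
    lctl set_param -P osc.*.max_rpcs_in_flight=4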
Lowering max_rpcs_in_flight keeps the door open for increasing peer_credits. You may want to increase peer_credits on the OSSs so they are not held back when sending out responses. But, as peer_credits currently needs to be the same on all nodes, you would need to increase it on the clients as well as the OSSs. max_rpcs_in_flight will make sure the clients don't make use of the higher peer_credits value, while the OSSs do.
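As a sketch, assuming the IB LND (ko2iblnd), raising peer_credits would be done through module options on every node, since the value must match across peers; the numbers below are illustrative only:

    # /etc/modprobe.d/ko2iblnd.conf, identical on clients and OSSs;
    # takes effect on the next LNet/module reload or reboot.
    # "credits" is the per-interface total and generally needs to be
    # raised along with the per-peer value.
    options ko2iblnd peer_credits=32 credits=1024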
FMR will probably not help much here. I looked at the FMR code and do not see it using any less memory than regular buffers; in fact, it may use a little more. FMR helps when using TrueScale IB cards and when dealing with high-latency networks (like a WAN). If you are using Mellanox over a LAN, you should not see much benefit from FMR.
With regards to the resources on the OSSs, memory seems to be the key one here. The TX pool allocation system returns pools back to the system after 300 seconds. To avoid this, it is good to allocate a very large initial TX pool by setting a high ntx value. This initial pool is never returned to the system, so having a large pool means we don't need to spend time in memory allocation/deallocation routines. Of course, having a large TX pool also means having a lot of physical memory in the OSSs so they can accommodate that many pre-allocated buffers.
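A sketch of what that pre-allocation could look like on the OSSs, again assuming ko2iblnd; 2048 is an illustrative value and must be sized against the physical memory actually available:

    # /etc/modprobe.d/ko2iblnd.conf on the OSSs:
    # ntx sets the size of the initial (never-freed) TX descriptor pool.
    options ko2iblnd ntx=2048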
So, in summary, I am recommending (see the verification sketch after this list):
- Increase the ntx value on the servers
- Increase peer_credits on all systems
- Reduce max_rpcs_in_flight on the clients to ensure they do not get "too aggressive"
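Once the changes are in place, something along these lines can confirm they took effect (the /sys paths assume ko2iblnd exposes its parameters read-only, which it normally does):

    # On a client: RPC concurrency per OST device
    lctl get_param osc.*.max_rpcs_in_flight

    # On any node running the IB LND: module parameters in effect
    cat /sys/module/ko2iblnd/parameters/peer_credits
    cat /sys/module/ko2iblnd/parameters/ntx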
Thanks Mahmoud.
~ jfc.