Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Environment:
  server (asp4) lustre-2.14.0_21.llnl-5.t4.x86_64
  clients (oslic) lustre-2.12.9_6.llnl-2.t4.x86_64, (ruby) lustre-2.12.9_7.llnl-1.t4.x86_64
  TOSS 4.6-6
Description
mdt-aspls3-MDT0003 is stuck and not responding to clients. It has many (~244) threads stuck in ldlm_completion_ast; stopping and starting Lustre does not fix the problem.
Attachments
- asp.dmesg.logs.tgz (797 kB)
- orelic.lnet-diag.tgz (33 kB)
- ruby.lnet-diag.tgz (101 kB)
- ruby1066.log (33 kB)
Activity
Just a follow-up on this. It turns out zrelic had triple the number of local peers on o2ib100 (151), so I've further increased some of the numbers listed above to accommodate. New settings for credits and router buffers are below:
- o2iblnd
- peer_credits: 32 (up from 8)
- credits: 4096 (up from 1024)
- conns_per_peer: leave at 1 (MLNX std)
- concurrent_sends: 64 (up from 8)
- peercredits_hiw: 16 (up from 4)
- tcp (200 Gb link)
- peer_credits: 16 (up from 8, increase due to high b/w of network)
- credits: 640 (16 * 34 = 544; up from 512)
- Router buffers
- (151 * 32) + (41 * 16) = 5488. I'm doubling buffer settings to 4096/32768/4096.
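As a quick sanity check on the arithmetic in this comment (the peer counts 151 and 41 and the credit values are the ones stated above; this is just a worked illustration, not LNet tooling):

```python
# Peer counts and credit settings as stated in this comment (zrelic).
o2ib_peers, o2ib_peer_credits = 151, 32
tcp_peers, tcp_peer_credits = 41, 16

# Worst case: every local peer has all of its credits in flight at once.
worst_case = o2ib_peers * o2ib_peer_credits + tcp_peers * tcp_peer_credits
print(worst_case)  # 5488

# Doubling the previous 2048/16384/2048 pool settings (tiny/small/large):
doubled = [2 * n for n in (2048, 16384, 2048)]
print(doubled)  # [4096, 32768, 4096]
```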
Yes, if the buffer numbers appear to be large enough, there's no need to change them. Let's hope the peer_credits/credits changes you are making improve performance.
Thank you for clarifying this is only for local peers. That drastically reduces the number, with 53 local peers on o2ib100 and the same 34 on tcp0. I dropped the credits number accordingly. The revised numbers would then be:
- o2iblnd
- peer_credits: 32 (up from 8)
- credits: 2048 (up from 1024)
- conns_per_peer: leave at 1 (MLNX std)
- concurrent_sends: 64 (up from 8)
- peercredits_hiw: 16 (up from 4)
- tcp (200 Gb link)
- peer_credits: 16 (up from 8, increase due to high b/w of network)
- credits: 512 (16 * 34 = 544; currently 512)
- Router buffers
- (53 * 32) + (34 * 16) = 2240. That makes me think we could keep the current settings of 2048/16384/2048.
By my calculations, that's only about 16GB of RAM. Since we have ~256GB on the node, is there a benefit to bumping up all these buffers or could that cause latency issues as the buffers wait to be cleared?
Cameron,
Some clarifications:
- Does orelic actually have 2200 local peers? "credits" should be calculated using the number of local peers, i.e. on the same LNet. If you have a client which is separated from the servers by a bunch of routers, then you'd multiply "peer_credits" by the number of routers to get "credits" because on the local LNet for which we're tuning, the client only talks to the routers.
- Same logic applies when calculating router buffer sizes. You multiply the number of local peers by the peer_credits separately for each LNet being routed, then add the results.
- After calculating the new buffer numbers for the routers, please check if there's enough memory on the router nodes to actually create the new pools. You may need to reduce the pool sizes if there isn't enough memory.
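The sizing rules in these clarifications can be sketched as a small calculation (the function and variable names here are illustrative only, not part of any LNet tooling):

```python
# Illustrative sketch of the sizing rules described above.

def credits_for_net(peer_credits: int, local_peers: int) -> int:
    # "credits" is peer_credits multiplied by the number of *local*
    # peers, i.e. peers on the same LNet. For a routed client, the
    # routers are its local peers.
    return peer_credits * local_peers

def router_buffers(nets: dict) -> int:
    # Router buffers: multiply local peers by peer_credits separately
    # for each routed LNet, then add the results.
    return sum(peers * pc for peers, pc in nets.values())

# Example with the local-peer counts later settled on for orelic:
print(credits_for_net(32, 53))                                  # 1696
print(router_buffers({"o2ib100": (53, 32), "tcp0": (34, 16)}))  # 2240
```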
Thanks,
Serguei
Thanks for the tips Serguei. Based on what you wrote, I did a little math and came up with some suggestions.
On each orelic router, I have around 2200 peers (based on `lnetctl peer show`). Of those, only 34 peers are on tcp while the remaining are on various o2ib networks. Following my understanding of your recommendations, I would make the following changes on orelic:
- o2iblnd
- peer_credits: 32 (up from 8)
- credits: 65536 (up from 1024)
- conns_per_peer: leave at 1 (MLNX std)
- concurrent_sends: 64 (up from 8)
- peercredits_hiw: 16 (up from 4)
- tcp (200 Gb link)
- peer_credits: 16 (up from 8, increase due to high b/w of network)
- credits: 512 (16 * 34 = 544; currently 512)
- Router buffers
- (# o2ib peers * o2ib peer_credits) + (# tcp peers * tcp peer credits)
- (2166 * 32) + (34 * 16) = 69856, so round down to 65536 (up from 2048/16384/2048)
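For completeness, the total above and the round-down to a power of two can be checked like this (peer counts are the ones stated in this comment):

```python
# 2200 peers total on orelic: 34 on tcp, the remaining 2166 on o2ib nets.
o2ib_peers, tcp_peers = 2166, 34
total = o2ib_peers * 32 + tcp_peers * 16
print(total)  # 69856

# Largest power of two at or below the total (the 65536 proposed above):
rounded = 1 << (total.bit_length() - 1)
print(rounded)  # 65536
```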
Do those numbers look sane? Should I set all 3 buffers to 65536? Note these orelic routers have 256 GB of RAM.
Thanks!
Cameron
Cameron,
I don't think 2.12.9 has the socklnd conns_per_peer feature, so the orelics won't have it either.
However, there's likely still room for experiment with peer_credits and routing buffer numbers. In case you are willing to experiment, I recently put together some guidelines for LNet routing tuning: https://wiki.whamcloud.com/display/LNet/LNet+Routing+Setup+Verification+and+Tuning
This page has some suggestions for what the "peer_credits" should be, "credits" as a function of "peer_credits" and number of local peers, and similar suggestions for router buffer numbers.
Thanks,
Serguei
Serguei, both clients and orelic routers are running 2.12.9. Orelic is running an older OS and configuration management system and uses the default peer_credits value, whereas clusters like Ruby that are on the newer OS set peer_credits higher. I don't know of any reason why we couldn't raise credits on orelic and will look into changing those values today.
Cameron,
My question is specifically about o2ib100 LNet.
Ruby has two IB NIDs, one on o2ib100 and another on o2ib39. I'm guessing one of them is MLNX and another one is OPA? (You can only have one set of o2ib tunables and it looks like OPA set is used for both NIDs.)
Orelic has a tcp NID and an IB NID on o2ib100 and its o2ib100 tunings are not matching the settings seen on ruby's o2ib100 NID (when comparing "lnetctl net show -v 4" outputs)
So the question is, can you remember a specific reason for orelic using lower o2ib peer_credit settings? If there's no such reason, I'd recommend matching ruby's settings on orelic.
Which version is orelic running? Does it make use of socklnd conns_per_peer?
Thanks,
Serguei
Serguei,
Thanks for the feedback and sorry for the slow response. To answer your question: orelic is running the TOSS 3 OS (based on RHEL 7), and we are using the default LNet credit settings with the exception of:
ko2iblnd credits=1024
ksocklnd credits=512
Additional LNet-related tunings on orelic are:
lustre_common.conf:options libcfs libcfs_panic_on_lbug=1
lustre_common.conf:options libcfs libcfs_debug=0x3060580
lustre_common.conf:options ptlrpc at_min=45
lustre_common.conf:options ptlrpc at_max=600
lustre_common.conf:options ksocklnd keepalive_count=100
lustre_common.conf:options ksocklnd keepalive_idle=30
lustre_common.conf:options lnet check_routers_before_use=1
lustre_common.conf:options lnet lnet_peer_discovery_disabled=1
lustre_common.lustre212.conf:options lnet lnet_retry_count=0
lustre_common.lustre212.conf:options lnet lnet_health_sensitivity=0
lustre_router.conf:options lnet forwarding="enabled"
lustre_router.conf:options lnet tiny_router_buffers=2048
lustre_router.conf:options lnet small_router_buffers=16384
lustre_router.conf:options lnet large_router_buffers=2048
Ruby, and nearly every other system in the center is running the TOSS 4 OS, based on RHEL 8, and they also have extra tunings in addition to those two above. The routers on ruby are setting the following, which appears to take effect for both IB and OPA interfaces:
ko2iblnd-opa peer_credits=32 peer_credits_hiw=16 credits=1024 concurrent_sends=64 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
ko2iblnd credits=1024
lustre_router.conf:options ksocklnd credits=512
In case it's helpful, the other LNet-related settings on Ruby routers are:
lustre_common.conf:options libcfs libcfs_panic_on_lbug=1
lustre_common.conf:options libcfs libcfs_debug=0x3060580
lustre_common.conf:options ptlrpc at_min=45
lustre_common.conf:options ptlrpc at_max=600
lustre_common.conf:options ksocklnd keepalive_count=100
lustre_common.conf:options ksocklnd keepalive_idle=30
lustre_common.conf:options lnet check_routers_before_use=1
lustre_common.conf:options lnet lnet_health_sensitivity=0
lustre_common.conf:options lnet lnet_peer_discovery_disabled=1
lustre_router.conf:options lnet forwarding="enabled"
lustre_router.conf:options lnet tiny_router_buffers=2048
lustre_router.conf:options lnet small_router_buffers=16384
lustre_router.conf:options lnet large_router_buffers=2048
Yes, they were at 512, per the explicit setting in the modprobe file. Since that was close to what the math of peer_credits * peers came out to, I only adjusted it slightly to account for the additional peers on zrelic. If you think I would benefit from increasing those TCP credits higher (say, to 1024), I'm happy to do so, but there just aren't many clients on the tcp0 network.