[LU-17303] mdt-aspls3-MDT0003 has many threads stuck in ldlm_completion_ast "client-side enqueue returned a blocked lock, sleeping" Created: 21/Nov/23 Updated: 20/Jan/24 Resolved: 17/Jan/24 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Gian-Carlo Defazio | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
server (asp4) lustre-2.14.0_21.llnl-5.t4.x86_64 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
mdt-aspls3-MDT0003 is stuck and not responding to clients. It has many (~244) threads stuck in ldlm_completion_ast; stopping and starting Lustre does not fix the problem. |
| Comments |
| Comment by Gian-Carlo Defazio [ 21/Nov/23 ] |
|
For my notes, the local ticket is TOSS-6176 |
| Comment by Peter Jones [ 21/Nov/23 ] |
|
Oleg will assist with this |
| Comment by Oleg Drokin [ 21/Nov/23 ] |
|
On the WebEx call several things were observed:
Both of these problems were addressed by removing the dead routers from the list and disconnecting the "too new" part of the cluster. Yet it could still be observed that a client eviction occurs and is then followed by the whole MDT locking up and not processing requests, even when no mass evictions are taking place. A crashdump was collected for further analysis. Looking at the crashdump, it's unclear at this time what causes the initial eviction that leads to the missed reprocess, but addressing the reprocess should help keep the system operational at least. I would recommend applying https://review.whamcloud.com/c/fs/lustre-release/+/42031 . This is a server-only patch.
|
| Comment by Cameron Harr [ 22/Nov/23 ] |
|
Oleg, I loaded the patched Lustre and it was working well for probably 10-15 minutes before things got stuck again. I've uploaded a new dump at ftp.llnl.gov:/outgoing/harr1/LLNL-asp4-crashdump.2023-11-21-16:13:27.tgz |
| Comment by Cameron Harr [ 22/Nov/23 ] |
|
I'm available for another WebEx tomorrow if that is useful. |
| Comment by Cameron Harr [ 22/Nov/23 ] |
|
After applying the patch to the Lustre servers on "asp" and then rebooting orelic, our LNet router cluster, back into 2.12 from 2.15, we seemed to have user I/O working better. We still got a lot of noise on the server consoles with LDLM timeouts and evictions though, so we still need to figure out how to clear that up. We also got LBUGs on 6 nodes in one of the client clusters. I've attached the trace from the first of those nodes, ruby1066. |
| Comment by Oleg Drokin [ 22/Nov/23 ] |
|
In the crash dump you produced I see the LNet errors again now:
[ 2140.698463] LNetError: 77875:0:(o2iblnd_cb.c:2954:kiblnd_rejected()) 172.19.2.23@o2ib100 rejected: o2iblnd fatal error
I guess this is an artifact of 2.15 on the servers from yesterday? If you can post updated dmesg logs from the servers (or I guess we can have a webex) that would be helpful, but what you describe still looks like a progressive network-layer problem somewhere. The crash you posted should have been fixed by this patch: https://review.whamcloud.com/c/fs/lustre-release/+/40052 but it landed in time for 2.14, so I think you should have it already; can you please double-check? |
| Comment by Cameron Harr [ 22/Nov/23 ] |
|
@oleg, I've attached dmesg files from each of the servers. Note the MDS servers are asp[1-4] while asp[5-12] are OSS nodes. |
| Comment by Oleg Drokin [ 22/Nov/23 ] |
|
LNet: 34887:0:(o2iblnd_cb.c:3390:kiblnd_check_conns()) Timed out tx for 172.19.2.25@o2ib100: 55 seconds
This message indicates that the IB side was not able to get a message onto the wire for some internal timeout plus 55 seconds, a big sign of a network error (or sometimes, on Mellanox IB, of the stuck-qpairs issue, though I am not sure whether you are affected by that). Lustre then sees this as a network error when trying to send. With just this in the picture, everything is disrupted enough that RPCs are not delivered smoothly: ASTs are not delivered to clients, so locks are not released on time, and so on. Possibly related are messages of this kind: they either indicate that the other node is dead, or that it cannot get any messages through to the server. There's also this possibility for the last message: if you have filesystems with an increased obd_timeout value, then clients that mount all such filesystems are confused about which value should be used and use the longer one, while the servers, expecting a quicker turnaround on the lower value, do not get it, and this leads to disaster; so that is worth double-checking as well just in case (there's a workaround patch for that in modern releases). There are more LNet errors that I don't know the meaning of, but I am sure they are nothing good:
LNetError: 477:0:(o2iblnd_cb.c:2954:kiblnd_rejected()) 172.19.2.23@o2ib100 rejected: o2iblnd fatal error
LNetError: 477:0:(lib-move.c:3756:lnet_handle_recovery_reply()) peer NI (172.19.2.23@o2ib100) recovery failed with -111 |
| Comment by Serguei Smirnov [ 22/Nov/23 ] |
|
I don't see symptoms of the "stuck QP" issue among the errors I browsed through. "o2iblnd fatal error" could be the result of trying to connect to an IB-enabled node which is not running LNet. There are other RDMA timeout errors which make it look like nodes are running out of peer credits. I think it would make sense to check the configuration of routers on the path:
lnetctl net show -v 4
lnetctl routing show
cat /sys/kernel/debug/lnet/peers
Please also let me know how many local peers are expected to be handled by a router: perhaps there's reason to increase the numbers of routing buffers as well as credits. Thanks, Serguei. |
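A hedged sketch of collecting those three outputs from each router in one pass (the hostnames are placeholders, not the actual router node names):
# gather LNet configuration and peer state from each router for comparison
for r in router1 router2; do
    ssh "$r" 'lnetctl net show -v 4; lnetctl routing show; cat /sys/kernel/debug/lnet/peers' > "$r.lnet.txt"
done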
| Comment by Olaf Faaland [ 22/Nov/23 ] |
|
Hi, Our clients are running a 2.12.9-based stack, with not many patches on top. I see https://review.whamcloud.com/c/fs/lustre-release/+/40052 is against master. Can you push to b2_12, even if you don't plan to land it, so we get your backport and it goes through test? I realize it may be a trivial backport; I ask because Eric is new and can't evaluate it on his own yet. Thanks |
| Comment by Olaf Faaland [ 22/Nov/23 ] |
|
Oops, looks like I did that some time in the past. Is my backport correct/sufficient? Thanks |
| Comment by Oleg Drokin [ 23/Nov/23 ] |
|
yes, the backport seems to be correct |
| Comment by Cameron Harr [ 23/Nov/23 ] |
|
Serguei, Sorry for the delay. I've uploaded the output from the client cluster routers (ruby) and the router cluster routers (orelic). Let me know if you need anything else. |
| Comment by Serguei Smirnov [ 23/Nov/23 ] |
|
Hi Cameron, Before getting into router buffer number calculations, one thing which stands out in the provided outputs is the mismatch of o2iblnd tunables values on o2ib100 NIDs of ruby and orelic: 32/16/64 vs 8/4/8 for peer_credits/peer_credits_hiw/concurrent_sends. When they talk to each other, they will try to negotiate down to 8/4 for peer_credits/peer_credits_hiw which may actually end up being 8/7 on ruby. I'd recommend making sure these match, 32/16/64 are the recommended settings unless there's reason to throttle the flow in your case by using lower values. Are you using 8/4/8 on orelic to match the TCP NIDs settings? Thanks, Serguei. |
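A hedged sketch of what aligning orelic's o2iblnd settings with the recommended 32/16/64 values could look like as module options (whether this belongs in an existing modprobe configuration file on orelic is an assumption; the parameter names are the standard ko2iblnd ones discussed above):
options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64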
| Comment by Cameron Harr [ 27/Nov/23 ] |
|
Serguei, Thanks for the feedback and sorry for the slow response. In answer to your question, orelic is running the TOSS 3 OS, based on RHEL 7, and we are using default LNet credit settings with the exception of:
ko2iblnd credits=1024
ksocklnd credits=512
Additional LNet-related tunings on orelic are:
lustre_common.conf:options libcfs libcfs_panic_on_lbug=1
lustre_common.conf:options libcfs libcfs_debug=0x3060580
lustre_common.conf:options ptlrpc at_min=45
lustre_common.conf:options ptlrpc at_max=600
lustre_common.conf:options ksocklnd keepalive_count=100
lustre_common.conf:options ksocklnd keepalive_idle=30
lustre_common.conf:options lnet check_routers_before_use=1
lustre_common.conf:options lnet lnet_peer_discovery_disabled=1
lustre_common.lustre212.conf:options lnet lnet_retry_count=0
lustre_common.lustre212.conf:options lnet lnet_health_sensitivity=0
lustre_router.conf:options lnet forwarding="enabled"
lustre_router.conf:options lnet tiny_router_buffers=2048
lustre_router.conf:options lnet small_router_buffers=16384
lustre_router.conf:options lnet large_router_buffers=2048
Ruby, and nearly every other system in the center, is running the TOSS 4 OS, based on RHEL 8, and they also have extra tunings in addition to those two above. The routers on ruby are setting the following, which appears to take effect for both the IB and OPA interfaces:
ko2iblnd-opa peer_credits=32 peer_credits_hiw=16 credits=1024 concurrent_sends=64 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
ko2iblnd credits=1024
lustre_router.conf:options ksocklnd credits=512
In case it's helpful, the other LNet-related settings on the Ruby routers are:
lustre_common.conf:options libcfs libcfs_panic_on_lbug=1
lustre_common.conf:options libcfs libcfs_debug=0x3060580
lustre_common.conf:options ptlrpc at_min=45
lustre_common.conf:options ptlrpc at_max=600
lustre_common.conf:options ksocklnd keepalive_count=100
lustre_common.conf:options ksocklnd keepalive_idle=30
lustre_common.conf:options lnet check_routers_before_use=1
lustre_common.conf:options lnet lnet_health_sensitivity=0
lustre_common.conf:options lnet lnet_peer_discovery_disabled=1
lustre_router.conf:options lnet forwarding="enabled"
lustre_router.conf:options lnet tiny_router_buffers=2048
lustre_router.conf:options lnet small_router_buffers=16384
lustre_router.conf:options lnet large_router_buffers=2048
|
| Comment by Serguei Smirnov [ 28/Nov/23 ] |
|
Cameron, My question is specifically about the o2ib100 LNet. Ruby has two IB NIDs, one on o2ib100 and another on o2ib39. I'm guessing one of them is MLNX and the other is OPA? (You can only have one set of o2ib tunables, and it looks like the OPA set is used for both NIDs.) Orelic has a tcp NID and an IB NID on o2ib100, and its o2ib100 tunings do not match the settings seen on ruby's o2ib100 NID (when comparing "lnetctl net show -v 4" outputs). So the question is: can you remember a specific reason for orelic using lower o2ib peer_credits settings? If there's no such reason, I'd recommend matching ruby's settings on orelic. Which version is orelic running? Does it make use of socklnd conns_per_peer? Thanks, Serguei
|
| Comment by Cameron Harr [ 28/Nov/23 ] |
|
Serguei, both the clients and the orelic routers are running 2.12.9. Orelic is running an older OS and configuration management system and uses the default peer credits value, whereas clusters like Ruby that are on the newer OS set peer credits higher. I don't know of any reason why we couldn't raise credits on orelic and will look into changing those values today. |
| Comment by Serguei Smirnov [ 28/Nov/23 ] |
|
Cameron, I don't think 2.12.9 has the socklnd conns_per_peer feature, so the orelics won't have it either. However, there's likely still room to experiment with peer_credits and routing buffer numbers. In case you are willing to experiment, I recently put together some guidelines for LNet routing tuning: https://wiki.whamcloud.com/display/LNet/LNet+Routing+Setup+Verification+and+Tuning This page has suggestions for what "peer_credits" should be, for "credits" as a function of "peer_credits" and the number of local peers, and similar suggestions for router buffer numbers. Thanks, Serguei
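As a rough, hedged illustration of the sizing relationship described on that page (the counts below are made-up example numbers, not measurements from orelic): with peer_credits=32 and 50 local peers on an interface, "credits" on that interface needs to cover roughly 32 * 50 = 1600, so a value of 1600 or more (rounded up, e.g. 2048) would be in line with the guideline.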
|
| Comment by Cameron Harr [ 28/Nov/23 ] |
|
Thanks for the tips, Serguei. Based on what you wrote, I did a little math and came up with some suggestions. On each orelic router, I have around 2200 peers (based on `lnetctl peer show`). Of those, only 34 peers are on tcp while the rest are on various o2ib networks. Following my understanding of your recommendations, I would make the following changes on orelic:
Do those bold numbers look sane? Should I set all 3 buffers to 65536? Note these orelic routers have 256 GB of RAM. Thanks! Cameron |
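A hedged sketch of one way to approximate the per-network peer counts mentioned above (the grep only counts primary NIDs, so peers whose primary NID is on another network are missed; treat the results as rough):
lnetctl peer show | grep 'primary nid' | grep -c '@o2ib100'
lnetctl peer show | grep 'primary nid' | grep -c '@tcp'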
| Comment by Serguei Smirnov [ 28/Nov/23 ] |
|
Cameron, Some clarifications:
Thanks, Serguei |
| Comment by Cameron Harr [ 28/Nov/23 ] |
|
Thank you for clarifying this is only for local peers. That drastically reduces the number, with 53 local peers on o2ib100 and the same 34 on tcp0. I dropped the credits number accordingly. The revised numbers would then be:
By my calculations, that's only about 16GB of RAM. Since we have ~256GB on the node, is there a benefit to bumping up all these buffers or could that cause latency issues as the buffers wait to be cleared? |
| Comment by Serguei Smirnov [ 28/Nov/23 ] |
|
Yes, if the buffer numbers appear to be large enough, there's no need to change them. Let's hope peer_credits/credits changes you are making can improve performance. |
| Comment by Cameron Harr [ 28/Nov/23 ] |
|
Just a follow-up on this. It turns out zrelic had triple the number of local peers on o2ib100 (151), so I've further increased some of the numbers listed above to accommodate. New settings for credits and router buffers are below in red:
|
| Comment by Serguei Smirnov [ 29/Nov/23 ] |
|
Weren't orelic tcp "credits" at 512? |
| Comment by Cameron Harr [ 29/Nov/23 ] |
|
Yes, they were at 512, per the explicit setting in the modprobe file. Since that was close to what the math of peer_credits * peers came out to be, I only adjusted it slightly due to more peers on zrelic. If you think I would benefit from increasing those TCP credits up higher (like to 1024), I'm happy to do so, but there just aren't many clients in the tcp0 network. |
| Comment by Serguei Smirnov [ 29/Nov/23 ] |
|
I think 640 for orelic tcp credits is justified in your case. I was just confused by the "up from 8" bit in your previous comment, but that was probably just a copy-paste thing if you can confirm "credits" were indeed at 512 before, as the "lnetctl net show -v 4" output indicated. So I think it makes sense to test with the new settings you came up with. |
| Comment by Cameron Harr [ 29/Nov/23 ] |
|
Since setting those values yesterday, at least one of the nodes (asp4) that was logging lots of errors cleaned up immediately and has remained clean since. |
| Comment by Serguei Smirnov [ 30/Nov/23 ] |
|
Cameron, are you planning to run any performance tests to compare "before" and "after"? |
| Comment by Cameron Harr [ 30/Nov/23 ] |
|
I'm not, though it would be interesting to see what difference there would be. My main purpose here was to clean up the Lustre fabric and connection issues, and things do look significantly cleaner since implementing those changes. |
| Comment by Serguei Smirnov [ 30/Nov/23 ] |
|
If you find you want further performance improvements, you may want to consider upgrading to a version supporting socklnd conns_per_peer and/or checking whether an increase of socklnd nscheds helps. |
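A hedged sketch of how those two socklnd knobs could be expressed as module options on a release that supports them (the values are purely illustrative, not recommendations for this site):
options ksocklnd conns_per_peer=2 nscheds=6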
| Comment by Cameron Harr [ 30/Nov/23 ] |
|
Yes, we've been trying to move to 2.15 for over a year but have kept running into complex LNet issues with orelic and other routers. Talking internally this morning, we want to try 2.15 again in case these new tunings resolve the problems we were seeing. |
| Comment by Cameron Harr [ 30/Nov/23 ] |
|
Serguei, Thank you very much for helping us nail down these LNet tunings. It's something we've wanted to do for a long time but had a hard time finding straightforward documentation on how to do so. |
| Comment by Gian-Carlo Defazio [ 17/Jan/24 ] |
|
We haven't seen this issue since applying the suggested LNet tunables to our relic clusters. |