LU-17303: mdt-aspls3-MDT0003 has many threads stuck in ldlm_completion_ast "client-side enqueue returned a blocked lock, sleeping"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Environment:
      server (asp4) lustre-2.14.0_21.llnl-5.t4.x86_64
      clients (oslic) lustre-2.12.9_6.llnl-2.t4.x86_64, (ruby) lustre-2.12.9_7.llnl-1.t4.x86_64
      TOSS 4.6-6

    Description

      mdt-aspls3-MDT0003 is stuck and not responding to clients. It has many (~244) threads stuck in ldlm_completion_ast; stopping and starting Lustre does not fix the problem.

      Attachments

        1. asp.dmesg.logs.tgz
          797 kB
        2. orelic.lnet-diag.tgz
          33 kB
        3. ruby.lnet-diag.tgz
          101 kB
        4. ruby1066.log
          33 kB

        Activity

          charr Cameron Harr added a comment -

          Yes, we've been trying to move to 2.15 for over a year but have kept running into complex LNet issues with orelic and other routers. Talking internally this morning, we want to try 2.15 again in case these new tunings resolve the problems we were seeing.


          ssmirnov Serguei Smirnov added a comment -

          If you want to achieve performance improvements, you may want to consider upgrading to a version that supports socklnd conns_per_peer and/or checking whether increasing socklnd nscheds helps.
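
          Assuming a router running a socklnd new enough to expose these as module parameters (conns_per_peer only appeared around 2.15), a minimal sketch of where the two knobs mentioned above would typically be set; the values are purely illustrative:

            # /etc/modprobe.d/ksocklnd.conf (illustrative values, not a recommendation)
            # nscheds: number of socklnd scheduler threads per CPT pool
            # conns_per_peer: TCP connections opened per peer (newer socklnd only)
            options ksocklnd nscheds=4 conns_per_peer=4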
          charr Cameron Harr added a comment -

          I'm not, though that would be interesting to see what difference there would be. My main purpose here was to clean up the Lustre fabric and connection issues, and things do look significantly cleaner since implementing those changes.


          ssmirnov Serguei Smirnov added a comment -

          Cameron, are you planning to run any performance tests to compare "before" and "after"?
          charr Cameron Harr added a comment -

          Since setting those values yesterday, at least one of the nodes (asp4) that was logging lots of errors cleaned up immediately and has remained clean since.


          ssmirnov Serguei Smirnov added a comment -

          I think 640 for orelic tcp credits is justified in your case. I was just confused by the "up from 8" bit in your previous comment, but that was probably just a copy-paste artifact, if you can confirm that "credits" were indeed at 512 before, as the "lnetctl net show -v 4" output indicated.

          So I think it makes sense to test with the new settings you came up with.
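
          For reference, the per-NI "credits" value being discussed can be confirmed with the same verbose listing mentioned above; the --net filter and grep pattern are only illustrative, since the verbose YAML also carries peer_credits and other similarly named fields:

            # show verbose NI tunables for the tcp network and pick out credit-related lines
            lnetctl net show -v 4 --net tcp | grep -E 'net type|credits'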

          Yes, they were at 512, per the explicit setting in the modprobe file. Since that was close to what the math of peer_credits * peers came out to be, I only adjusted it slightly to account for more peers on zrelic. If you think I would benefit from increasing those TCP credits further (e.g. to 1024), I'm happy to do so, but there just aren't many clients on the tcp0 network.

          ssmirnov Serguei Smirnov added a comment - edited

          Weren't orelic tcp "credits" at 512?

          charr Cameron Harr added a comment - edited

          Just a follow-up on this. It turns out zrelic had triple the number of local peers on o2ib100 (151), so I've further increased some of the numbers listed above to accommodate. New settings for credits and router buffers are below:

          • o2iblnd
            • peer_credits: 32 (up from 8)
            • credits: 4096 (up from 1024)
            • conns_per_peer: leave at 1 (MLNX std)
            • concurrent_sends: 64 (up from 8)
            • peercredits_hiw: 16 (up from 4)
          • tcp (200 Gb link)
            • peer_credits: 16 (up from 8, increase due to high b/w of network)
            • credits: 640 (16 * 34=544; up from 8)
          • Router buffers
            • (151 * 32) + (41 * 16) = 5488. I'm doubling buffer settings to  4096/32768/4096.
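
          Taken together, a hedged sketch of how the values above might look as module options on the router. The mapping of the bullet names onto ko2iblnd/ksocklnd/lnet module parameters is an assumption and should be verified against the installed release; conns_per_peer is left at its default of 1, as noted in the list:

            # /etc/modprobe.d/lnet-router.conf (sketch of the values listed above)
            # o2ib100 side
            options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64 credits=4096
            # tcp0 side (200 Gb link)
            options ksocklnd peer_credits=16 credits=640
            # router buffers: tiny/small/large
            options lnet tiny_router_buffers=4096 small_router_buffers=32768 large_router_buffers=4096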

          ssmirnov Serguei Smirnov added a comment -

          Yes, if the buffer numbers appear to be large enough, there's no need to change them. Let's hope the peer_credits/credits changes you are making can improve performance.

          charr Cameron Harr added a comment -

          Thank you for clarifying this is only for local peers. That drastically reduces the number, with 53 local peers on o2ib100 and the same 34 on tcp0. I dropped the credits number accordingly. The revised numbers would then be:

          • o2iblnd
            • peer_credits: 32 (up from 8)
            • credits: 2048 (up from 1024)
            • conns_per_peer: leave at 1 (MLNX std)
            • concurrent_sends: 64 (up from 8)
            • peercredits_hiw: 16 (up from 4)
          • tcp (200 Gb link)
            • peer_credits: 16 (up from 8, increase due to high b/w of network)
            • credits: 512 (16 * 34=544; up from 8)
          • Router buffers
            • (53 * 32) + (34 * 16) = 2240. That makes me think we could keep the current settings of 2048/16384/2048.

          By my calculations, that's only about 16GB of RAM. Since we have ~256GB on the node, is there a benefit to bumping up all these buffers or could that cause latency issues as the buffers wait to be cleared?

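          On a node with routing enabled, the router buffer pools discussed here can also be adjusted at runtime with lnetctl rather than through module options; a brief sketch using the "keep the current settings" numbers from this comment:

            # runtime adjustment of the router buffer pools (tiny/small/large)
            lnetctl set tiny_buffers 2048
            lnetctl set small_buffers 16384
            lnetctl set large_buffers 2048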

          People

            green Oleg Drokin
            defazio Gian-Carlo Defazio
            Votes: 0
            Watchers: 9
