[LU-16283] o2iblnd.c:3049:kiblnd_shutdown() <NID>: waiting for <N> peers to disconnect Created: 01/Nov/22 Updated: 29/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | 4.18.0-372.32.1.1toss.t4.x86_64 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Description |
|
Observed on a Lustre router node while the servers and some of the clients were up and connected. The Lustre router node has Omni-Path on the client side and IB on the Lustre server side.

"lnetctl lnet unconfigure" hangs with this stack:

[<0>] kiblnd_shutdown+0x347/0x4e0 [ko2iblnd]
[<0>] lnet_shutdown_lndni+0x2b6/0x4c0 [lnet]
[<0>] lnet_shutdown_lndnet+0x6c/0xb0 [lnet]
[<0>] lnet_shutdown_lndnets+0x11e/0x300 [lnet]
[<0>] LNetNIFini+0xb7/0x130 [lnet]
[<0>] lnet_ioctl+0x220/0x260 [lnet]
[<0>] notifier_call_chain+0x47/0x70
[<0>] blocking_notifier_call_chain+0x42/0x60
[<0>] libcfs_psdev_ioctl+0x346/0x590 [libcfs]
[<0>] do_vfs_ioctl+0xa5/0x740
[<0>] ksys_ioctl+0x64/0xa0
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6

The debug log shows it is still waiting for 3 peers, even after 3700 seconds:

00000800:00000200:1.0:1667256015.359743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
...
00000800:00000200:3.0:1667259799.039743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect

Before the shutdown there were 38 peers, all reported as "up".

For the patch stack, see https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl

For my reference, my local ticket is TOSS5826 |
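For anyone trying to reproduce this, a minimal sketch of how a stack and debug log like the above can be captured while the unconfigure is stuck (assuming a standard install with lctl and lnetctl in PATH; the log path is arbitrary):

# Enable net debug messages so kiblnd_shutdown() progress is logged
lctl set_param debug=+net

# Run the unconfigure in the background; it is expected to hang here
lnetctl lnet unconfigure &
LNETCTL_PID=$!
sleep 60

# Kernel stack of the stuck process (should show kiblnd_shutdown)
cat /proc/${LNETCTL_PID}/stack

# Dump the kernel debug buffer and look for the wait messages
lctl dk > /tmp/lnet-shutdown-debug.log
grep 'waiting for' /tmp/lnet-shutdown-debug.log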
| Comments |
| Comment by Olaf Faaland [ 01/Nov/22 ] |
|
I can provide the crash dump of mutt4 if helpful. Just let me know how to send it. |
| Comment by Peter Jones [ 01/Nov/22 ] |
|
Olaf I think that uploading the crash dump to the Whamcloud ftp site is the best option.
$ ncftp ftp.whamcloud.com
NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 99.96.190.235...
(vsFTPd 2.2.2)
Logging in...
Login successful.
Logged in to ftp.whamcloud.com.
ncftp / > cd uploads
Directory successfully changed.
ncftp /uploads > mkdir AB-1234
ncftp /uploads > cd AB-1234
Directory successfully changed.
ncftp /uploads/AB-1234 > put file or *
Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created.

Serguei, could you please investigate? Thanks, Peter |
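For a non-interactive alternative to the ncftp session above, something like this should also work (a sketch, assuming the write-only server permits directory creation this way; the file name and the LU-16283 directory are illustrative):

# Upload a crash dump with curl, creating the per-ticket directory
curl --ftp-create-dirs -T vmcore.tar.gz ftp://ftp.whamcloud.com/uploads/LU-16283/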
| Comment by Olaf Faaland [ 01/Nov/22 ] |
|
Crash dumps, kernel modules, and debug packages sent. Thanks |
| Comment by Serguei Smirnov [ 01/Nov/22 ] |
|
Hi Olaf,

After a quick initial examination, it looks like the llnl branch could benefit from

I haven't finished looking at the dumps you provided yet, so I will update later. Thanks, Serguei. |
| Comment by Olaf Faaland [ 01/Nov/22 ] |
|
Thanks, Serguei. I don't know why this would be the case, but I've only noticed this on router nodes so far. |
| Comment by Serguei Smirnov [ 02/Nov/22 ] |
|
Hi Olaf,

Here are a couple more patches that fix problems which may lead to the same symptom:

https://review.whamcloud.com/#/c/fs/lustre-release/+/46711/
https://review.whamcloud.com/41937

The latter is a rollback of

On my side, I wasn't able to build 2.15.1_7.llnl because of a missing dependency needed for KFI and no way to disable building KFI, so I ran some tests with vanilla 2.15 instead. I haven't reproduced the "indefinite hang" yet: after a few minutes (I think this is related to CM-level timeouts in IB) all peers get disconnected for me.

How easy is it to reproduce the issue in your lab? Do you need to do anything special to get it to happen (e.g. pull cables, shut down nodes in a particular order)?

Thanks, Serguei.
|
| Comment by Olaf Faaland [ 07/Nov/22 ] |
|
Hi Serguei,

I'm happy to try adding those patches to our stack. I've updated the description with a little more information, but essentially this resulted from attempting to stop LNet, so I could experiment with patch stacks, while the clients and servers remained up. No pulling of cables, powering off nodes, etc.

You should have been able to just skip the KFI build and therefore not need its dependencies. I'll check the spec file and configure check; thanks for letting me know. |
| Comment by Serguei Smirnov [ 08/Nov/22 ] |
|
Olaf,

I'm still figuring out a reliable reproducer; I haven't tested with a routing setup yet. What are the sysctl parameter settings on your system?

sysctl -a | grep arp_filter
sysctl -a | grep arp_ignore
sysctl -a | grep arp_announce
sysctl -a | grep rp_filter

Are you using the settings recommended here: https://wiki.whamcloud.com/display/LNet/MR+Cluster+Setup ?

The reason I'm asking is that on one occasion I did manage to reproduce the issue with o2iblnd peers failing to disconnect on shutdown, but then I checked these settings on my system and adjusted some because they didn't match the recommended values. Since then I haven't been able to reproduce the issue. I'll have to do more tests to be confident that the issue cannot be reproduced with the recommended settings, but I thought we could just check what these are in your case.

Thanks, Serguei. |
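A compact way to collect all of these for the relevant interfaces in one pass (a sketch; the interface names hsi0 and san0 come from Olaf's reply below, and "all"/"default" are included because they influence the effective value):

# Collect the ARP/rp_filter settings that matter for LNet interface selection
for param in arp_filter arp_ignore arp_announce rp_filter; do
    for intf in all default hsi0 san0; do
        sysctl net.ipv4.conf.${intf}.${param}
    done
done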
| Comment by Olaf Faaland [ 08/Nov/22 ] |
|
Hi Serguei,

Here are the sysctls you asked for, for the two interfaces used by LNet - hsi0 is the internal OPA fabric and san0 is the external fabric with the Lustre servers. I can send you the values for all the interfaces (there are many!) if you need them.

net.ipv4.conf.hsi0.arp_filter = 0
net.ipv4.conf.san0.arp_filter = 0
net.ipv4.conf.hsi0.arp_ignore = 0
net.ipv4.conf.san0.arp_ignore = 0
net.ipv4.conf.hsi0.arp_announce = 0
net.ipv4.conf.san0.arp_announce = 0
net.ipv4.conf.hsi0.rp_filter = 1
net.ipv4.conf.san0.rp_filter = 1

I need to read through the confluence page, but a quick scan makes me think our settings are not consistent with those recommendations.

thanks, |
| Comment by Serguei Smirnov [ 09/Nov/22 ] |
|
Hi Olaf,

I don't need to see the sysctl settings for all interfaces, only for those involved, but I would like to see the "all" and "default" settings as these may affect which value gets used, depending on the parameter. For example, the max of the {all, interface} values is used for rp_filter. In any case, this probably matters most in the MR case, when there are multiple interfaces on the same LNet network.

So far the reproducer I found for my local system is this:
The messages in the debug log on the server at the time when it gets stuck look similar to this:

00000800:00020000:1.0:1667958274.439619:0:1967799:0:(o2iblnd_cb.c:2490:kiblnd_passive_connect()) Can't accept conn from 10.1.0.50@o2ib on NA (ib1:1:10.1.0.21): bad dst nid 10.1.0.21@o2ib
00000800:00020000:7.0:1667958279.437379:0:1983222:0:(o2iblnd_cb.c:2490:kiblnd_passive_connect()) Can't accept conn from 10.1.0.50@o2ib on NA (ib1:1:10.1.0.21): bad dst nid 10.1.0.21@o2ib
00000800:00020000:7.0:1667958279.450225:0:1983222:0:(o2iblnd_cb.c:2490:kiblnd_passive_connect()) Can't accept conn from 10.1.0.50@o2ib on NA (ib1:1:10.1.0.21): bad dst nid 10.1.0.21@o2ib
00000800:00000200:4.0:1667958310.280478:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect
00000800:00000100:4.0:1667958316.514548:0:1758823:0:(o2iblnd.c:2530:kiblnd_set_ni_fatal_on()) Fatal device error for NI 10.1.0.21@o2ib
00000800:00000200:3.0F:1667958571.400477:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect
00000800:00000200:2.0:1667959094.664490:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect
00000800:00000200:5.0F:1667960142.216481:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect

The "Fatal device error" reported against the remaining connected IB interface can be ignored, as it is caused by shutting down the IB switch.

I tested with and without "rollback of

Thanks, Serguei.
|
| Comment by Olaf Faaland [ 10/Nov/22 ] |
|
Hi Serguei,

Here are the rest of the sysctls:

net.ipv4.conf.all.arp_announce = 0
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.arp_ignore = 0
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.arp_announce = 0
net.ipv4.conf.default.arp_filter = 0
net.ipv4.conf.default.arp_ignore = 0
net.ipv4.conf.default.rp_filter = 1

In my case, I have only one LNet NI per network. Each router node has 2 OPA links (called hsi[01], one not configured in LNet) and one IB link (called san0). In case it helps:

[root@mutt4:~]# lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: o2ib44
local NI(s):
- nid: 192.168.128.4@o2ib44
status: up
interfaces:
0: hsi0
- net type: o2ib100
local NI(s):
- nid: 172.19.1.108@o2ib100
status: up
interfaces:
0: san0
[root@mutt4:~]# ibstat | grep -w -e CA -e State -e Physical -e Firmware
CA 'hfi1_0'
CA type:
Firmware version: 1.27.0
State: Active
Physical state: LinkUp
CA 'hfi1_1'
CA type:
Firmware version: 1.27.0
State: Active
Physical state: LinkUp
CA 'mlx5_0'
CA type: MT4123
Firmware version: 20.32.2004
State: Active
Physical state: LinkUp
CA 'mlx5_bond_0'
CA type: MT4125
Firmware version: 22.32.2004
State: Active
Physical state: LinkUp
|
| Comment by Serguei Smirnov [ 24/Nov/22 ] |
|
Hi Olaf,

On my local setup, using b2_15 and the steps to reproduce from the earlier comment, it appears that https://review.whamcloud.com/46711 fixes the issue with getting stuck on shutdown. On the other hand, on the master branch, checking out the commit immediately before this fix causes the issue to appear. Even though my reproducer is different, I think it is a good candidate to try in your environment.

Thanks, Serguei. |
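For reference, one way to pull a Gerrit change such as 46711 onto a local branch for testing (a sketch using Gerrit's standard refs/changes naming; the patchset number 1 is a placeholder, check the review page for the latest):

# Fetch patchset 1 of change 46711 from Whamcloud Gerrit and apply it
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/11/46711/1
git cherry-pick FETCH_HEAD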
| Comment by Olaf Faaland [ 29/Nov/22 ] |
|
Thanks, Serguei. I hope to test it tomorrow. |
| Comment by Olaf Faaland [ 01/Dec/22 ] |
|
Hi Serguei,

I performed a test with https://review.whamcloud.com/46711 and still see "waiting for 1 peers to disconnect".

My reproducer:

Just to be sure, note that we are not using MR.

thanks, |
| Comment by Serguei Smirnov [ 02/Dec/22 ] |
|
Hi Olaf,

It looks like I'm able to reproduce the issue using a similar setup. I was using two routers, routing between ib and tcp networks, and lnet_selftest to generate traffic between the ib server and the tcp client. I should be able to use this to look further into fixing this properly.

In the meantime, though, I experimented with executing "lnetctl set routing 0" on the router node before running "lustre_rmmod" on it, which seems to prevent it from getting stuck. I wonder if you can give this extra step a try to see if it helps in your case too, as a kind of temporary workaround.

Thanks, Serguei. |
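The full workaround sequence on the router node would then look roughly like this (a sketch assembled from the commands named above):

lnetctl set routing 0      # stop forwarding so router peers drop their connections
lnetctl lnet unconfigure   # tear down LNet; previously hung in kiblnd_shutdown()
lustre_rmmod               # unload the Lustre/LNet modules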
| Comment by Olaf Faaland [ 05/Dec/22 ] |
|
> I experimented with executing "lnetctl set routing 0" on the router node

Good idea. Doing this before "lnetctl net unconfigure" prevents the hang in kiblnd_shutdown(), thanks. |
| Comment by Olaf Faaland [ 21/Dec/22 ] |
|
Hi Serguei, Do you have any update on this issue? Thanks |
| Comment by Serguei Smirnov [ 21/Dec/22 ] |
|
Hi Olaf,

Sorry, not yet. It does not address the root cause, but for lack of a better idea I was considering changing the shutdown procedure to include "lnetctl set routing 0"; I haven't submitted the patch yet.

Thanks, Serguei. |
| Comment by Olaf Faaland [ 21/Dec/22 ] |
|
Yes, we are adding "lnetctl set routing 0" to the shutdown tasks in our lnet service file after the holiday break. |
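A hypothetical systemd drop-in illustrating that kind of service-file change (a sketch only; the lnet.service unit name, paths, and ExecStop chain are assumptions, not LLNL's actual service file):

mkdir -p /etc/systemd/system/lnet.service.d
cat > /etc/systemd/system/lnet.service.d/routing-off.conf <<'EOF'
[Service]
# Clear any inherited ExecStop, then stop routing before tearing down LNet
ExecStop=
ExecStop=/usr/sbin/lnetctl set routing 0
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod
EOF
systemctl daemon-reload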
| Comment by Olaf Faaland [ 09/Jan/23 ] |
|
Hi Serguei, |
| Comment by Serguei Smirnov [ 09/Jan/23 ] |
|
Hi Olaf,

I haven't been able to conclusively identify the problem yet. I believe it has to do with some sort of race on LNet shutdown, but that much is obvious. The workaround you applied should be good for most cases; the only scenario it doesn't cover is probably when active router NIs are being brought down/up dynamically.

Thanks, Serguei. |
| Comment by Olaf Faaland [ 19/Apr/23 ] |
|
This workaround seems to be working well for us. We do not normally bring NIs up and down dynamically in production, so the potential problem scenario probably won't occur and thus won't hurt us. I'll remove topllnl but leave the ticket open until an actual fix is identified. |