[LU-16283] o2iblnd.c:3049:kiblnd_shutdown() <NID>: waiting for <N> peers to disconnect Created: 01/Nov/22  Updated: 29/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None
Environment:

4.18.0-372.32.1.1toss.t4.x86_64
lustre-2.15.1_7.llnl-2.t4.x86_64


Attachments: File dk.mutt4.1.gz     File dk.mutt4.2.gz     File dk.mutt4.3.gz     File dmesg.mutt4.1667256190.gz     File dmesg.mutt4.1667259716.gz     File lnetctl.peer.show.mutt4.1.gz    
Issue Links:
Related
is related to LU-17480 lustre_rmmod hangs if a lnet route is... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Observed on a Lustre router node while the servers and some of the clients were up and connected. The router node has Omni-Path (OPA) on the client side and InfiniBand on the Lustre server side.

lnetctl lnet unconfigure 

hangs with this stack:

[<0>] kiblnd_shutdown+0x347/0x4e0 [ko2iblnd]
[<0>] lnet_shutdown_lndni+0x2b6/0x4c0 [lnet]
[<0>] lnet_shutdown_lndnet+0x6c/0xb0 [lnet]
[<0>] lnet_shutdown_lndnets+0x11e/0x300 [lnet]
[<0>] LNetNIFini+0xb7/0x130 [lnet]
[<0>] lnet_ioctl+0x220/0x260 [lnet]
[<0>] notifier_call_chain+0x47/0x70
[<0>] blocking_notifier_call_chain+0x42/0x60
[<0>] libcfs_psdev_ioctl+0x346/0x590 [libcfs]
[<0>] do_vfs_ioctl+0xa5/0x740
[<0>] ksys_ioctl+0x64/0xa0
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 

The debug log shows it still waiting for 3 peers to disconnect, even after more than 3700 seconds:

00000800:00000200:1.0:1667256015.359743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect 
...
00000800:00000200:3.0:1667259799.039743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect

Before the shutdown there were 38 peers, all reported as "up".
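
For anyone checking whether they are hitting the same wait, roughly the following captures the evidence attached here (the pgrep pattern and output path are just examples):

# dump the kernel stack of the stuck unconfigure, then capture the LNet debug log
pid=$(pgrep -f 'lnetctl lnet unconfigure')
cat /proc/$pid/stack
lctl dk > /tmp/dk.$(hostname).$(date +%s)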

For patch stack, see https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl

For my reference, my local ticket is TOSS5826



 Comments   
Comment by Olaf Faaland [ 01/Nov/22 ]

I can provide the crash dump of mutt4 if helpful. Just let me know how to send it.

Comment by Peter Jones [ 01/Nov/22 ]

Olaf

I think that uploading the crash dump to the Whamcloud ftp site is the best option.

$ ncftp ftp.whamcloud.com
NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
Connecting to 99.96.190.235...
(vsFTPd 2.2.2)
Logging in...
Login successful.
Logged in to ftp.whamcloud.com.
ncftp / > cd uploads
Directory successfully changed.
ncftp /uploads > mkdir AB-1234
ncftp /uploads > cd AB-1234
Directory successfully changed.
ncftp /uploads/AB-1234 > put file or *

Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created,
nor will you (or anyone other than Whamcloud staff) be able to see or read them.

Serguei

Could you please investigate?

Thanks

Peter

Comment by Olaf Faaland [ 01/Nov/22 ]

crash dumps, kernel modules, and debug packages sent. Thanks

Comment by Serguei Smirnov [ 01/Nov/22 ]

Hi Olaf,

After a quick initial examination, it looks like the llnl branch could benefit from:

  • LU-14503 o2iblnd: clean up zombie connections on shutdown

I haven't yet finished looking at the dumps you provided, so will update later.

Thanks,

Serguei.

Comment by Olaf Faaland [ 01/Nov/22 ]

Thanks, Serguei. I don't know why this would be the case, but I've only noticed this on router nodes so far.

Comment by Serguei Smirnov [ 02/Nov/22 ]

Hi Olaf,

Here are a couple more patches that fix problems which may be leading to the same symptom:

https://review.whamcloud.com/#/c/fs/lustre-release/+/46711/

https://review.whamcloud.com/41937

The latter is a rollback of "LU-13368 lnet: discard the callback". As Andreas pointed out, there were reports linking that patch to issues with IB cleanup.

On my side, I wasn't able to build 2.15.1_7.llnl because something needed for KFI is missing on my system and I couldn't disable building KFI, so I ran some tests with vanilla 2.15. I haven't reproduced the indefinite hang yet: after a few minutes (I think this is related to CM-level timeouts in IB) all peers get disconnected for me.

How easy is it to reproduce the issue in your lab? Do you need to do anything special to get it to happen (e.g. pull cables, particular order of shutting down nodes)?

Thanks,

Serguei. 

Comment by Olaf Faaland [ 07/Nov/22 ]

Hi Serguei,

I'm happy to try adding those patches to our stack.

I've updated the description with a little more information. Essentially, this was triggered by attempting to stop LNet on the router (so I could experiment with patch stacks) while the clients and servers remained up. No pulling of cables, powering off nodes, etc.

You should have been able to just skip the KFI build and therefore not need its dependencies. I'll check the spec file and configure check, thanks for letting me know.

Comment by Serguei Smirnov [ 08/Nov/22 ]

Olaf, 

I'm still figuring out a reliable reproducer. I haven't tested with a routing setup yet.

What are the sysctl parameter settings on your system?

sysctl -a | grep arp_filter
sysctl -a | grep arp_ignore
sysctl -a | grep arp_announce
sysctl -a | grep rp_filter

Are you using the settings recommended here: https://wiki.whamcloud.com/display/LNet/MR+Cluster+Setup ?

The reason I'm asking is that on one occasion I did manage to reproduce the issue with o2iblnd peers failing to disconnect on shutdown, but I then checked these settings on my system and adjusted some of them because they didn't match the recommended values. Since then I haven't been able to reproduce the issue. I'll have to do more tests before I can be confident the issue cannot be reproduced with the recommended settings, but I thought we could start by checking what they are in your case.

Thanks,

Serguei.

Comment by Olaf Faaland [ 08/Nov/22 ]

Hi Serguei,

Here are the sysctls you asked for, for the two interfaces used by LNet - hsi0 is the internal OPA fabric and san0 is the external fabric with the Lustre servers. I can send you the values for all the interfaces (there are many!) if you need them.

net.ipv4.conf.hsi0.arp_filter = 0
net.ipv4.conf.san0.arp_filter = 0
net.ipv4.conf.hsi0.arp_ignore = 0
net.ipv4.conf.san0.arp_ignore = 0
net.ipv4.conf.hsi0.arp_announce = 0
net.ipv4.conf.san0.arp_announce = 0
net.ipv4.conf.hsi0.rp_filter = 1
net.ipv4.conf.san0.rp_filter = 1

I need to read through the confluence page, but a quick scan makes me think our settings are not consistent with those recommendations.

thanks,
Olaf

Comment by Serguei Smirnov [ 09/Nov/22 ]

Hi Olaf,

I don't need to see the sysctl settings for all interfaces, only for those involved, but I would also like to see the "all" and "default" settings, since these can affect which value is actually used, depending on the parameter. For example, the maximum of the {all, interface} values is used for rp_filter.
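
A quick way to see the per-interface values together with "all" and "default" (interface names here match Olaf's node; adjust as needed):

# rp_filter is effectively max(all, interface), so print both sets side by side
for i in all default hsi0 san0; do
    for p in rp_filter arp_filter arp_ignore arp_announce; do
        sysctl net.ipv4.conf.$i.$p
    done
done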

In any case, this probably matters most in the MR case, when there are multiple interfaces on the same LNet.

So far the reproducer I found for my local system is this:

  1. b2_15 lustre server with two interfaces on the same o2ib lnet. Runs both mds and oss.
  2. b2_15 lustre client running on a VM hosted on a different machine, configured to use a single ib interface on the same o2ib net.
  3. Mount FS on the client. Mount command lists both server nids. Verify it worked (I use ls on the mounted directory)
  4. Unmount FS on the client. 
  5. Pull cable on the first of the server nids.
  6. Mount FS on the client same way as before. Verify it works just the same.
  7. Unmount mdt and oss/ost on the server. This should succeed.
  8. Run "lustre_rmmod" on the server. This appears to get stuck indefinitely.
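
Roughly the same steps in command form (fsname, mount points and NIDs below are placeholders, not my exact setup):

# client, steps 3-4: mount listing both server NIDs, verify, unmount
mount -t lustre 10.1.0.21@o2ib:10.1.0.22@o2ib:/lustre /mnt/lustre
ls /mnt/lustre
umount /mnt/lustre
# step 5: pull the cable on the first server interface, then repeat the mount and ls (step 6)
# server, steps 7-8: unmount the targets, then unload the modules -- this is where it hangs
umount /mnt/mdt /mnt/ost0
lustre_rmmod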

The messages in the debug log on the server at the time when it gets stuck look similar to this:

00000800:00020000:1.0:1667958274.439619:0:1967799:0:(o2iblnd_cb.c:2490:kiblnd_passive_connect()) Can't accept conn from 10.1.0.50@o2ib on NA (ib1:1:10.1.0.21): bad dst nid 10.1.0.21@o2ib
00000800:00020000:7.0:1667958279.437379:0:1983222:0:(o2iblnd_cb.c:2490:kiblnd_passive_connect()) Can't accept conn from 10.1.0.50@o2ib on NA (ib1:1:10.1.0.21): bad dst nid 10.1.0.21@o2ib
00000800:00020000:7.0:1667958279.450225:0:1983222:0:(o2iblnd_cb.c:2490:kiblnd_passive_connect()) Can't accept conn from 10.1.0.50@o2ib on NA (ib1:1:10.1.0.21): bad dst nid 10.1.0.21@o2ib
00000800:00000200:4.0:1667958310.280478:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect
00000800:00000100:4.0:1667958316.514548:0:1758823:0:(o2iblnd.c:2530:kiblnd_set_ni_fatal_on()) Fatal device error for NI 10.1.0.21@o2ib
00000800:00000200:3.0F:1667958571.400477:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect
00000800:00000200:2.0:1667959094.664490:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect
00000800:00000200:5.0F:1667960142.216481:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect

"Fatal device error" reported against the remaining connected ib interface can be ignored as it is caused by shutting down the IB switch.

I tested with and without the rollback of "LU-13368 lnet: discard the callback" on the server, to the same effect. I haven't tried the latest master yet - I'm planning to do that next and will update then.

Thanks,

Serguei.

Comment by Olaf Faaland [ 10/Nov/22 ]

Hi Serguei,

Here are the rest of the sysctls:

net.ipv4.conf.all.arp_announce = 0
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.arp_ignore = 0
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.arp_announce = 0
net.ipv4.conf.default.arp_filter = 0
net.ipv4.conf.default.arp_ignore = 0
net.ipv4.conf.default.rp_filter = 1

In my case, I have only one LNet NI per network. Each router node has two OPA links (hsi[01], one of which is not configured in LNet) and one IB link (san0). In case it helps:

[root@mutt4:~]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib44
      local NI(s):
        - nid: 192.168.128.4@o2ib44
          status: up
          interfaces:
              0: hsi0
    - net type: o2ib100
      local NI(s):
        - nid: 172.19.1.108@o2ib100
          status: up
          interfaces:
              0: san0

[root@mutt4:~]# ibstat | grep -w -e CA -e State -e Physical -e Firmware
CA 'hfi1_0'
        CA type: 
        Firmware version: 1.27.0
                State: Active
                Physical state: LinkUp
CA 'hfi1_1'
        CA type: 
        Firmware version: 1.27.0
                State: Active
                Physical state: LinkUp
CA 'mlx5_0'
        CA type: MT4123
        Firmware version: 20.32.2004
                State: Active
                Physical state: LinkUp
CA 'mlx5_bond_0'
        CA type: MT4125
        Firmware version: 22.32.2004
                State: Active
                Physical state: LinkUp

Comment by Serguei Smirnov [ 24/Nov/22 ]

Hi Olaf,

On my local setup, using b2_15 and the steps-to-reproduce from the earlier comment, it appears that https://review.whamcloud.com/46711 is able to fix the issue with getting stuck on shutdown.

On the other hand, on the master branch, checking out the commit immediately before this fix causes the issue to appear.

Even though my reproducer is different, I think it is a good candidate to try in your environment. 

Thanks,

Serguei.

Comment by Olaf Faaland [ 29/Nov/22 ]

Thanks, Serguei. I hope to test it tomorrow.

Comment by Olaf Faaland [ 01/Dec/22 ]

Hi Serguei,

I performed a test with https://review.whamcloud.com/46711 applied, and I still see "waiting for 1 peers to disconnect".

My reproducer:
1. Start a lustre file system on garter[1-8], on o2ib100 (mlx)
2. Start LNet on 4 routers, mutt[1-4], on o2ib100 and o2ib44 (opa)
3. Mount the file system on 64 clients on o2ib44, which reach garter through mutt[1-4]
4. Start a 64-node 512-task IOR on the clients, writing to all the OSTs
5. Run "systemctl stop lnet" on mutt3
6. I observe "lnetctl lnet unconfigure" is hung as originally reported, and the stack is the same. The console log for mutt3 shows "waiting for 1 peers to disconnect" repeatedly
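
In command form, the router-side part looks roughly like this (the srun/IOR invocation is a placeholder, not our exact command):

# clients, step 4: 64-node, 512-task IOR writing to the file system
srun -N64 -n512 ior -w -o /p/lfs/ior_testfile
# router mutt3, step 5: stop LNet while the IOR is still running
systemctl stop lnet
# step 6: the stop hangs in "lnetctl lnet unconfigure" and the console repeats the message
dmesg -w | grep 'peers to disconnect'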

Just to be sure, note that we are not using MR.

thanks,
Olaf

Comment by Serguei Smirnov [ 02/Dec/22 ]

Hi Olaf,

It looks like I'm able to reproduce the issue using a similar setup. I used two routers routing between IB and TCP networks, and lnet_selftest to generate traffic between the IB server and the TCP client.

I should be able to use this to look further into a proper fix. In the meantime, I experimented with executing "lnetctl set routing 0" on the router node before running "lustre_rmmod" on it, which seems to prevent it from getting stuck. As a kind of temporary workaround, could you give this extra step a try and see whether it helps in your case too?
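
That is, the stop sequence on the router becomes (same commands as discussed above, in this order):

lnetctl set routing 0        # stop routing first so the router's peer connections can wind down
lnetctl lnet unconfigure     # the step that otherwise hangs in kiblnd_shutdown()
lustre_rmmod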

Thanks,

Serguei.

Comment by Olaf Faaland [ 05/Dec/22 ]

> I experimented with executing "lnetctl set routing 0" on the router node

Good idea. Doing this before "lnetctl lnet unconfigure" prevents the hang in kiblnd_shutdown(), thanks.

Comment by Olaf Faaland [ 21/Dec/22 ]

Hi Serguei,

Do you have any update on this issue?

Thanks

Comment by Serguei Smirnov [ 21/Dec/22 ]

Hi Olaf,

Sorry, not yet. It doesn't address the root cause, but, for lack of better ideas, I was considering changing the shutdown procedure to include "lnetctl set routing 0"; I haven't submitted that patch yet.

Thanks,

Serguei.

Comment by Olaf Faaland [ 21/Dec/22 ]

Yes, we are adding "lnetctl set routing 0" to the shutdown tasks in our lnet service file after the holiday break.
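
For what it's worth, one shape this could take is a systemd drop-in; the sketch below is only illustrative (the drop-in path and file name are hypothetical, and the ExecStop list has to be re-declared because drop-in ExecStop entries are appended after the packaged ones, so the real unit will differ):

# create a drop-in that disables routing before LNet is unconfigured
mkdir -p /etc/systemd/system/lnet.service.d
cat > /etc/systemd/system/lnet.service.d/routing-off.conf <<'EOF'
[Service]
# clear the packaged ExecStop list, then re-declare it with routing disabled first
ExecStop=
ExecStop=/usr/sbin/lnetctl set routing 0
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod
EOF
systemctl daemon-reload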

Comment by Olaf Faaland [ 09/Jan/23 ]

Hi Serguei,
I've added "lnetctl set routing 0" to our lnet service file. Have you had any success identifying the problem? Thanks

Comment by Serguei Smirnov [ 09/Jan/23 ]

Hi Olaf,

I haven't been able to conclusively identify the problem yet. I believe it has to do with some sort of race on LNet shutdown, but that much is fairly obvious. The workaround you applied should be good for most cases; the only scenario it doesn't cover is probably when active router NIs are brought down/up dynamically.

Thanks,

Serguei.

Comment by Olaf Faaland [ 19/Apr/23 ]

This workaround seems to be working well for us. We do not normally bring NIs up and down dynamically in production, so the potential problem scenario probably won't occur and thus won't hurt us. I'll remove topllnl, but leave the ticket open until an actual fix is identified.
