Lustre / LU-16283

o2iblnd.c:3049:kiblnd_shutdown() <NID>: waiting for <N> peers to disconnect

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.15.1
    • Environment: 4.18.0-372.32.1.1toss.t4.x86_64
      lustre-2.15.1_7.llnl-2.t4.x86_64

    Description

      Observed on a Lustre router node while the servers and some of the clients were up and connected. The Lustre router node has Omni-Path on the client side and InfiniBand on the Lustre server side.
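      (For context, a minimal sketch of how a dual-network router NI setup like this is typically brought up with lnetctl; the interface and network names are taken from the "lnetctl net show" output further down in this ticket, and module options are omitted:)

      # bring up LNet, add the two router-side networks, then enable forwarding
      lnetctl lnet configure
      lnetctl net add --net o2ib44 --if hsi0     # Omni-Path, client-facing
      lnetctl net add --net o2ib100 --if san0    # InfiniBand, server-facing
      lnetctl set routing 1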

      On this node,

      lnetctl lnet unconfigure

      hangs with the following stack:

      [<0>] kiblnd_shutdown+0x347/0x4e0 [ko2iblnd]
      [<0>] lnet_shutdown_lndni+0x2b6/0x4c0 [lnet]
      [<0>] lnet_shutdown_lndnet+0x6c/0xb0 [lnet]
      [<0>] lnet_shutdown_lndnets+0x11e/0x300 [lnet]
      [<0>] LNetNIFini+0xb7/0x130 [lnet]
      [<0>] lnet_ioctl+0x220/0x260 [lnet]
      [<0>] notifier_call_chain+0x47/0x70
      [<0>] blocking_notifier_call_chain+0x42/0x60
      [<0>] libcfs_psdev_ioctl+0x346/0x590 [libcfs]
      [<0>] do_vfs_ioctl+0xa5/0x740
      [<0>] ksys_ioctl+0x64/0xa0
      [<0>] __x64_sys_ioctl+0x16/0x20
      [<0>] do_syscall_64+0x5b/0x1b0
      [<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 
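      (If needed, a stack like the one above can be re-captured from procfs while the command is stuck; the pid lookup here is only an illustration:)

      # assumes the hung lnetctl process is still present
      pid=$(pidof lnetctl)
      cat /proc/${pid}/stack
      # or dump all blocked tasks to the console/dmesg
      echo w > /proc/sysrq-trigger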

      Debug log shows it's waiting for 3 peers, even after 3700 seconds:

      00000800:00000200:1.0:1667256015.359743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect 
      ...
      00000800:00000200:3.0:1667259799.039743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
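      (The dk.* attachments are LNet debug logs; a typical, hedged way to capture them is:)

      # make sure net debugging is enabled, then dump the kernel debug buffer
      lctl set_param debug=+net
      lctl dk > /tmp/dk.$(hostname).$(date +%s).log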

      Before the shutdown there were 38 peers, all reported as "up".

      For patch stack, see https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl

      For my reference, my local ticket is TOSS5826

      Attachments

        1. dk.mutt4.1.gz
          33 kB
        2. dk.mutt4.2.gz
          256 kB
        3. dk.mutt4.3.gz
          57 kB
        4. dmesg.mutt4.1667256190.gz
          32 kB
        5. dmesg.mutt4.1667259716.gz
          0.6 kB
        6. lnetctl.peer.show.mutt4.1.gz
          1 kB

        Issue Links

          Activity

            [LU-16283] o2iblnd.c:3049:kiblnd_shutdown() <NID>: waiting for <N> peers to disconnect
            ofaaland Olaf Faaland added a comment -

            Hi Serguei,
            I've added "lnetctl set routing 0" to our lnet service file. Have you had any success identifying the problem? Thanks

            ofaaland Olaf Faaland added a comment -

            Yes, we are adding "lnetctl set routing 0" to the shutdown tasks in our lnet service file after the holiday break.
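            (A hedged sketch of what such a change might look like as a systemd drop-in; the drop-in path and the stock ExecStop commands below are assumptions, and the "set routing 0" step must come before the unconfigure:)

            # hypothetical drop-in overriding the ExecStop order of lnet.service
            mkdir -p /etc/systemd/system/lnet.service.d
            cat > /etc/systemd/system/lnet.service.d/50-routing.conf <<'EOF'
            [Service]
            # clear the inherited ExecStop list, then re-add it with routing disabled first
            ExecStop=
            ExecStop=/usr/sbin/lnetctl set routing 0
            ExecStop=/usr/sbin/lnetctl lnet unconfigure
            EOF
            systemctl daemon-reload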


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            Sorry, not yet. It is not addressing the root cause, but, for the lack of better ideas, I was considering changing the shutdown procedure to include "lnetctl set routing 0", but haven't submitted the patch yet.

            Thanks,

            Serguei.


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            Do you have any update on this issue?

            Thanks

            ofaaland Olaf Faaland added a comment -

            > I experimented with executing "lnetctl set routing 0" on the router node

            Good idea. Doing this before "lnetctl lnet unconfigure" prevents the hang in kiblnd_shutdown(), thanks.


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            It looks like I'm able to reproduce the issue using a similar setup. I was using two routers, routing between ib and tcp networks, and lnet_selftest to generate traffic between the ib server and the tcp client.

            I should be able to use this to look further into fixing this properly. In the meantime though, I experimented with executing "lnetctl set routing 0" on the router node before running "lustre_rmmod" on it, which seems to prevent it from getting stuck. I wonder if you can give this extra step a try to see if it helps in your case, too, as a kind of temporary workaround.
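            (In short, a minimal sketch of the suggested order on the router:)

            # per the workaround above: disable routing first, then tear down as usual
            lnetctl set routing 0
            lustre_rmmod        # or: lnetctl lnet unconfigure, if stopping LNet via the service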

            Thanks,

            Serguei.


            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            I performed a test, with https://review.whamcloud.com/46711, and still see "waiting for 1 peers to disconnect".

            My reproducer:
            1. Start a lustre file system on garter[1-8], on o2ib100 (mlx)
            2. Start LNet on 4 routers, mutt[1-4], on o2ib100 and o2ib44 (opa)
            3. Mount the file system on 64 clients on o2ib44, which reach garter through mutt[1-4]
            4. Start a 64-node 512-task IOR on the clients, writing to all the OSTs
            5. Run "systemctl stop lnet" on mutt3
            6. I observe "lnetctl lnet unconfigure" is hung as originally reported, and the stack is the same. The console log for mutt3 shows "waiting for 1 peers to disconnect" repeatedly

            Just to be sure, note that we are not using MR (Multi-Rail).
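            (For what it's worth, one hedged way to confirm whether Multi-Rail and discovery are in play on a node:)

            # MR-negotiated peers show "Multi-Rail: True" in peer show;
            # "lnetctl global show" includes the discovery setting (0 = disabled)
            lnetctl peer show | grep -i "Multi-Rail"
            lnetctl global show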

            thanks,
            Olaf


            ofaaland Olaf Faaland added a comment -

            Thanks, Serguei. I hope to test it tomorrow.


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            On my local setup, using b2_15 and the steps-to-reproduce from the earlier comment, it appears that https://review.whamcloud.com/46711 is able to fix the issue with getting stuck on shutdown.

            On the other hand, on the master branch, checking out the commit immediately before this fix causes the issue to appear.

            Even though my reproducer is different, I think it is a good candidate to try in your environment. 
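            (For anyone else wanting to try the same change on b2_15, a sketch of pulling it from Gerrit; "N" below is a placeholder for the patchset number:)

            # fetch change 46711 from the lustre-release Gerrit and apply it on top of b2_15
            git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
            git checkout b2_15
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/11/46711/N
            git cherry-pick FETCH_HEAD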

            Thanks,

            Serguei.

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            Here are the rest of the sysctls:

            net.ipv4.conf.all.arp_announce = 0
            net.ipv4.conf.all.arp_filter = 1
            net.ipv4.conf.all.arp_ignore = 0
            net.ipv4.conf.all.rp_filter = 1
            net.ipv4.conf.default.arp_announce = 0
            net.ipv4.conf.default.arp_filter = 0
            net.ipv4.conf.default.arp_ignore = 0
            net.ipv4.conf.default.rp_filter = 1
            

            In my case, I have only one LNet NI per network. Each router node has 2 OPA links (called hsi[01], one not configured in LNet) and one IB link (called san0). In case it helps:

            [root@mutt4:~]# lnetctl net show
            net:
                - net type: lo
                  local NI(s):
                    - nid: 0@lo
                      status: up
                - net type: o2ib44
                  local NI(s):
                    - nid: 192.168.128.4@o2ib44
                      status: up
                      interfaces:
                          0: hsi0
                - net type: o2ib100
                  local NI(s):
                    - nid: 172.19.1.108@o2ib100
                      status: up
                      interfaces:
                          0: san0
            
            [root@mutt4:~]# ibstat | grep -w -e CA -e State -e Physical -e Firmware
            CA 'hfi1_0'
                    CA type: 
                    Firmware version: 1.27.0
                            State: Active
                            Physical state: LinkUp
            CA 'hfi1_1'
                    CA type: 
                    Firmware version: 1.27.0
                            State: Active
                            Physical state: LinkUp
            CA 'mlx5_0'
                    CA type: MT4123
                    Firmware version: 20.32.2004
                            State: Active
                            Physical state: LinkUp
            CA 'mlx5_bond_0'
                    CA type: MT4125
                    Firmware version: 22.32.2004
                            State: Active
                            Physical state: LinkUp
            

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            I don't need to see the sysctl settings for all interfaces, only for those involved, but I would like to see the "all" and "default" settings as these may affect which value gets used, depending on the parameter. For example, the max of the {all, interface} values is used for rp_filter.

            In any case, this probably matters most in the MR case, when there are multiple interfaces on the same LNet.
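            (For example, a quick check of the values involved for a given interface; "ib1" here is just the interface name that appears in the log excerpt below:)

            # the effective rp_filter is the max of the "all" and per-interface values
            sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.default.rp_filter net.ipv4.conf.ib1.rp_filter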

            So far the reproducer I found for my local system is this:

            1. b2_15 lustre server with two interfaces on the same o2ib lnet. Runs both mds and oss.
            2. b2_15 lustre client running on a VM hosted on a different machine, configured to use a single ib interface on the same o2ib net.
            3. Mount FS on the client. The mount command lists both server nids (an example mount command of this form follows the list below). Verify it worked (I use ls on the mounted directory)
            4. Unmount FS on the client. 
            5. Pull cable on the first of the server nids.
            6. Mount FS on the client same way as before. Verify it works just the same.
            7. Unmount mdt and oss/ost on the server. This should succeed.
            8. Run "lustre_rmmod" on the server. This appears to get stuck indefinitely.
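            (For illustration, a mount command of the form used in step 3; the second NID and the fsname are placeholders:)

            # comma-separated NIDs refer to the same MGS/server node
            mount -t lustre 10.1.0.21@o2ib,10.1.0.22@o2ib:/testfs /mnt/testfs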

            The messages in the debug log on the server at the time when it gets stuck look similar to this:

            00000800:00020000:1.0:1667958274.439619:0:1967799:0:(o2iblnd_cb.c:2490:kiblnd_passive_connect()) Can't accept conn from 10.1.0.50@o2ib on NA (ib1:1:10.1.0.21): bad dst nid 10.1.0.21@o2ib
            00000800:00020000:7.0:1667958279.437379:0:1983222:0:(o2iblnd_cb.c:2490:kiblnd_passive_connect()) Can't accept conn from 10.1.0.50@o2ib on NA (ib1:1:10.1.0.21): bad dst nid 10.1.0.21@o2ib
            00000800:00020000:7.0:1667958279.450225:0:1983222:0:(o2iblnd_cb.c:2490:kiblnd_passive_connect()) Can't accept conn from 10.1.0.50@o2ib on NA (ib1:1:10.1.0.21): bad dst nid 10.1.0.21@o2ib
            00000800:00000200:4.0:1667958310.280478:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect
            00000800:00000100:4.0:1667958316.514548:0:1758823:0:(o2iblnd.c:2530:kiblnd_set_ni_fatal_on()) Fatal device error for NI 10.1.0.21@o2ib
            00000800:00000200:3.0F:1667958571.400477:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect
            00000800:00000200:2.0:1667959094.664490:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect
            00000800:00000200:5.0F:1667960142.216481:0:2280634:0:(o2iblnd.c:3049:kiblnd_shutdown()) 10.1.0.21@o2ib: waiting for 1 peers to disconnect

            "Fatal device error" reported against the remaining connected ib interface can be ignored as it is caused by shutting down the IB switch.

            I tested with and without "rollback of LU-13368 lnet: discard the callback" on the server, to the same effect. I haven't tried with the latest master yet - I plan to do that next and will update then.

            Thanks,

            Serguei.

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland