Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Version/s: Lustre 2.15.1
- Severity: 3
- Environment:
  4.18.0-372.32.1.1toss.t4.x86_64
  lustre-2.15.1_7.llnl-2.t4.x86_64
Description
Observed on a Lustre router node while the servers and some of the clients were up and connected. The Lustre router node has Omni-Path on the client side and InfiniBand on the Lustre server side.
lnetctl lnet unconfigure
hangs with the following stack:
[<0>] kiblnd_shutdown+0x347/0x4e0 [ko2iblnd]
[<0>] lnet_shutdown_lndni+0x2b6/0x4c0 [lnet]
[<0>] lnet_shutdown_lndnet+0x6c/0xb0 [lnet]
[<0>] lnet_shutdown_lndnets+0x11e/0x300 [lnet]
[<0>] LNetNIFini+0xb7/0x130 [lnet]
[<0>] lnet_ioctl+0x220/0x260 [lnet]
[<0>] notifier_call_chain+0x47/0x70
[<0>] blocking_notifier_call_chain+0x42/0x60
[<0>] libcfs_psdev_ioctl+0x346/0x590 [libcfs]
[<0>] do_vfs_ioctl+0xa5/0x740
[<0>] ksys_ioctl+0x64/0xa0
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
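A stack in this format can be re-sampled from the hung process via /proc at any point during the hang (assuming lnetctl is the process that is stuck):

cat /proc/$(pidof lnetctl)/stack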
The debug log shows it is still waiting for 3 peers to disconnect, even after ~3700 seconds:
00000800:00000200:1.0:1667256015.359743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
...
00000800:00000200:3.0:1667259799.039743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
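If it helps with reproduction: debug entries like these can be enabled and dumped with lctl (the +net mask below is just one choice that covers these messages):

lctl set_param debug=+net    # enable LNet/network debug messages
lctl dk /tmp/lnet-dk.log     # dump the kernel debug buffer to a file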
Before the shutdown there were 38 peers, all reported as "up".
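Presumably the peer count and "up" status came from the peer state on the router, which can be listed with:

lnetctl peer show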
For the patch stack, see https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl
For my reference, my local ticket is TOSS5826.
Issue Links
- is related to: LU-17480 lustre_rmmod hangs if a lnet route is down (Resolved)

Comments
Hi Olaf,
It looks like I'm able to reproduce the issue using a similar setup. I was using two routers routing between IB and TCP networks, and lnet_selftest to generate traffic between the IB server and the TCP client.
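For the record, the traffic generation was along these lines (a sketch only; the NIDs below are placeholders, and the exact brw parameters are not important):

# load the selftest module on the server, client, and console nodes first:
# modprobe lnet_selftest
export LST_SESSION=$$
lst new_session rtr_test
lst add_group clients 192.168.1.10@tcp     # placeholder NID of the tcp client
lst add_group servers 10.10.0.2@o2ib       # placeholder NID of the ib server
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers brw write size=1M
lst run bulk_rw
# ...let traffic flow through the routers for a while, then tear down:
lst stop bulk_rw
lst end_session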
I should be able to use this to look further into fixing this properly. In the meantime, though, I experimented with executing "lnetctl set routing 0" on the router node before running "lustre_rmmod" on it, which seems to prevent it from getting stuck. I wonder if you can give this extra step a try to see if it helps in your case, too, as a kind of temporary workaround; the exact sequence is sketched below.
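On the router node:

lnetctl set routing 0    # disable the routing role first
lustre_rmmod             # in my testing, the unload then no longer hangs in kiblnd_shutdown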
Thanks,
Serguei.