Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.15.1
None
4.18.0-372.32.1.1toss.t4.x86_64
lustre-2.15.1_7.llnl-2.t4.x86_64
3
9223372036854775807
Description
Observed on a Lustre router node while the servers and some of the clients were up and connected. The router node has Omni-Path on the client side and IB on the Lustre server side.
lnetctl lnet unconfigure
hangs with stack
[<0>] kiblnd_shutdown+0x347/0x4e0 [ko2iblnd]
[<0>] lnet_shutdown_lndni+0x2b6/0x4c0 [lnet]
[<0>] lnet_shutdown_lndnet+0x6c/0xb0 [lnet]
[<0>] lnet_shutdown_lndnets+0x11e/0x300 [lnet]
[<0>] LNetNIFini+0xb7/0x130 [lnet]
[<0>] lnet_ioctl+0x220/0x260 [lnet]
[<0>] notifier_call_chain+0x47/0x70
[<0>] blocking_notifier_call_chain+0x42/0x60
[<0>] libcfs_psdev_ioctl+0x346/0x590 [libcfs]
[<0>] do_vfs_ioctl+0xa5/0x740
[<0>] ksys_ioctl+0x64/0xa0
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
Debug log shows it's waiting for 3 peers, even after 3700 seconds:
00000800:00000200:1.0:1667256015.359743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
...
00000800:00000200:3.0:1667259799.039743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
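To check how long the shutdown has been stuck, the timestamps in those two messages can be subtracted. The following is a minimal sketch, not a Lustre tool: the helper name shutdown_wait and the field layout in LINE_RE are assumptions inferred only from the two log lines quoted above, so the pattern may need adjusting for other debug log formats.

```python
import re

# Assumed layout, inferred from the two lines quoted in this ticket:
# subsys:mask:cpu:timestamp:field:pid:field:(file:line:function()) message
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):[^:]+:"
    r"(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<src>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)
WAIT_RE = re.compile(r"waiting for (\d+) peers to disconnect")


def shutdown_wait(lines):
    """Return (elapsed_seconds, last_peer_count) for kiblnd_shutdown wait
    messages, or None if no such message is found."""
    stamps = []
    peers = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("func") != "kiblnd_shutdown":
            continue
        w = WAIT_RE.search(m.group("msg"))
        if not w:
            continue
        stamps.append(float(m.group("ts")))
        peers = int(w.group(1))  # count reported by the most recent message
    if not stamps:
        return None
    return max(stamps) - min(stamps), peers


# The two log lines from this ticket.
SAMPLE = [
    "00000800:00000200:1.0:1667256015.359743:0:35023:0:"
    "(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: "
    "waiting for 3 peers to disconnect",
    "00000800:00000200:3.0:1667259799.039743:0:35023:0:"
    "(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: "
    "waiting for 3 peers to disconnect",
]

if __name__ == "__main__":
    elapsed, peers = shutdown_wait(SAMPLE)
    print(f"waited {elapsed:.2f} s, still {peers} peers outstanding")
```

Run against a full debug log dump, the same function reports the span between the first and last wait message, which makes it easy to confirm that the peer count never drops.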
Before the shutdown there were 38 peers, all reported as "up".
For patch stack, see https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl
For my reference, my local ticket is TOSS5826
Issue Links
- is related to LU-17480 lustre_rmmod hangs if a lnet route is down (Resolved)
Hi Olaf,
I haven't been able to conclusively identify the problem yet. I believe it is some sort of race on LNet shutdown, though that much is fairly obvious. The workaround you applied should be good for most cases; the only scenario it probably doesn't cover is when active router NIs are being brought down and up dynamically.
Thanks,
Serguei.