Details
Type: Bug
Priority: Minor
Resolution: Unresolved
Affects Version/s: Lustre 2.15.1
Environment:
4.18.0-372.32.1.1toss.t4.x86_64
lustre-2.15.1_7.llnl-2.t4.x86_64
Severity: 3
Description
Observed on a Lustre router node while the servers and some of the clients were up and connected. The router node has Omni-Path on the client side and InfiniBand (IB) on the Lustre server side.
lnetctl lnet unconfigure
hangs with the following stack:
[<0>] kiblnd_shutdown+0x347/0x4e0 [ko2iblnd]
[<0>] lnet_shutdown_lndni+0x2b6/0x4c0 [lnet]
[<0>] lnet_shutdown_lndnet+0x6c/0xb0 [lnet]
[<0>] lnet_shutdown_lndnets+0x11e/0x300 [lnet]
[<0>] LNetNIFini+0xb7/0x130 [lnet]
[<0>] lnet_ioctl+0x220/0x260 [lnet]
[<0>] notifier_call_chain+0x47/0x70
[<0>] blocking_notifier_call_chain+0x42/0x60
[<0>] libcfs_psdev_ioctl+0x346/0x590 [libcfs]
[<0>] do_vfs_ioctl+0xa5/0x740
[<0>] ksys_ioctl+0x64/0xa0
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
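Read from the bottom up, the stack is a single call chain from the ioctl system call down to the function that is blocked. The Python sketch below is only an illustration for reading such traces: it assumes each frame has the form "[<0>] function+0xOFFSET/0xSIZE" optionally followed by "[module]", and the names STACK, FRAME_RE and parse_stack are hypothetical helpers, not part of any Lustre tool.

```python
import re

# The stack from this report, as it was captured on one line.
STACK = (
    "[<0>] kiblnd_shutdown+0x347/0x4e0 [ko2iblnd] "
    "[<0>] lnet_shutdown_lndni+0x2b6/0x4c0 [lnet] "
    "[<0>] lnet_shutdown_lndnet+0x6c/0xb0 [lnet] "
    "[<0>] lnet_shutdown_lndnets+0x11e/0x300 [lnet] "
    "[<0>] LNetNIFini+0xb7/0x130 [lnet] "
    "[<0>] lnet_ioctl+0x220/0x260 [lnet] "
    "[<0>] notifier_call_chain+0x47/0x70 "
    "[<0>] blocking_notifier_call_chain+0x42/0x60 "
    "[<0>] libcfs_psdev_ioctl+0x346/0x590 [libcfs] "
    "[<0>] do_vfs_ioctl+0xa5/0x740 "
    "[<0>] ksys_ioctl+0x64/0xa0 "
    "[<0>] __x64_sys_ioctl+0x16/0x20 "
    "[<0>] do_syscall_64+0x5b/0x1b0 "
    "[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6"
)

# Assumed frame format: "[<0>] func+0xOFF/0xLEN" plus an optional " [module]".
FRAME_RE = re.compile(
    r"\[<0>\] ([^\s+]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def parse_stack(text):
    """Return (function, module) pairs, innermost (blocked) frame first.

    module is None for frames that belong to the core kernel image.
    """
    return [(m.group(1), m.group(2)) for m in FRAME_RE.finditer(text)]


# Print the chain outermost first, indenting one level per call.
for depth, (func, module) in enumerate(reversed(parse_stack(STACK))):
    print("  " * depth + func + (f" [{module}]" if module else ""))
```

Laid out this way, the trace shows the request entering through lnet_ioctl() and LNetNIFini() in lnet and stalling inside kiblnd_shutdown() in ko2iblnd.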
The debug log shows kiblnd_shutdown() still waiting for 3 peers to disconnect more than 3700 seconds later:
00000800:00000200:1.0:1667256015.359743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
...
00000800:00000200:3.0:1667259799.039743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
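The length of the wait can be read straight from the two quoted log lines. The Python sketch below assumes the colon-separated debug-log header seen in these lines, where the fourth field is a seconds.microseconds timestamp and the message text follows the seventh colon; that layout is inferred from this log, not taken from Lustre documentation, and parse_line is a hypothetical helper.

```python
import re

# The first and last debug-log lines quoted above, copied verbatim.
FIRST = ("00000800:00000200:1.0:1667256015.359743:0:35023:0:"
         "(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: "
         "waiting for 3 peers to disconnect")
LAST = ("00000800:00000200:3.0:1667259799.039743:0:35023:0:"
        "(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: "
        "waiting for 3 peers to disconnect")


def parse_line(line):
    """Return (timestamp_seconds, peers_waited_for) for one log line.

    Assumes seven header fields before the message; split at most
    seven times so colons inside the message are left intact.
    """
    fields = line.split(":", 7)
    timestamp = float(fields[3])
    match = re.search(r"waiting for (\d+) peers", fields[7])
    peers = int(match.group(1)) if match else None
    return timestamp, peers


t0, peers0 = parse_line(FIRST)
t1, peers1 = parse_line(LAST)
elapsed = t1 - t0
print(f"still waiting for {peers1} peers after {elapsed:.0f} s")
# prints: still waiting for 3 peers after 3784 s
```

The peer count is the same (3) at both ends of the interval, so none of the remaining peers disconnected during the roughly 63 minutes covered by the log.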
Before the shutdown there were 38 peers, all reported as "up".
For the patch stack, see https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl
For my reference, my local ticket is TOSS5826
Issue Links
is related to: LU-17480 lustre_rmmod hangs if a lnet route is down (Resolved)