Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.15.1
-
None
-
4.18.0-372.32.1.1toss.t4.x86_64
lustre-2.15.1_7.llnl-2.t4.x86_64
-
3
-
9223372036854775807
Description
Observed on a Lustre router node while the servers and some of the clients were up and connected. The Lustre router node has Omnipath on the client side and IB on the Lustre server side.
lnetctl lnet unconfigure
hangs with stack
[<0>] kiblnd_shutdown+0x347/0x4e0 [ko2iblnd]
[<0>] lnet_shutdown_lndni+0x2b6/0x4c0 [lnet]
[<0>] lnet_shutdown_lndnet+0x6c/0xb0 [lnet]
[<0>] lnet_shutdown_lndnets+0x11e/0x300 [lnet]
[<0>] LNetNIFini+0xb7/0x130 [lnet]
[<0>] lnet_ioctl+0x220/0x260 [lnet]
[<0>] notifier_call_chain+0x47/0x70
[<0>] blocking_notifier_call_chain+0x42/0x60
[<0>] libcfs_psdev_ioctl+0x346/0x590 [libcfs]
[<0>] do_vfs_ioctl+0xa5/0x740
[<0>] ksys_ioctl+0x64/0xa0
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
Debug log shows it's waiting for 3 peers, even after 3700 seconds:
00000800:00000200:1.0:1667256015.359743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
...
00000800:00000200:3.0:1667259799.039743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
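For anyone checking a similar hang, the elapsed wait can be read directly off the timestamps in the debug log. The following is a minimal sketch, not part of any Lustre tooling: the function name `shutdown_wait_summary` and the regular expression are my own assumptions, written only against the two log lines quoted above, and they may need adjusting for other message formats.

```python
import re

# Hypothetical helper, matched only against the log lines quoted in this ticket.
# Captures the epoch timestamp (4th colon-separated field), the local NID,
# and the number of peers kiblnd_shutdown() is still waiting for.
WAIT_RE = re.compile(
    r"^\w+:\w+:[\d.]+:(?P<ts>\d+\.\d+):.*kiblnd_shutdown\(\)\)\s+"
    r"(?P<nid>\S+): waiting for (?P<peers>\d+) peers to disconnect"
)


def shutdown_wait_summary(lines):
    """Return (elapsed_seconds, last_peer_count), or None if nothing matches."""
    hits = []
    for line in lines:
        m = WAIT_RE.match(line.strip())
        if m:
            hits.append((float(m.group("ts")), int(m.group("peers"))))
    if not hits:
        return None
    # Elapsed time between the first and the last "waiting" message.
    return hits[-1][0] - hits[0][0], hits[-1][1]


# The two log lines from this ticket.
SAMPLE = [
    "00000800:00000200:1.0:1667256015.359743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) "
    "172.19.1.108@o2ib100: waiting for 3 peers to disconnect",
    "00000800:00000200:3.0:1667259799.039743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) "
    "172.19.1.108@o2ib100: waiting for 3 peers to disconnect",
]

if __name__ == "__main__":
    elapsed, peers = shutdown_wait_summary(SAMPLE)
    print(f"still waiting for {peers} peers after {elapsed:.0f} s")
```

On the two lines above this gives roughly 3784 seconds between the first and last message, with the peer count unchanged at 3, which is consistent with the "even after 3700 seconds" observation.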
Before the shutdown there were 38 peers, all reported as "up".
For patch stack, see https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl
For my reference, my local ticket is TOSS5826
Issue Links
- is related to LU-17480 lustre_rmmod hangs if a lnet route is down (Resolved)
This workaround seems to be working well for us. We do not normally bring NIs up and down dynamically in production, so the potential problem scenario probably won't occur and won't hurt us. I'll remove topllnl, but leave the ticket open until an actual fix is identified.