Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.15.4
-
TOSS 4.7-2.1
lustre 2.15.4_1.llnl-1
on lustre server asp
-
3
Description
We combined an OS update from TOSS 4.6-6 to TOSS 4.7-2.1 with a move from Lustre 2.14 to 2.15 (2.14.0_22.llnl-1 to lustre-2.15.4_1.llnl-1).
A few hours later we began to see the following LBUG, and eventually saw it on all 12 asp server nodes.
2024-01-16 23:13:19 [39638.886090] Lustre: aspls3-OST0004: Client 8f15405e-d4cc-cf3c-7534-051e7352cf50 (at 192.168.128.24@o2ib35) reconnecting
2024-01-16 23:13:19 [39638.896879] Lustre: Skipped 98 previous similar messages
2024-01-16 23:13:34 [39654.404520] LNetError: 165557:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed:
2024-01-16 23:13:34 [39654.416271] LNetError: 165557:0:(peer.c:2194:lnet_destroy_peer_ni_locked()) LBUG
2024-01-16 23:13:34 [39654.423671] Pid: 165557, comm: kiblnd_sd_00_01 4.18.0-513.9.1.1toss.t4.x86_64 #1 SMP Wed Nov 29 11:04:55 PST 2023
2024-01-16 23:13:34 [39654.433921] Call Trace TBD:
2024-01-16 23:13:34 [39654.436731] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
2024-01-16 23:13:34 [39654.441888] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
2024-01-16 23:13:34 [39654.446688] [<0>] lnet_destroy_peer_ni_locked+0x44d/0x4e0 [lnet]
2024-01-16 23:13:34 [39654.452722] [<0>] lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
2024-01-16 23:13:34 [39654.458845] [<0>] lnet_select_pathway+0xb95/0x16c0 [lnet]
2024-01-16 23:13:34 [39654.464265] [<0>] lnet_send+0x6d/0x1e0 [lnet]
2024-01-16 23:13:34 [39654.468646] [<0>] lnet_parse_local+0x3ed/0xdd0 [lnet]
2024-01-16 23:13:34 [39654.473721] [<0>] lnet_parse+0xd7d/0x1490 [lnet]
2024-01-16 23:13:34 [39654.478366] [<0>] kiblnd_handle_rx+0x30e/0x900 [ko2iblnd]
2024-01-16 23:13:34 [39654.483782] [<0>] kiblnd_scheduler+0x104b/0x10d0 [ko2iblnd]
2024-01-16 23:13:34 [39654.489363] [<0>] kthread+0x14c/0x170
2024-01-16 23:13:34 [39654.493030] [<0>] ret_from_fork+0x1f/0x40
2024-01-16 23:13:34 [39654.497050] Kernel panic - not syncing: LBUG
2024-01-16 23:13:34 [39654.501320] CPU: 47 PID: 165557 Comm: kiblnd_sd_00_01 Kdump: loaded Tainted: P OE X --------- - - 4.18.0-513.9.1.1toss.t4.x86_64 #1
2024-01-16 23:13:34 [39654.514172] Hardware name: Supermicro SSG-229P-DN2R24264-LL013/X11DSN-TS, BIOS 3.4 11/04/2020
2024-01-16 23:13:34 [39654.522683] Call Trace:
2024-01-16 23:13:34 [39654.525137] dump_stack+0x41/0x60
2024-01-16 23:13:34 [39654.528457] panic+0xe7/0x2ac
2024-01-16 23:13:34 [39654.531429] ? ret_from_fork+0x1f/0x40
2024-01-16 23:13:34 [39654.535182] lbug_with_loc.cold.8+0x18/0x18 [libcfs]
2024-01-16 23:13:34 [39654.540156] lnet_destroy_peer_ni_locked+0x44d/0x4e0 [lnet]
2024-01-16 23:13:35 [39654.545747] lnet_handle_find_routed_path+0x86c/0xee0 [lnet]
2024-01-16 23:13:35 [39654.551423] ? lnet_peer_ni_find_locked+0x14/0x30 [lnet]
2024-01-16 23:13:35 [39654.556753] lnet_select_pathway+0xb95/0x16c0 [lnet]
2024-01-16 23:13:35 [39654.561735] ? kiblnd_check_sends_locked+0x1a5/0x4a0 [ko2iblnd]
2024-01-16 23:13:35 [39654.567656] lnet_send+0x6d/0x1e0 [lnet]
2024-01-16 23:13:35 [39654.571600] lnet_parse_local+0x3ed/0xdd0 [lnet]
2024-01-16 23:13:35 [39654.576238] lnet_parse+0xd7d/0x1490 [lnet]
2024-01-16 23:13:35 [39654.580438] ? try_to_wake_up+0x1c2/0x4f0
2024-01-16 23:13:35 [39654.584454] kiblnd_handle_rx+0x30e/0x900 [ko2iblnd]
2024-01-16 23:13:35 [39654.589427] ? __wake_up_common+0x7a/0x190
2024-01-16 23:13:35 [39654.593526] kiblnd_scheduler+0x104b/0x10d0 [ko2iblnd]
2024-01-16 23:13:35 [39654.598665] ? finish_wait+0x90/0x90
2024-01-16 23:13:35 [39654.602245] ? kiblnd_cq_event+0x80/0x80 [ko2iblnd]
2024-01-16 23:13:35 [39654.607125] kthread+0x14c/0x170
2024-01-16 23:13:35 [39654.610357] ? set_kthread_struct+0x50/0x50
2024-01-16 23:13:35 [39654.614544] ret_from_fork+0x1f/0x40
2024-01-16 23:13:36 [ 0.000000] Linux version 4.18.0-513.9.1.1toss.t4.x86_64 (mockbuild@builder2-x86.buildfarm.internal) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-20) (GCC)) #1 SMP Wed Nov 29 11:04:55 PST 2023
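For anyone checking whether other server nodes hit the same assertion, a small script along these lines may help when scanning saved console logs. This is only a sketch, not part of any Lustre tooling: the regex is derived from the two LNetError records above, and the names ASSERT_RE and find_assertions are made up for illustration. It searches the whole text rather than individual lines, so it should also cope with logs whose line breaks were lost.

```python
import re

# Pattern inferred from the LNetError record in this ticket, e.g.
#   [39654.404520] LNetError: 165557:0:(peer.c:2194:lnet_destroy_peer_ni_locked())
#       ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed:
# Assumption: other LNet assertion records follow the same layout.
ASSERT_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+LNetError:\s+(?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"ASSERTION\(\s*(?P<expr>.*?)\s*\)\s+failed"
)


def find_assertions(console_text):
    """Return one dict per failed LNet assertion found in console_text."""
    return [m.groupdict() for m in ASSERT_RE.finditer(console_text)]


# Two records copied from the console output in this ticket.
sample = (
    "2024-01-16 23:13:34 [39654.404520] LNetError: 165557:0:"
    "(peer.c:2194:lnet_destroy_peer_ni_locked()) "
    "ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed: "
    "2024-01-16 23:13:34 [39654.416271] LNetError: 165557:0:"
    "(peer.c:2194:lnet_destroy_peer_ni_locked()) LBUG"
)

hits = find_assertions(sample)
for hit in hits:
    # Print source file, line number, function, and the asserted expression.
    print(hit["file"], hit["line"], hit["func"], hit["expr"])
```

Only the ASSERTION record is matched; the follow-up LBUG record from the same thread is skipped, so each crash is counted once.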
Issue Links
- is related to LU-18320 interop: sanity-lnet test_226: ASSERTION( list_empty(&lpni->lpni_peer_nis) ) failed (Resolved)