Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.12.6
- Labels: None
- Environment: CentOS 7.9, Lustre 2.12.6 servers, clients: 2.12.6, 2.13 and 2.14
- Severity: 3
- Rank (Obsolete): 9223372036854775807
Description
We have hit this OSS LBUG three times recently with Lustre 2.12.6:
[330956.154500] LNetError: 50573:0:(peer.c:282:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed:
[330956.166945] LNetError: 50573:0:(peer.c:282:lnet_destroy_peer_locked()) LBUG
[330956.174812] Pid: 50573, comm: lnet_discovery 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
[330956.186861] Call Trace:
[330956.189703] [<ffffffffc0a2f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[330956.197114] [<ffffffffc0a2f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[330956.204125] [<ffffffffc0d5afca>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
[330956.212254] [<ffffffffc0d5b605>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
[330956.220715] [<ffffffffc0d60340>] lnet_peer_discovery+0x6c0/0x1140 [lnet]
[330956.228410] [<ffffffff9b0c5c21>] kthread+0xd1/0xe0
[330956.233965] [<ffffffff9b794ddd>] ret_from_fork_nospec_begin+0x7/0x21
[330956.241271] [<ffffffffffffffff>] 0xffffffffffffffff
[330956.246936] Kernel panic - not syncing: LBUG
[330956.251797] CPU: 22 PID: 50573 Comm: lnet_discovery Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1
[330956.266548] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.6.0 10/26/2017
[330956.274996] Call Trace:
[330956.277828] [<ffffffff9b781400>] dump_stack+0x19/0x1b
[330956.283659] [<ffffffff9b77a958>] panic+0xe8/0x21f
[330956.289118] [<ffffffffc0a2f8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[330956.296131] [<ffffffffc0d5afca>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
[330956.304198] [<ffffffffc0d5b605>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
[330956.312653] [<ffffffffc0d60340>] lnet_peer_discovery+0x6c0/0x1140 [lnet]
[330956.320328] [<ffffffff9b0c6d10>] ? wake_up_atomic_t+0x30/0x30
[330956.326940] [<ffffffffc0d5fc80>] ? lnet_peer_merge_data+0xe00/0xe00 [lnet]
[330956.334805] [<ffffffff9b0c5c21>] kthread+0xd1/0xe0
[330956.340344] [<ffffffff9b0c5b50>] ? insert_kthread_work+0x40/0x40
[330956.347242] [<ffffffff9b794ddd>] ret_from_fork_nospec_begin+0x7/0x21
[330956.354526] [<ffffffff9b0c5b50>] ? insert_kthread_work+0x40/0x40
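For anyone comparing this trace against the one in LU-13652, a small script can pull the symbol and module out of each frame. The following is only a sketch: the names `FRAME_RE` and `parse_frames` are hypothetical, the sample is copied from the first call trace above, and the regular expression assumes the `[<address>] symbol+0xoff/0xsize [module]` layout shown in this log. Frames prefixed with `?` are flagged as unreliable.

```python
import re

# Matches one stack frame such as:
#   [<ffffffffc0d5afca>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
# Group 1: optional "? " marker, group 2: symbol, group 3: optional module.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)

# Sample copied from the first call trace in this ticket.
SAMPLE = """\
[330956.189703] [<ffffffffc0a2f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[330956.197114] [<ffffffffc0a2f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[330956.204125] [<ffffffffc0d5afca>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
[330956.212254] [<ffffffffc0d5b605>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
[330956.220715] [<ffffffffc0d60340>] lnet_peer_discovery+0x6c0/0x1140 [lnet]
[330956.228410] [<ffffffff9b0c5c21>] kthread+0xd1/0xe0
"""


def parse_frames(log_text):
    """Return one dict per stack frame found in log_text, in log order."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        frames.append({
            "symbol": match.group(2),
            "module": match.group(3),          # None for core kernel symbols
            "reliable": match.group(1) is None,  # False when prefixed with "?"
        })
    return frames


if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        print(frame["symbol"], frame["module"], frame["reliable"])
```

Because the pattern works on frame text rather than on line boundaries, it should also cope with a log that was pasted as a single line.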
It looks like a duplicate of LU-13652, which is supposedly fixed in 2.12.6. Since we're running Lustre 2.12.6 on all servers on this system (Oak), I'm opening a new issue tagged 2.12.6.
Attaching "foreach bt" from the crash dump (available upon request) as oak-io2-s1_foreachbt.txt
Attaching vmcore-dmesg.txt as oak-io2-s1_vmcore-dmesg_2021-04-21-17-33-41.txt
LNet discovery is supposed to be enabled everywhere:
[root@oak-io2-s1 127.0.0.1-2021-04-21-17:33:41]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
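To confirm that discovery really is enabled on every node, the same output could be collected from each host and checked with a script. The sketch below is illustrative only: `parse_global_show` and `discovery_enabled` are hypothetical helper names, the sample text is the output shown above, and the script does not run `lnetctl` itself. It assumes the captured text has already been gathered by whatever means the site uses.

```python
import re

# Output of `lnetctl global show` as captured on oak-io2-s1 (from this ticket).
SAMPLE = """\
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
"""

# A key must start with a letter or underscore, so timestamps such as
# "17:33:41" in a shell prompt are not mistaken for settings.
SETTING_RE = re.compile(r"([A-Za-z_]\w*):\s*(-?\d+)\b")


def parse_global_show(text):
    """Return the integer settings found in `lnetctl global show` output."""
    return {key: int(value) for key, value in SETTING_RE.findall(text)}


def discovery_enabled(settings):
    """True when the parsed settings report discovery: 1."""
    return settings.get("discovery") == 1


if __name__ == "__main__":
    settings = parse_global_show(SAMPLE)
    print("discovery enabled:", discovery_enabled(settings))
```

A regular expression is used instead of a YAML parser so that the sketch depends only on the standard library and also tolerates output that was pasted on a single line.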
Attachments
Issue Links
- is related to
  - LU-14627 Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed: (Resolved)