[LU-14635] LBUG LNetError: 50573:0:(peer.c:282:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed Created: 23/Apr/21  Updated: 23/Apr/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.9, Lustre 2.12.6 servers, clients: 2.12.6, 2.13 and 2.14


Attachments: Text File oak-io2-s1_foreachbt.txt     Text File oak-io2-s1_vmcore-dmesg_2021-04-21-17-33-41.txt    
Issue Links:
Related
is related to LU-14627 Lost ref on lnet_peer in discovery le... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have hit this OSS LBUG three times recently with Lustre 2.12.6:

[330956.154500] LNetError: 50573:0:(peer.c:282:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed: 
[330956.166945] LNetError: 50573:0:(peer.c:282:lnet_destroy_peer_locked()) LBUG
[330956.174812] Pid: 50573, comm: lnet_discovery 3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1 SMP Mon Dec 14 21:25:04 PST 2020
[330956.186861] Call Trace:
[330956.189703]  [<ffffffffc0a2f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[330956.197114]  [<ffffffffc0a2f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[330956.204125]  [<ffffffffc0d5afca>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
[330956.212254]  [<ffffffffc0d5b605>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
[330956.220715]  [<ffffffffc0d60340>] lnet_peer_discovery+0x6c0/0x1140 [lnet]
[330956.228410]  [<ffffffff9b0c5c21>] kthread+0xd1/0xe0
[330956.233965]  [<ffffffff9b794ddd>] ret_from_fork_nospec_begin+0x7/0x21
[330956.241271]  [<ffffffffffffffff>] 0xffffffffffffffff
[330956.246936] Kernel panic - not syncing: LBUG
[330956.251797] CPU: 22 PID: 50573 Comm: lnet_discovery Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.6.1.el7_lustre.pl1.x86_64 #1
[330956.266548] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.6.0 10/26/2017
[330956.274996] Call Trace:
[330956.277828]  [<ffffffff9b781400>] dump_stack+0x19/0x1b
[330956.283659]  [<ffffffff9b77a958>] panic+0xe8/0x21f
[330956.289118]  [<ffffffffc0a2f8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[330956.296131]  [<ffffffffc0d5afca>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
[330956.304198]  [<ffffffffc0d5b605>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
[330956.312653]  [<ffffffffc0d60340>] lnet_peer_discovery+0x6c0/0x1140 [lnet]
[330956.320328]  [<ffffffff9b0c6d10>] ? wake_up_atomic_t+0x30/0x30
[330956.326940]  [<ffffffffc0d5fc80>] ? lnet_peer_merge_data+0xe00/0xe00 [lnet]
[330956.334805]  [<ffffffff9b0c5c21>] kthread+0xd1/0xe0
[330956.340344]  [<ffffffff9b0c5b50>] ? insert_kthread_work+0x40/0x40
[330956.347242]  [<ffffffff9b794ddd>] ret_from_fork_nospec_begin+0x7/0x21
[330956.354526]  [<ffffffff9b0c5b50>] ? insert_kthread_work+0x40/0x40

It looks like a duplicate of LU-13652, which is supposedly fixed in 2.12.6. However, since we're running Lustre 2.12.6 on all servers on this system (Oak), I'm opening a new issue tagged 2.12.6.

Attaching "foreach bt" from the crash dump (available upon request) as oak-io2-s1_foreachbt.txt
Attaching vmcore-dmesg.txt as oak-io2-s1_vmcore-dmesg_2021-04-21-17-33-41.txt

 
LNet discovery is supposed to be enabled everywhere:

[root@oak-io2-s1 127.0.0.1-2021-04-21-17:33:41]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
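
To confirm the discovery setting on many nodes, the output above can be checked mechanically. This is a minimal sketch, assuming the flat "key: value" layout shown in this report; the function names parse_lnetctl_global and discovery_enabled are hypothetical helpers, not part of lnetctl or Lustre, and the captured text would have to be collected from each node separately.

```python
# Sketch only: parses text captured from `lnetctl global show`.
# Assumes the flat "key: value" layout shown in this report.

SAMPLE = """global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
"""


def parse_lnetctl_global(text):
    """Return the settings listed under 'global:' as a dict of strings."""
    settings = {}
    in_global = False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == "global:":
            in_global = True
            continue
        if in_global and ":" in stripped:
            key, _, value = stripped.partition(":")
            settings[key.strip()] = value.strip()
    return settings


def discovery_enabled(text):
    """True when the captured output reports 'discovery: 1'."""
    return parse_lnetctl_global(text).get("discovery") == "1"


if __name__ == "__main__":
    print("discovery enabled:", discovery_enabled(SAMPLE))
```

Values are kept as strings so that a non-numeric setting does not break the parse; only the discovery key is interpreted.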

 



 Comments   
Comment by Peter Jones [ 23/Apr/21 ]

Serguei

Could you please advise?

Thanks

Peter

Comment by Serguei Smirnov [ 23/Apr/21 ]

It looks like the same issue is tracked in LU-14627.

Generated at Sat Feb 10 03:11:26 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.