[LU-16986] LNetError: 26687:0:(o2iblnd.c:992:kiblnd_destroy_conn()) ASSERTION( conn->ibc_nsends_posted == 0 ) failed
Created: 26/Jul/23
Updated: 26/Jul/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.3
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Stephane Thiell
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None
Environment: CentOS 7.9, kernel 3.10.0-1160.90.1.el7_lustre.pl1.x86_64
Attachments: fir-io2-s1-20230724-vmcore-dmesg.txt (text file)
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

With Lustre 2.15.3 on both clients (Sherlock) and servers (Fir), we hit this assertion on a Lustre OSS (fir-io2-s1) while redeploying LNet routers from 2.15.2 to 2.15.3. Although we redeployed the routers slowly, I believe doing so put enough stress on LNet to trigger this condition. A vmcore is available upon request.

[1927361.489487] LNetError: 26687:0:(o2iblnd.c:992:kiblnd_destroy_conn()) ASSERTION( conn->ibc_nsends_posted == 0 ) failed:
[1927361.500446] LNetError: 26687:0:(o2iblnd.c:992:kiblnd_destroy_conn()) LBUG
[1927361.507427] Pid: 26687, comm: kiblnd_connd 3.10.0-1160.90.1.el7_lustre.pl1.x86_64 #1 SMP Tue Jun 20 15:47:49 PDT 2023
[1927361.518344] Call Trace:
[1927361.520991] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[1927361.526319] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[1927361.531308] [<0>] kiblnd_destroy_conn+0x476/0x650 [ko2iblnd]
[1927361.537253] [<0>] kiblnd_connd+0xfa/0xcb0 [ko2iblnd]
[1927361.542393] [<0>] kthread+0xd1/0xe0
[1927361.546067] [<0>] ret_from_fork_nospec_begin+0x7/0x21
[1927361.551328] [<0>] 0xfffffffffffffffe
[1927361.555087] Kernel panic - not syncing: LBUG
[1927361.559528] CPU: 59 PID: 26687 Comm: kiblnd_connd Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.90.1.el7_lustre.pl1.x86_64 #1
[1927361.572724] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.11.3 02/24/2023
[1927361.580548] Call Trace:
[1927361.583167]  [<ffffffff985b1bec>] dump_stack+0x19/0x1f
[1927361.588478]  [<ffffffff985ab708>] panic+0xe8/0x21f
[1927361.593445]  [<ffffffffc092f5eb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[1927361.599796]  [<ffffffffc0a0ca36>] kiblnd_destroy_conn+0x476/0x650 [ko2iblnd]
[1927361.607016]  [<ffffffffc0a1e72a>] kiblnd_connd+0xfa/0xcb0 [ko2iblnd]
[1927361.613541]  [<ffffffff97ecc790>] ? wake_up_atomic_t+0x40/0x40
[1927361.619546]  [<ffffffffc0a1e630>] ? kiblnd_cm_callback+0x2140/0x2140 [ko2iblnd]
[1927361.627023]  [<ffffffff97ecb621>] kthread+0xd1/0xe0
[1927361.632074]  [<ffffffff97ecb550>] ? insert_kthread_work+0x40/0x40
[1927361.638774]  [<ffffffff985c51dd>] ret_from_fork_nospec_begin+0x7/0x21
[1927361.645387]  [<ffffffff97ecb550>] ? insert_kthread_work+0x40/0x40

Attaching vmcore-dmesg.txt as fir-io2-s1-20230724-vmcore-dmesg.txt
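
For anyone triaging similar reports, a minimal sketch of how the failed-assertion lines could be pulled out of a vmcore-dmesg.txt file. The regular expression is an assumption based only on the two LNetError lines quoted above, and the names ASSERT_RE and find_assertions are hypothetical helpers, not part of Lustre or any existing tool.

```python
import re

# Assumed line format, inferred from the log excerpt in this ticket:
# [timestamp] LNetError: pid:0:(file:line:func()) ASSERTION( expr ) failed
ASSERT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LNetError:\s+(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"ASSERTION\(\s*(?P<expr>.*?)\s*\)\s+failed"
)


def find_assertions(dmesg_text):
    """Return one dict per failed-assertion line found in dmesg_text."""
    hits = []
    for line in dmesg_text.splitlines():
        m = ASSERT_RE.search(line)
        if m:
            d = m.groupdict()
            # Convert numeric fields so they can be sorted or compared.
            d["ts"] = float(d["ts"])
            d["pid"] = int(d["pid"])
            d["line"] = int(d["line"])
            hits.append(d)
    return hits


# The assertion line from this ticket, used as sample input.
sample = (
    "[1927361.489487] LNetError: 26687:0:"
    "(o2iblnd.c:992:kiblnd_destroy_conn()) "
    "ASSERTION( conn->ibc_nsends_posted == 0 ) failed:"
)

if __name__ == "__main__":
    for hit in find_assertions(sample):
        print(hit["func"], hit["file"], hit["line"], hit["expr"])
```

To scan the attachment instead of the sample string, read the file contents and pass them to find_assertions. Lines that do not match the assumed format (for example the LBUG line or the call trace) are skipped.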

