Details
Bug
Resolution: Fixed
Critical
Lustre 2.5.3
bull lustre 2.5.3.90
3
Description
On several occasions, on different OSSes, we have hit what looks like a race condition when freeing an lnet_msg.
The crash looks as follows:
LustreError: 25277:0:(lu_object.c:1463:key_fini()) ASSERTION( atomic_read(&key->lct_used) > 1 ) failed:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffffa0602924>] lnet_ptl_match_md+0x3b4/0x870 [lnet]
PGD 0
Oops: 0002 [#1] SMP
with this stack:
crash> bt
PID: 2163 TASK: ffff8803b9df2040 CPU: 25 COMMAND: "kiblnd_sd_01_02"
#0 [ffff880340ceb7e0] machine_kexec at ffffffff8103d30b
#1 [ffff880340ceb840] crash_kexec at ffffffff810cc4f2
#2 [ffff880340ceb910] oops_end at ffffffff8153d3d0
#3 [ffff880340ceb940] no_context at ffffffff8104e8cb
#4 [ffff880340ceb990] __bad_area_nosemaphore at ffffffff8104eb55
#5 [ffff880340ceb9e0] bad_area_nosemaphore at ffffffff8104ec23
#6 [ffff880340ceb9f0] __do_page_fault at ffffffff8104f31c
#7 [ffff880340cebb10] do_page_fault at ffffffff8153f31e
#8 [ffff880340cebb40] page_fault at ffffffff8153c6c5
[exception RIP: lnet_ptl_match_md+948]
RIP: ffffffffa0602924 RSP: ffff880340cebbf0 RFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880340cebcf0 RCX: ffff880c75631940
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880bc4ed4550
RBP: ffff880340cebc70 R8: ffff880bc4ed4550 R9: a500000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880749841800
R13: 0000000000000002 R14: ffff881078f8b2c0 R15: 0000000000000002
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880340cebc78] lnet_parse at ffffffffa0609fb3 [lnet]
#10 [ffff880340cebd58] kiblnd_handle_rx at ffffffffa09f19db [ko2iblnd]
#11 [ffff880340cebda8] kiblnd_rx_complete at ffffffffa09f26c3 [ko2iblnd]
#12 [ffff880340cebdf8] kiblnd_complete at ffffffffa09f2872 [ko2iblnd]
#13 [ffff880340cebe08] kiblnd_scheduler at ffffffffa09f2c2a [ko2iblnd]
#14 [ffff880340cebee8] kthread at ffffffff810a101e
#15 [ffff880340cebf48] kernel_thread at ffffffff8100c28a
The crash seems to occur here:
lnet_ptl_match_delay(struct lnet_portal *ptl,
                     struct lnet_match_info *info, struct lnet_msg *msg)
{
        ...
        if (!cfs_list_empty(&msg->msg_list)) { /* on stealing list */
                rc = lnet_mt_match_md(mtable, info, msg);
                if ((rc & LNET_MATCHMD_EXHAUSTED) != 0 &&
                    mtable->mt_enabled)
                        lnet_ptl_disable_mt(ptl, cpt);
                if ((rc & LNET_MATCHMD_FINISH) != 0)
                        cfs_list_del_init(&msg->msg_list); <=== CRASH (msg->msg_list == NULL)
        ...
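For reference, the faulting address 0000000000000008 with a write fault (Oops: 0002) is what one would expect from unlinking a list entry whose next pointer is NULL. The following is a minimal user-space sketch, not Lustre code: struct list_head here is a stand-in with the same two-pointer layout as the kernel's, zeroed_entry() is a hypothetical helper, and it assumes that cfs_list_del_init behaves like the kernel's list_del_init and that msg_list was zeroed by whatever races with this path. It also shows that a zeroed entry is not "empty", so the !cfs_list_empty() check above would not protect against it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the kernel's struct list_head (same two-pointer layout). */
struct list_head {
	struct list_head *next;
	struct list_head *prev;
};

/* Same test as list_empty(): a head is empty only if it points to itself. */
static bool list_is_empty(const struct list_head *head)
{
	return head->next == head;
}

/*
 * Address of the first store an unlink would perform. Unlinking does
 * entry->next->prev = entry->prev, so the target is &entry->next->prev.
 * Computed arithmetically so that nothing is actually dereferenced.
 */
static uintptr_t unlink_first_store_addr(const struct list_head *entry)
{
	return (uintptr_t)entry->next + offsetof(struct list_head, prev);
}

/* Hypothetical helper: models an entry whose memory was zeroed (assumption). */
static struct list_head zeroed_entry(void)
{
	struct list_head e;

	memset(&e, 0, sizeof(e));
	return e;
}

/* Self-check of the two observations described in the text. */
static void check_zeroed_entry(void)
{
	struct list_head e = zeroed_entry();

	/* next == NULL != &e, so the "not empty" branch is taken. */
	assert(!list_is_empty(&e));
	/* The first store lands at offsetof(prev), i.e. 8 on LP64 targets. */
	assert(unlink_first_store_addr(&e) == offsetof(struct list_head, prev));
}
```

Calling check_zeroed_entry() from a small main() passes on a 64-bit build, where offsetof(struct list_head, prev) is 8. This only shows that the oops is consistent with a zeroed msg_list; it does not identify which code path zeroed or freed the message.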
Can you please help analyze what could cause this race?