Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.5.3, bull lustre 2.5.3.90

    Description

      We have hit, several times and on different OSSes, what looks like a race condition when freeing an lnet_msg.
      The crash looks as follows:

      LustreError: 25277:0:(lu_object.c:1463:key_fini()) ASSERTION( atomic_read(&key->lct_used) > 1 ) failed:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      IP: [<ffffffffa0602924>] lnet_ptl_match_md+0x3b4/0x870 [lnet]
      PGD 0
      Oops: 0002 [#1] SMP

      with this stack:

      crash> bt
      PID: 2163   TASK: ffff8803b9df2040  CPU: 25  COMMAND: "kiblnd_sd_01_02"
       #0 [ffff880340ceb7e0] machine_kexec at ffffffff8103d30b
       #1 [ffff880340ceb840] crash_kexec at ffffffff810cc4f2
       #2 [ffff880340ceb910] oops_end at ffffffff8153d3d0
       #3 [ffff880340ceb940] no_context at ffffffff8104e8cb
       #4 [ffff880340ceb990] __bad_area_nosemaphore at ffffffff8104eb55
       #5 [ffff880340ceb9e0] bad_area_nosemaphore at ffffffff8104ec23
       #6 [ffff880340ceb9f0] __do_page_fault at ffffffff8104f31c
       #7 [ffff880340cebb10] do_page_fault at ffffffff8153f31e
       #8 [ffff880340cebb40] page_fault at ffffffff8153c6c5
          [exception RIP: lnet_ptl_match_md+948]
          RIP: ffffffffa0602924  RSP: ffff880340cebbf0  RFLAGS: 00010202
          RAX: 0000000000000000  RBX: ffff880340cebcf0  RCX: ffff880c75631940
          RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffff880bc4ed4550
          RBP: ffff880340cebc70   R8: ffff880bc4ed4550   R9: a500000000000000
          R10: 0000000000000000  R11: 0000000000000000  R12: ffff880749841800
          R13: 0000000000000002  R14: ffff881078f8b2c0  R15: 0000000000000002
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff880340cebc78] lnet_parse at ffffffffa0609fb3 [lnet]
      #10 [ffff880340cebd58] kiblnd_handle_rx at ffffffffa09f19db [ko2iblnd]
      #11 [ffff880340cebda8] kiblnd_rx_complete at ffffffffa09f26c3 [ko2iblnd]
      #12 [ffff880340cebdf8] kiblnd_complete at ffffffffa09f2872 [ko2iblnd]
      #13 [ffff880340cebe08] kiblnd_scheduler at ffffffffa09f2c2a [ko2iblnd]
      #14 [ffff880340cebee8] kthread at ffffffff810a101e
      #15 [ffff880340cebf48] kernel_thread at ffffffff8100c28a

      The crash seems to occur here, in lnet_ptl_match_delay():

      lnet_ptl_match_delay(struct lnet_portal *ptl,
                   struct lnet_match_info *info, struct lnet_msg *msg)
      {
      ...
              if (!cfs_list_empty(&msg->msg_list)) { /* on stealing list */
                  rc = lnet_mt_match_md(mtable, info, msg);

                  if ((rc & LNET_MATCHMD_EXHAUSTED) != 0 &&
                      mtable->mt_enabled)
                      lnet_ptl_disable_mt(ptl, cpt);

                  if ((rc & LNET_MATCHMD_FINISH) != 0)
                      cfs_list_del_init(&msg->msg_list); /* <=== CRASH: the msg_list link pointers are NULL */
      ....

      Could you please help analyze what might be causing this race?

      Attachments

        Activity

          [LU-7324] Race condition on deleting lnet_msg

          Landed for 2.8.0

          jgmitter Joseph Gmitter (Inactive) added a comment

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17840/
          Subject: LU-7324 lnet: Use after free in lnet_ptl_match_delay()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 607f6919ea67b101796630d4b55649a12ea0e859

          gerrit Gerrit Updater added a comment

          Olaf, yes I think your patch is quite reasonable, thanks.

          liang Liang Zhen (Inactive) added a comment

          Hi Liang. I had considered adding an LNET_MATCHMD_DELAYED code, and do believe that approach would work. I ultimately decided against it in my proposal because I preferred to not change the interface of lnet_ptl_match_md().

          olaf Olaf Weber (Inactive) added a comment

          Please ignore this patch. Olaf's patch is essentially the same, so I will review his patch instead. Sorry, I didn't read through the full comments.

          Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/18081 (abandoned)

          gerrit Gerrit Updater added a comment - edited

          Hello Olaf,
          Sorry for this late update, but I wanted to be sure I fully understood the pieces of code involved before answering you.

          I can now say that you were already far ahead of me here, and I am fairly convinced that your patch should make it in, particularly the change to no longer access msg->msg_rx_delayed outside of the lnet_ptl_[un]lock()/lnet_res_[un]lock() protection, and to use/check the LNET_MATCHMD_NONE return value instead, in both the lnet_ptl_match_md() and lnet_ptl_match_delay() routines.

          My previous doubts came from the analysis we did of the original crashes reported in this ticket, where the corruption (in fact a reuse after free) caused a GPF in lnet_ptl_match_delay() when trying to list_del_init(&msg->msg_list) after the !list_empty(&msg->msg_list) condition had been verified. But I am now convinced this can happen because, after a reuse-after-free that mostly zeroes the old lnet_msg content, msg->msg_rx_delayed could have led to an unexpected additional loop!

          Thus I have abandoned http://review.whamcloud.com/17847, and since your http://review.whamcloud.com/17840 has triggered some unexpected and related errors, I have restarted its auto-test session.

          bfaccini Bruno Faccini (Inactive) added a comment

          Hi Bruno,

          Can you explain in a bit more detail why you think lnet_res_lock(LNET_LOCK_EX) is required? As far as I can tell, avoiding any reference to a message on the delay queue is sufficient.

          olaf Olaf Weber (Inactive) added a comment

          Hello Olaf, I think your analysis of the code involved in this issue is correct, as far as I understand it myself!
          But after my first, incorrect fix attempt, and the rework that followed with the help of your comments, I don't see how to quickly fix this without using LNET_LOCK_EX protection in lnet_ptl_match_delay().

          bfaccini Bruno Faccini (Inactive) added a comment

          Hi Bruno,

          FWIW, what I understand the code attempts to do is as follows: when looking for a matching MD for an incoming message, it may be the case that there is no such MD associated with the current CPT, as determined by lnet_mt_match_md(). At that point there are two choices: receive the message anyway and cache it locally until an MD is available, or delay receiving the message until an MD becomes available.

          If there is an lnd_t::lnd_eager_recv() callout then the message is received from the network anyway, and kept around on the node until an MD is attached to the portal. This is not the case that results in the failures discussed in this LU.

          If there is no lnd_t::lnd_eager_recv() then lnet_ptl_match_md() calls lnet_ptl_match_delay() to look at the match tables for other CPTs, to see if a matching MD can be found that way. And if there is none, it puts the message on the lnet_portal_t::ptl_msg_delayed queue to be processed when a matching MD is attached by lnet_ptl_attach_md().

          So lnet_ptl_match_delay() and lnet_ptl_attach_md() need to synchronize. The lnet_portal_t::ptl_msg_delayed and lnet_portal_t::ptl_msg_stealing queues are used for that. In lnet_ptl_match_delay() the message is put on the stealing queue so that lnet_ptl_attach_md() can signal that an MD is now available and that the message can be received and processed. lnet_ptl_attach_md() does this by removing it from the stealing queue, and is careful not to touch the message after that.

          In lnet_ptl_match_delay() the message is put on the delay queue so that lnet_ptl_attach_md() can pick it up to complete receiving and processing it. Once it has added a message to the delay queue and dropped the lnet_ptl_lock and lnet_res_lock, lnet_ptl_match_delay() has lost control over the message and should no longer reference it, and the same applies to its callers. Unfortunately, the current code does reference the message after dropping the locks. I believe this is the cause of the crashes.

          Regarding lnet_ptl_match_delay(), there are several comments in the code saying that the "delay" case is not expected to happen often. That this race condition can be hit fairly reliably argues that some workloads exercise it regularly. And that in turn suggests that the worries (also expressed in the comments) that this code path is comparatively expensive are valid. The amount of work done in the loop in lnet_ptl_match_delay() can be considerable, which is why I think doing the entire loop under lnet_res_lock(LNET_LOCK_EX) might not be a good idea.

          olaf Olaf Weber (Inactive) added a comment

          Hi Bruno,
          Your proposed change prevents additions to the ptl_msg_stealing list: anything added by lnet_ptl_match_delay() is also removed before the lnet_res_lock or lnet_ptl_lock are dropped, and lnet_ptl_match_delay() is the only routine that adds to this list. Therefore that field, and all code relating to it, should be removed as well. This might even be a good idea, but I don't think it qualifies as a conservative change.

          olaf Olaf Weber (Inactive) added a comment

          People

            hdoreau Henri Doreau (Inactive)
            spiechurski Sebastien Piechurski
            Votes: 0
            Watchers: 13

            Dates

              Created:
              Updated:
              Resolved: