[LU-17207] LNet: race between monitor thread stopping and discovery thread PUSH may cause a crash Created: 17/Oct/23  Updated: 09/Nov/23  Resolved: 09/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Serguei Smirnov Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17103 sanity-lnet test_207: timed out Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This crash showed up in Janitor testing for https://review.whamcloud.com/#/c/fs/lustre-release/+/52522/ (LU-17103):

Crash with lnet_attach_rsp_tracker as the latest Lustre function in the backtrace:
BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffffa01e3acc>] lnet_attach_rsp_tracker.isra.29+0xcc/0x1a0 [lnet]
PGD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) veth crc32_generic crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs pcspkr i2c_piix4 i2c_core binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata [last unloaded: libcfs]
CPU: 3 PID: 8711 Comm: lnet_discovery Kdump: loaded Tainted: G           OE  ------------   3.10.0-7.9-debug #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
task: ffff8800aa96d550 ti: ffff8800b4348000 task.ti: ffff8800b4348000
RIP: 0010:[<ffffffffa01e3acc>]  [<ffffffffa01e3acc>] lnet_attach_rsp_tracker.isra.29+0xcc/0x1a0 [lnet]
RSP: 0018:ffff8800b434bcb8  EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000017
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8800b639d380
RBP: ffff8800b434bce8 R08: 00000000d5555b28 R09: 632e65766f6d2d62
R10: 0000000000000180 R11: ffff8800b434bb86 R12: ffff8800b639d380
R13: ffff8800b639d380 R14: 0000000000000000 R15: 0000007175e68a80
FS:  0000000000000000(0000) GS:ffff88013e380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000b6238000 CR4: 00000000000006e0
Call Trace:
 [<ffffffffa01eaf8a>] LNetPut+0x29a/0x9c0 [lnet]
 [<ffffffffa01fd7bc>] lnet_peer_send_push+0x2ec/0x440 [lnet]
 [<ffffffffa0206760>] ? lnet_discovery_event_reply+0xc70/0xc70 [lnet]
 [<ffffffffa0208488>] lnet_peer_discovery+0x4a8/0x1710 [lnet]
 [<ffffffff817e8dce>] ? _raw_spin_unlock_irq+0xe/0x30
 [<ffffffff817e60fa>] ? __schedule+0x32a/0x7d0
 [<ffffffff810bb2a0>] ? wake_up_atomic_t+0x30/0x30
 [<ffffffffa0207fe0>] ? lnet_peer_merge_data+0x1230/0x1230 [lnet]
 [<ffffffff810ba114>] kthread+0xe4/0xf0
 [<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff817f3e5d>] ret_from_fork_nospec_begin+0x7/0x21
 [<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140

As pointed out by Chris Horn, this may be the result of the discovery thread issuing a push while the monitor thread is stopping:

"Monitor thread is stopping and the ln_mt_resendqs are freed. Discovery then wakes and tries to issue push which attempts to dereference ln_mt_resendqs"

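The race described above can be illustrated with a small userspace sketch. This is not the patch that landed and it does not use the real LNet structures: struct fake_lnet, monitor_start(), monitor_stop() and discovery_push() are hypothetical names, a pthread mutex stands in for whatever locking LNet actually uses, and an int array stands in for ln_mt_resendqs. It only shows the general idea of checking the monitor state under a lock before touching queues that the stopping thread may already have freed.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for LNet state; NOT the real lnet structures. */
enum mt_state { MT_STOPPED, MT_RUNNING, MT_STOPPING };

struct fake_lnet {
	pthread_mutex_t lock;      /* protects every field below */
	enum mt_state   mt_state;  /* monitor thread lifecycle */
	int            *resendqs;  /* stands in for ln_mt_resendqs */
	unsigned int    nqueues;
};

static struct fake_lnet the_lnet = {
	.lock     = PTHREAD_MUTEX_INITIALIZER,
	.mt_state = MT_STOPPED,
};

/* Allocate the queues and mark the monitor as running. */
int monitor_start(unsigned int nqueues)
{
	int *q;

	assert(nqueues > 0);
	q = calloc(nqueues, sizeof(*q));
	if (q == NULL)
		return -ENOMEM;

	pthread_mutex_lock(&the_lnet.lock);
	the_lnet.resendqs = q;
	the_lnet.nqueues  = nqueues;
	the_lnet.mt_state = MT_RUNNING;
	pthread_mutex_unlock(&the_lnet.lock);
	return 0;
}

/* Detach the queues under the lock, then free them outside it. */
void monitor_stop(void)
{
	int *q;

	pthread_mutex_lock(&the_lnet.lock);
	the_lnet.mt_state = MT_STOPPING;
	q = the_lnet.resendqs;
	the_lnet.resendqs = NULL;
	the_lnet.nqueues  = 0;
	pthread_mutex_unlock(&the_lnet.lock);

	free(q);

	pthread_mutex_lock(&the_lnet.lock);
	the_lnet.mt_state = MT_STOPPED;
	pthread_mutex_unlock(&the_lnet.lock);
}

/*
 * Discovery-side push: refuse to touch the queues unless the monitor
 * is running. Without this check, a push racing with monitor_stop()
 * would dereference a NULL (or freed) resendqs pointer.
 */
int discovery_push(unsigned int cpt)
{
	int rc = 0;

	pthread_mutex_lock(&the_lnet.lock);
	if (the_lnet.mt_state != MT_RUNNING || the_lnet.resendqs == NULL)
		rc = -ESHUTDOWN;
	else
		the_lnet.resendqs[cpt % the_lnet.nqueues]++;
	pthread_mutex_unlock(&the_lnet.lock);
	return rc;
}
```

In this sketch a push issued before monitor_start() or after monitor_stop() returns -ESHUTDOWN instead of dereferencing the queue pointer. The choice of error code is also an assumption made for illustration.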

 Comments   
Comment by Gerrit Updater [ 17/Oct/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52734
Subject: LU-17207 lnet: race b/w monitor thr stop and discovery push
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 58f00213c811a05d1a96b44ef17b608c3fb883f6

Comment by Gerrit Updater [ 08/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52734/
Subject: LU-17207 lnet: race b/w monitor thr stop and discovery push
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 36b14a23a6e8240045074b097adfe01cb529d4a3

Comment by Peter Jones [ 09/Nov/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:33:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.