Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.16.0
Affects Version/s: None
Labels: None
Severity: 3

Description

This crash showed up in Janitor testing for https://review.whamcloud.com/#/c/fs/lustre-release/+/52522/ (LU-17103):

The most recent Lustre function in the backtrace is lnet_attach_rsp_tracker:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffffa01e3acc>] lnet_attach_rsp_tracker.isra.29+0xcc/0x1a0 [lnet]
PGD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) veth crc32_generic crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs pcspkr i2c_piix4 i2c_core binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata [last unloaded: libcfs]
CPU: 3 PID: 8711 Comm: lnet_discovery Kdump: loaded Tainted: G           OE  ------------   3.10.0-7.9-debug #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
task: ffff8800aa96d550 ti: ffff8800b4348000 task.ti: ffff8800b4348000
RIP: 0010:[<ffffffffa01e3acc>]  [<ffffffffa01e3acc>] lnet_attach_rsp_tracker.isra.29+0xcc/0x1a0 [lnet]
RSP: 0018:ffff8800b434bcb8  EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000017
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8800b639d380
RBP: ffff8800b434bce8 R08: 00000000d5555b28 R09: 632e65766f6d2d62
R10: 0000000000000180 R11: ffff8800b434bb86 R12: ffff8800b639d380
R13: ffff8800b639d380 R14: 0000000000000000 R15: 0000007175e68a80
FS:  0000000000000000(0000) GS:ffff88013e380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000b6238000 CR4: 00000000000006e0
Call Trace:
 [<ffffffffa01eaf8a>] LNetPut+0x29a/0x9c0 [lnet]
 [<ffffffffa01fd7bc>] lnet_peer_send_push+0x2ec/0x440 [lnet]
 [<ffffffffa0206760>] ? lnet_discovery_event_reply+0xc70/0xc70 [lnet]
 [<ffffffffa0208488>] lnet_peer_discovery+0x4a8/0x1710 [lnet]
 [<ffffffff817e8dce>] ? _raw_spin_unlock_irq+0xe/0x30
 [<ffffffff817e60fa>] ? __schedule+0x32a/0x7d0
 [<ffffffff810bb2a0>] ? wake_up_atomic_t+0x30/0x30
 [<ffffffffa0207fe0>] ? lnet_peer_merge_data+0x1230/0x1230 [lnet]
 [<ffffffff810ba114>] kthread+0xe4/0xf0
 [<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff817f3e5d>] ret_from_fork_nospec_begin+0x7/0x21
 [<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140

As pointed out by Chris Horn, this may be the result of the discovery thread issuing a push while the monitor thread is stopping:

"Monitor thread is stopping and the ln_mt_resendqs are freed. Discovery then wakes and tries to issue push which attempts to dereference ln_mt_resendqs"

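The suspected race can be modelled in user space. The sketch below is only an illustration of one possible guard, not the fix that landed for this ticket: struct mt_model, mt_start(), mt_stop() and push_attach() are hypothetical names, the int array merely stands in for ln_mt_resendqs, and a pthread mutex stands in for whatever locking LNet actually uses. The idea is that the sender checks, under the same lock the teardown path takes, that the monitor is still running before it touches the queues.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the monitor-thread state; not the real struct lnet. */
struct mt_model {
	pthread_mutex_t lock;
	int *resendqs;	/* models ln_mt_resendqs; NULL once freed */
	int nqueues;
	int running;	/* non-zero while the monitor thread is up */
};

static struct mt_model mt = { PTHREAD_MUTEX_INITIALIZER, NULL, 0, 0 };

/* Monitor start: allocate the per-partition queues and mark the monitor running. */
int mt_start(int nqueues)
{
	int rc = -ENOMEM;

	pthread_mutex_lock(&mt.lock);
	mt.resendqs = calloc(nqueues, sizeof(*mt.resendqs));
	if (mt.resendqs != NULL) {
		mt.nqueues = nqueues;
		mt.running = 1;
		rc = 0;
	}
	pthread_mutex_unlock(&mt.lock);
	return rc;
}

/* Monitor stop: clear the running flag and free the queues under the lock. */
void mt_stop(void)
{
	pthread_mutex_lock(&mt.lock);
	mt.running = 0;
	free(mt.resendqs);
	mt.resendqs = NULL;
	mt.nqueues = 0;
	pthread_mutex_unlock(&mt.lock);
}

/*
 * Models the discovery thread's push path: refuse to touch the queues
 * unless the monitor is still running, instead of dereferencing NULL.
 */
int push_attach(int cpt)
{
	int rc = -ESHUTDOWN;

	pthread_mutex_lock(&mt.lock);
	if (mt.running && mt.resendqs != NULL) {
		assert(cpt >= 0 && cpt < mt.nqueues);
		mt.resendqs[cpt]++;	/* stand-in for attaching a response tracker */
		rc = 0;
	}
	pthread_mutex_unlock(&mt.lock);
	return rc;
}
```

In this model, a push attempted after mt_stop() returns -ESHUTDOWN rather than crashing. Whether the real code should check state in the push path, or instead stop discovery before the monitor thread is torn down, is a design question this sketch does not settle.
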
Issue Links

is related to: LU-17103 sanity-lnet test_207: timed out (Resolved)

People

Assignee: Serguei Smirnov

Reporter: Serguei Smirnov

Votes: 0

Watchers: 6

Dates

Created: 17/Oct/23 6:25 PM

Updated: 09/Nov/23 8:39 AM

Resolved: 09/Nov/23 12:45 AM