Details
- Bug
- Resolution: Fixed
- Minor
Description
This crash showed up in Janitor testing for https://review.whamcloud.com/#/c/fs/lustre-release/+/52522/ (LU-17103):
Crash with the latest Lustre; lnet_attach_rsp_tracker appears in the backtrace:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa01e3acc>] lnet_attach_rsp_tracker.isra.29+0xcc/0x1a0 [lnet]
PGD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) veth crc32_generic crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 squashfs pcspkr i2c_piix4 i2c_core binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi ata_piix serio_raw libata [last unloaded: libcfs]
CPU: 3 PID: 8711 Comm: lnet_discovery Kdump: loaded Tainted: G OE ------------ 3.10.0-7.9-debug #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
task: ffff8800aa96d550 ti: ffff8800b4348000 task.ti: ffff8800b4348000
RIP: 0010:[<ffffffffa01e3acc>] [<ffffffffa01e3acc>] lnet_attach_rsp_tracker.isra.29+0xcc/0x1a0 [lnet]
RSP: 0018:ffff8800b434bcb8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000017
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8800b639d380
RBP: ffff8800b434bce8 R08: 00000000d5555b28 R09: 632e65766f6d2d62
R10: 0000000000000180 R11: ffff8800b434bb86 R12: ffff8800b639d380
R13: ffff8800b639d380 R14: 0000000000000000 R15: 0000007175e68a80
FS: 0000000000000000(0000) GS:ffff88013e380000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000b6238000 CR4: 00000000000006e0
Call Trace:
[<ffffffffa01eaf8a>] LNetPut+0x29a/0x9c0 [lnet]
[<ffffffffa01fd7bc>] lnet_peer_send_push+0x2ec/0x440 [lnet]
[<ffffffffa0206760>] ? lnet_discovery_event_reply+0xc70/0xc70 [lnet]
[<ffffffffa0208488>] lnet_peer_discovery+0x4a8/0x1710 [lnet]
[<ffffffff817e8dce>] ? _raw_spin_unlock_irq+0xe/0x30
[<ffffffff817e60fa>] ? __schedule+0x32a/0x7d0
[<ffffffff810bb2a0>] ? wake_up_atomic_t+0x30/0x30
[<ffffffffa0207fe0>] ? lnet_peer_merge_data+0x1230/0x1230 [lnet]
[<ffffffff810ba114>] kthread+0xe4/0xf0
[<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140
[<ffffffff817f3e5d>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffff810ba030>] ? kthread_create_on_node+0x140/0x140
As pointed out by Chris Horn, this may be the result of the discovery thread issuing a push while the monitor thread is stopping:
"Monitor thread is stopping and the ln_mt_resendqs are freed. Discovery then wakes and tries to issue push which attempts to dereference ln_mt_resendqs"
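The suspected ordering can be modelled in user space. The sketch below is illustrative only and is not the Lustre code or the patch that resolved this ticket: struct model_net and the model_* functions are hypothetical names, the resendqs field merely stands in for ln_mt_resendqs, and a pthread mutex replaces whatever locking LNet actually uses. It shows one way a sender could refuse to attach a tracker once the monitor has freed its queues, instead of dereferencing a NULL pointer.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space model; none of these names are Lustre's own. */
struct model_net {
	pthread_mutex_t lock;     /* protects running, resendqs, nqueues */
	bool            running;  /* false once the monitor starts stopping */
	int            *resendqs; /* stands in for ln_mt_resendqs */
	int             nqueues;
};

/* Monitor start: allocate the per-partition queues. */
static int model_monitor_start(struct model_net *net, int nqueues)
{
	int *qs;

	if (nqueues <= 0)
		return -EINVAL;
	qs = calloc(nqueues, sizeof(*qs));
	if (qs == NULL)
		return -ENOMEM;
	pthread_mutex_init(&net->lock, NULL);
	net->resendqs = qs;
	net->nqueues = nqueues;
	net->running = true;
	return 0;
}

/* Monitor stop: free the queues and clear the pointer under the lock. */
static void model_monitor_stop(struct model_net *net)
{
	pthread_mutex_lock(&net->lock);
	net->running = false;
	free(net->resendqs);
	net->resendqs = NULL;
	net->nqueues = 0;
	pthread_mutex_unlock(&net->lock);
}

/*
 * Discovery-side push: check the monitor state before touching the
 * queues. Without this check, a push issued after model_monitor_stop()
 * would index through a NULL pointer, as in the oops above.
 */
static int model_attach_tracker(struct model_net *net, int cpt)
{
	int rc = 0;

	pthread_mutex_lock(&net->lock);
	if (!net->running || net->resendqs == NULL)
		rc = -ESHUTDOWN;
	else if (cpt < 0 || cpt >= net->nqueues)
		rc = -EINVAL;
	else
		net->resendqs[cpt]++; /* record one tracker on this queue */
	pthread_mutex_unlock(&net->lock);
	return rc;
}
```

In this model, a call to model_attach_tracker() that arrives after model_monitor_stop() returns -ESHUTDOWN rather than crashing. Whether the real fix guards the pointer, reorders thread shutdown, or does something else is not stated in this ticket.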
Issue Links
- is related to LU-17103 sanity-lnet test_207: timed out (Resolved)