[LU-13929] NULL pointer dereference in lnet_post_send_locked
Created: 26/Aug/20  Updated: 26/Aug/22  Resolved: 26/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: Lustre 2.15.0

Type: Bug
Priority: Critical
Reporter: Olaf Faaland
Assignee: Serguei Smirnov
Resolution: Fixed
Votes: 0
Labels: llnl
Environment:

kmod-lustre-2.12.4_5.chaos
3.10.0-1127.0.0.1chaos.ch6.x86_64


Severity: 3

 Description   

During an lnet shutdown, the router crashes with the following stack:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: [<ffffffffc0d43bda>] lnet_post_send_locked+0x7a/0xa40 [lnet]
PGD 0 
Oops: 0000 [#1] SMP 
CPU: 1 PID: 19931 Comm: kiblnd_sd_00_02 Kdump: loaded Tainted: G        W  OE  ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1
Hardware name: Cray Inc. SERVER-1824X-GN/S2600GZ, BIOS SE5C600.86B.02.06.0002.101320150901 10/13/2015
task: ffff8e581c0562a0 ti: ffff8e601d554000 task.ti: ffff8e601d554000
RIP: 0010:[<ffffffffc0d43bda>]  [<ffffffffc0d43bda>] lnet_post_send_locked+0x7a/0xa40 [lnet]
RSP: 0018:ffff8e601d557c50  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8e5f8666fc00 RCX: dead000000000200
RDX: ffff8e5eb083e810 RSI: 0000000000000001 RDI: ffff8e5f8666fc00
RBP: ffff8e601d557cb0 R08: ffff8e5f8666fc10 R09: ffff8e5174561070
R10: 0000000000000008 R11: ffff8e58154e48b8 R12: 0000000000000000
R13: ffff8e6017c3a800 R14: ffff8e58170fa100 R15: ffff8e57f5a94000
FS:  0000000000000000(0000) GS:ffff8e581ea40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000038 CR3: 000000101ae34000 CR4: 00000000000607e0
Call Trace:
 [<ffffffffc0d46238>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
 [<ffffffffc0d3895c>] lnet_msg_decommit+0xec/0x700 [lnet]
 [<ffffffffc0d39a2c>] lnet_finalize+0x34c/0xd40 [lnet]
 [<ffffffffc0aa875d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
 [<ffffffffc0ab3d19>] kiblnd_scheduler+0x8c9/0x1160 [ko2iblnd]
 [<ffffffffaac2d59e>] ? __switch_to+0xce/0x5a0
 [<ffffffffaace29b0>] ? wake_up_state+0x20/0x20
 [<ffffffffc0ab3450>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
 [<ffffffffaaccca01>] kthread+0xd1/0xe0
 [<ffffffffaaccc930>] ? insert_kthread_work+0x40/0x40
 [<ffffffffab3bff77>] ret_from_fork_nospec_begin+0x21/0x21
 [<ffffffffaaccc930>] ? insert_kthread_work+0x40/0x40
Code: 45 c0 49 8b 45 50 4e 8b 34 e0 0f 85 8e 08 00 00 80 7b 72 00 0f 88 52 08 00 00 f6 43 6d 01 0f 84 b6 08 00 00 48 8b 05 56 74 03 00 <48> 8b 40 38 49 39 87 20 01 00 00 0f 84 fe 07 00 00 8b 7b 28 85 
RIP  [<ffffffffc0d43bda>] lnet_post_send_locked+0x7a/0xa40 [lnet]
 RSP <ffff8e601d557c50>
CR2: 0000000000000038
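
The faulting address can be read off the oops itself: RAX is 0, and the marked bytes in the Code: line, <48> 8b 40 38, decode to mov 0x38(%rax),%rax, so the access lands at 0 + 0x38, which matches CR2. The sketch below only illustrates that arithmetic. struct sketch_ni and sketch_fault_address() are hypothetical names, and the padding is an assumption chosen to put a member at offset 0x38; the real structure layout is not shown in this ticket.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the structure being read. The real layout
 * is not shown in this ticket; the padding is chosen so that the
 * member of interest sits at offset 0x38. */
struct sketch_ni {
	unsigned char pad[0x38];	/* placeholder for preceding members */
	uint64_t      nid;		/* member assumed to be at offset 0x38 */
};

/* Address the CPU touches when reading ->nid through pointer 'base'. */
static uintptr_t sketch_fault_address(uintptr_t base)
{
	return base + offsetof(struct sketch_ni, nid);
}
```

With base set to 0 (a NULL pointer), the result is 0x38, the value reported in CR2.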


 Comments   
Comment by Olaf Faaland [ 26/Aug/20 ]

The relevant code:

(gdb) l *(lnet_post_send_locked+0x7a)
0x1cc0a is in lnet_post_send_locked (/usr/src/debug/lustre-2.12.4_5.chaos/lnet/lnet/lib-move.c:942).
937		/* non-lnet_send() callers have checked before */
938		LASSERT(!do_send || msg->msg_tx_delayed);
939		LASSERT(!msg->msg_receiving);
940		LASSERT(msg->msg_tx_committed);
941		/* can't get here if we're sending to the loopback interface */
942		LASSERT(lp->lpni_nid != the_lnet.ln_loni->ni_nid);
943	
944		/* NB 'lp' is always the next hop */
945		if ((msg->msg_target.pid & LNET_PID_USERFLAG) == 0 &&
946		    lnet_peer_alive_locked(ni, lp, msg) == 0) {
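
Line 942 dereferences the_lnet.ln_loni inside the assertion. One reading consistent with RAX being 0 in the oops is that the loopback NI pointer is already NULL while LNet is shutting down. The following is a minimal sketch of a NULL-safe form of that comparison, not the merged patch. The types struct sketch_ni and struct sketch_lnet and the helper sketch_is_loopback_nid() are hypothetical, reduced stand-ins for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-ins for the LNet structures. */
struct sketch_ni {
	uint64_t ni_nid;		/* NID of this interface */
};

struct sketch_lnet {
	struct sketch_ni *ln_loni;	/* loopback NI; assumed NULL during shutdown */
};

/*
 * Return true when 'nid' is the loopback NID.
 * If the loopback NI is already gone (NULL), nothing can match it,
 * so return false without dereferencing the pointer.
 */
static bool sketch_is_loopback_nid(const struct sketch_lnet *lnet, uint64_t nid)
{
	if (lnet->ln_loni == NULL)
		return false;

	return nid == lnet->ln_loni->ni_nid;
}
```

An assertion built on such a helper would check that the next hop is not the loopback NID without faulting when the loopback NI has been torn down.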

Comment by Olaf Faaland [ 26/Aug/20 ]

This problem is not currently a priority for us. Just posted so there's a record in case others run into it.

Comment by Peter Jones [ 27/Aug/20 ]

Serguei

Could you please investigate?

Thanks

Peter

Comment by Matt Rásó-Barnett (Inactive) [ 13/Nov/20 ]

Thanks for logging this. I just hit this today on 2.12.5 in exactly the same situation, during an lnet shutdown on our LNET routers.

kernel-3.10.0-1127.19.1.el7.x86_64
lustre-client-2.12.5-1.el7.x86_64

Comment by Gerrit Updater [ 25/Nov/20 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40749
Subject: LU-13929 lnet: replace assertion in lnet_post_send_locked
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 00ef477710f541d31c5a21fa2d4ec9cf0f45ff14

Comment by Gerrit Updater [ 25/Nov/20 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40759
Subject: LU-13929 lnet: modify assertion in lnet_post_send_locked
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cc623ea3f49633cfaafa0b8f044c3b28cd664b4f

Comment by Gerrit Updater [ 26/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40749/
Subject: LU-13929 lnet: modify assertion in lnet_post_send_locked
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e5a8f3fc12840aee97fca03d76b1ae9b4572acb8

Comment by Peter Jones [ 26/Feb/21 ]

Landed for 2.15
