[LU-13929] NULL pointer dereference in lnet_post_send_locked Created: 26/Aug/20 Updated: 26/Aug/22 Resolved: 26/Feb/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
kmod-lustre-2.12.4_5.chaos |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During an lnet shutdown, the router crashes with the following stack:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: [<ffffffffc0d43bda>] lnet_post_send_locked+0x7a/0xa40 [lnet]
PGD 0
Oops: 0000 [#1] SMP
CPU: 1 PID: 19931 Comm: kiblnd_sd_00_02 Kdump: loaded Tainted: G W OE ------------ T 3.10.0-1127.0.0.1chaos.ch6.x86_64 #1
Hardware name: Cray Inc. SERVER-1824X-GN/S2600GZ, BIOS SE5C600.86B.02.06.0002.101320150901 10/13/2015
task: ffff8e581c0562a0 ti: ffff8e601d554000 task.ti: ffff8e601d554000
RIP: 0010:[<ffffffffc0d43bda>] [<ffffffffc0d43bda>] lnet_post_send_locked+0x7a/0xa40 [lnet]
RSP: 0018:ffff8e601d557c50 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8e5f8666fc00 RCX: dead000000000200
RDX: ffff8e5eb083e810 RSI: 0000000000000001 RDI: ffff8e5f8666fc00
RBP: ffff8e601d557cb0 R08: ffff8e5f8666fc10 R09: ffff8e5174561070
R10: 0000000000000008 R11: ffff8e58154e48b8 R12: 0000000000000000
R13: ffff8e6017c3a800 R14: ffff8e58170fa100 R15: ffff8e57f5a94000
FS: 0000000000000000(0000) GS:ffff8e581ea40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000038 CR3: 000000101ae34000 CR4: 00000000000607e0
Call Trace:
[<ffffffffc0d46238>] lnet_return_tx_credits_locked+0x238/0x4a0 [lnet]
[<ffffffffc0d3895c>] lnet_msg_decommit+0xec/0x700 [lnet]
[<ffffffffc0d39a2c>] lnet_finalize+0x34c/0xd40 [lnet]
[<ffffffffc0aa875d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[<ffffffffc0ab3d19>] kiblnd_scheduler+0x8c9/0x1160 [ko2iblnd]
[<ffffffffaac2d59e>] ? __switch_to+0xce/0x5a0
[<ffffffffaace29b0>] ? wake_up_state+0x20/0x20
[<ffffffffc0ab3450>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
[<ffffffffaaccca01>] kthread+0xd1/0xe0
[<ffffffffaaccc930>] ? insert_kthread_work+0x40/0x40
[<ffffffffab3bff77>] ret_from_fork_nospec_begin+0x21/0x21
[<ffffffffaaccc930>] ? insert_kthread_work+0x40/0x40
Code: 45 c0 49 8b 45 50 4e 8b 34 e0 0f 85 8e 08 00 00 80 7b 72 00 0f 88 52 08 00 00 f6 43 6d 01 0f 84 b6 08 00 00 48 8b 05 56 74 03 00 <48> 8b 40 38 49 39 87 20 01 00 00 0f 84 fe 07 00 00 8b 7b 28 85
RIP [<ffffffffc0d43bda>] lnet_post_send_locked+0x7a/0xa40 [lnet]
RSP <ffff8e601d557c50>
CR2: 0000000000000038 |
| Comments |
| Comment by Olaf Faaland [ 26/Aug/20 ] |
|
The relevant code:
(gdb) l *(lnet_post_send_locked+0x7a)
0x1cc0a is in lnet_post_send_locked (/usr/src/debug/lustre-2.12.4_5.chaos/lnet/lnet/lib-move.c:942).
937 /* non-lnet_send() callers have checked before */
938 LASSERT(!do_send || msg->msg_tx_delayed);
939 LASSERT(!msg->msg_receiving);
940 LASSERT(msg->msg_tx_committed);
941 /* can't get here if we're sending to the loopback interface */
942 LASSERT(lp->lpni_nid != the_lnet.ln_loni->ni_nid);
943
944 /* NB 'lp' is always the next hop */
945 if ((msg->msg_target.pid & LNET_PID_USERFLAG) == 0 &&
946 lnet_peer_alive_locked(ni, lp, msg) == 0) {
|
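One possible reading of the oops, offered as an assumption rather than a confirmed diagnosis: RAX is 0 and CR2 is 0x38, which is what reading a field at offset 0x38 through a NULL pointer would produce. That would fit the LASSERT at lib-move.c:942 if the_lnet.ln_loni had already been cleared by the shutdown path. The sketch below illustrates only that arithmetic. struct sketch_ni, sketch_nid_address() and sketch_is_loopback() are hypothetical names, the 0x38 padding is chosen to match the fault address and is not taken from the real struct lnet_ni, and the NULL check is not a description of the merged patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct lnet_ni. The padding is picked only so
 * that ni_nid sits at offset 0x38, matching CR2 in the oops; the real
 * structure layout is not reproduced here. */
struct sketch_ni {
	char     pad[0x38];
	uint64_t ni_nid;
};

/* Address that a read of ni->ni_nid would touch, computed without
 * dereferencing the pointer. With ni == NULL this is the field offset. */
static uintptr_t sketch_nid_address(const struct sketch_ni *ni)
{
	return (uintptr_t)ni + offsetof(struct sketch_ni, ni_nid);
}

/* NULL-tolerant version of the loopback comparison: a missing loopback
 * interface is treated as "not loopback" instead of being dereferenced. */
static int sketch_is_loopback(const struct sketch_ni *loni, uint64_t peer_nid)
{
	return loni != NULL && loni->ni_nid == peer_nid;
}
```

With a NULL loopback pointer, sketch_nid_address() yields 0x38, the same value reported in CR2, while sketch_is_loopback() returns 0 without touching memory.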
| Comment by Olaf Faaland [ 26/Aug/20 ] |
|
This problem is not currently a priority for us. Just posted so there's a record in case others run into it. |
| Comment by Peter Jones [ 27/Aug/20 ] |
|
Serguei, could you please investigate? Thanks, Peter |
| Comment by Matt Rásó-Barnett (Inactive) [ 13/Nov/20 ] |
|
Thanks for logging this. I just hit this today on 2.12.5 in the exact same situation, during an lnet shutdown on our LNET routers. Kernel: kernel-3.10.0-1127.19.1.el7.x86_64 |
| Comment by Gerrit Updater [ 25/Nov/20 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40749 |
| Comment by Gerrit Updater [ 25/Nov/20 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40759 |
| Comment by Gerrit Updater [ 26/Feb/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40749/ |
| Comment by Peter Jones [ 26/Feb/21 ] |
|
Landed for 2.15 |