[LU-5271] NULL pointer dereference in kiblnd_tx_complete Created: 28/Jun/14 Updated: 23/Jul/14 Resolved: 23/Jul/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Matt Ezell | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 6 2.6.32-358.23.2.el6.x86_64 Lustre 2.4.2 clients (with a couple patches). Mellanox OFED. Servers are 2.4.3 with patches. Routed environment. |
||
| Severity: | 3 |
| Rank (Obsolete): | 14707 |
| Description |
|
We are having clients crash on one of our clusters. We took a maintenance to add additional routers to the cluster. The BIOS, BMC, and FCB firmware versions were upgraded. Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: IP: [<ffffffffa0519711>] kiblnd_tx_complete+0x21/0x380 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: PGD 0 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Oops: 0000 [#1] SMP Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: last sysfs file: /sys/kernel/kexec_crash_loaded Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CPU 9 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ipmi_devintf ko2iblnd(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) netconsole configfs nfsd exportfs autofs4 8021q garp stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables rdma_ucm(U) ib_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_en(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) mlx4_core(U) compat(U) dm_mirror dm_region_hash dm_log dm_mod uinput wmi acpi_pad dcdbas microcode ahci sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc igb dca ptp pps_core [last unloaded: scsi_wait_scan] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Pid: 5159, comm: kiblnd_sd_02_02 Not tainted 2.6.32-358.23.2.el6.x86_64 #1 Dell Inc. PowerEdge C6220/03C9JJ Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RIP: 0010:[<ffffffffa0519711>] [<ffffffffa0519711>] kiblnd_tx_complete+0x21/0x380 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RSP: 0018:ffff881064345df0 EFLAGS: 00010286 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RAX: 0000000000000000 RBX: ffff881055cdf600 RCX: 0000000000000000 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RDX: ffff881070ff6640 RSI: 0000000000000005 RDI: 0000000000000000 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RBP: ffff881064345e40 R08: 0000000000000001 R09: 0000000000000001 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: R10: 0000000000000000 R11: 0000000000000400 R12: ffff881070ff6640 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: R13: ffff881055cdf628 R14: 0000000000000001 R15: 0000000000000202 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: FS: 00007f93c032d700(0000) GS:ffff88089c420000(0000) knlGS:0000000000000000 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CR2: 0000000000000020 CR3: 0000000001a85000 CR4: 00000000000407e0 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Process kiblnd_sd_02_02 (pid: 5159, threadinfo ffff881064344000, task ffff881067eda080) Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Stack: Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: 0000000000000001 0000000000000082 ffff881064345e40 ffffffff81055ad3 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: <d> 0000000164345e40 ffff881055cdf600 ffff881070ff6640 ffff881055cdf628 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: <d> 0000000000000001 0000000000000202 ffff881064345e50 ffffffffa051b4ae Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Call Trace: Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b4ae>] kiblnd_complete+0x2e/0xe0 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b8a8>] kiblnd_scheduler+0x348/0x7f0 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff81063990>] ? default_wake_function+0x0/0x20 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b560>] ? kiblnd_scheduler+0x0/0x7f0 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b560>] ? kiblnd_scheduler+0x0/0x7f0 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b560>] ? kiblnd_scheduler+0x0/0x7f0 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Code: ff ff e9 f8 f8 ff ff 0f 1f 00 55 48 89 e5 48 83 ec 50 48 89 5d d8 4c 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 0f 1f 44 00 00 <66> 83 7f 20 00 48 8b 5f 18 49 89 fc 0f 8e e5 02 00 00 45 31 ed Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RIP [<ffffffffa0519711>] kiblnd_tx_complete+0x21/0x380 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RSP <ffff881064345df0> Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CR2: 0000000000000020 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: ---[ end trace cbe30c69da425fe8 ]--- Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Kernel panic - not syncing: Fatal exception Unfortunately we haven't gotten a kdump for this yet (we have an admin working on it). I tracked this back to the offending line (unfortunately this was on a different node that might have a slightly different Lustre version, so it's possible, but unlikely, that the symbols don't match up) (gdb) disassemble kiblnd_tx_complete Dump of assembler code for function kiblnd_tx_complete: 0x0000000000013750 <+0>: push %rbp 0x0000000000013751 <+1>: mov %rsp,%rbp 0x0000000000013754 <+4>: sub $0x50,%rsp 0x0000000000013758 <+8>: mov %rbx,-0x28(%rbp) 0x000000000001375c <+12>: mov %r12,-0x20(%rbp) 0x0000000000013760 <+16>: mov %r13,-0x18(%rbp) 0x0000000000013764 <+20>: mov %r14,-0x10(%rbp) 0x0000000000013768 <+24>: mov %r15,-0x8(%rbp) 0x000000000001376c <+28>: callq 0x13771 <kiblnd_tx_complete+33> 0x0000000000013771 <+33>: cmpw $0x0,0x20(%rdi) 0x0000000000013776 <+38>: mov 0x18(%rdi),%rbx 0x000000000001377a <+42>: mov %rdi,%r12 0x000000000001377d <+45>: jle 0x13a68 <kiblnd_tx_complete+792> 0x0000000000013783 <+51>: xor %r13d,%r13d (gdb) list *0x0000000000013771
0x13771 is in kiblnd_tx_complete (/usr/src/debug/lustre-2.4.3/lnet/klnds/o2iblnd/o2iblnd_cb.c:1003).
warning: Source file is more recent than executable.
998 {
999 int failed = (status != IB_WC_SUCCESS);
1000 kib_conn_t *conn = tx->tx_conn;
1001 int idle;
1002
1003 LASSERT (tx->tx_sending > 0);
1004
1005 if (failed) {
1006 if (conn->ibc_state == IBLND_CONN_ESTABLISHED)
1007 CNETERR("Tx -> %s cookie "LPX64
'tx' appears to be NULL. It comes in as an argument to the function: void kiblnd_tx_complete (kib_tx_t *tx, int status)
called as kiblnd_tx_complete(kiblnd_wreqid2ptr(wc->wr_id), wc->status); kiblnd_wreq2ptr is a simple wrapper that masks out the last 2 bits: static inline void * kiblnd_wreqid2ptr (__u64 wreqid) { return (void *)(((unsigned long)wreqid) & ~IBLND_WID_MASK); } Somehow it seems that wc->wr_id is zero. I'm going to keep looking, but we'd appreciate it if someone with more familiarity with ko2iblnd and OFED could take a look. |
| Comments |
| Comment by Peter Jones [ 28/Jun/14 ] |
|
Liang Could you please comment? Thanks Peter |
| Comment by Matt Ezell [ 29/Jun/14 ] |
|
I have uploaded a tarball containing a vmcore to /uploads/ The node didn't have kernel-debuginfo installed, but I found the correct RPM in on our local mirror. I extracted the vmlinux file, but crash doesn't seem to want to open it. If you have any problems, let me know. Thanks. |
| Comment by Matt Ezell [ 29/Jun/14 ] |
|
I had the wrong vmlinux file. I uploaded /uploads/ |
| Comment by Matt Ezell [ 29/Jun/14 ] |
|
We also saw a single: Jun 28 19:23:45 rhea262.ccs.ornl.gov kernel: LNetError: 5452:0:(o2iblnd_cb.c:3258:kiblnd_complete()) LBUG due to the switch (kiblnd_wreqid2type(wc->wr_id))
not matching and falling to the default case, which LBUGs. |
| Comment by Matt Ezell [ 29/Jun/14 ] |
crash> xbt
#0
mod NONE, name schedule, RIP 0xffffffff8150e172
frame start 0xffff881047669d80, end 0xffff881047669dc8, *base 0xffffffffffffffff
xbt: cannot find debuginfo for module 'NONE'
#1
mod ko2iblnd, name kiblnd_tx_complete, RIP 0xffffffffa0521711
frame start 0xffff881047669dc8, end 0xffff881047669e48, *base 0xffff881047669e50
#2
mod ko2iblnd, name kiblnd_complete, RIP 0xffffffffa05234ae
frame start 0xffff881047669e48, end 0xffff881047669e58, *base 0xffff881047669f40
XBT_RBX = ffff880825098200
XBT_R12 = ffff880822cb6cc0
XBT_R13 = ffff880825098228
XBT_R14 = 1
XBT_R15 = 202
#3
mod ko2iblnd, name kiblnd_scheduler, RIP 0xffffffffa05238a8
frame start 0xffff881047669e58, end 0xffff881047669f48, *base 0xffff880825f39be8
sched = ffff880822cb6ca8
&wait = ffff881047669eb8
wait = 0 ...
&wc = ffff881047669e78
wc = 100000000 ...
&name = ffff881047669ee8
name = ffff881047669f48 ...
did_something = 0
&busy_loops = ffff881047669e74
busy_loops = ffff8808 ...
wc = ffff881047669e78
num_entries = 1
subsystem = 800
mask = 200
I don't have a debug version of the Mellanox IB modules, but I pulled in the inkernel ib_core debug module. I don't expect ib_wc or many of the constants to be any different. crash> struct ib_wc ffff881047669e78
struct ib_wc {
wr_id = 4294967296,
status = IB_WC_SUCCESS,
opcode = IB_WC_SEND,
vendor_err = 5,
byte_len = 0,
qp = 0xb8000000f9,
ex = {
imm_data = 1872179200,
invalidate_rkey = 1872179200
},
src_qp = 4294936592,
wc_flags = 0,
pkey_index = 53505,
slid = 0,
sl = 0 '\000',
dlid_path_bits = 0 '\000',
port_num = 0 '\000'
}
|
| Comment by Liang Zhen (Inactive) [ 30/Jun/14 ] |
|
Hi Matt, thanks for your information, this is strange, especially in the second crash, which hit this: it means low bits of wr_id is 3, but ko2iblnd will never set low bits of wr_id to 3 even something wrong happened, so I suspect OFED returned invalid wr_id for unknown reason. Also, In your previous comment: wr_id = 4294967296, status = IB_WC_SUCCESS, opcode = IB_WC_SEND, vendor_err = 5, byte_len = 0, qp = 0xb8000000f9, A few findings from this structure:
A few more questions may be helpful:
|
| Comment by Matt Ezell [ 30/Jun/14 ] |
|
We'll get a question out to Mellanox about vendor error 5. If I had to guess from the source code, I'd say it was MLX5_CQE_SYNDROME_WR_FLUSH_ERR, but that should set the status to IB_WC_WR_FLUSH_ERR. Not directly related to this issue, but I wonder, if the status is NOT IB_WC_SUCCESS, can we assume that the wr_id is valid? Is it safe to dereference tx in the LASSERT before checking 'failed'? |
| Comment by Matt Ezell [ 30/Jun/14 ] |
|
| Comment by Matt Ezell [ 01/Jul/14 ] |
|
According to Mellanox:
|
| Comment by Liang Zhen (Inactive) [ 01/Jul/14 ] |
|
Hi Matt, thanks for getting this information. opcode = cqe64->op_own >> 4;
switch (opcode) {
case MLX5_CQE_REQ:
...
case MLX5_CQE_RESP_WR_IMM:
....
case MLX5_CQE_RESP_ERR:
...
}
ib_wc:wr_id should always be set in this "switch", however, there is no "default" case in this switch, which means if opcode is an unknown value for some reasons (needs to confirm with Mellanox if it is possible), then ib_wc::wr_id will not be set at all. At the meanwhile, from description of ib_poll_cq, ib_wc::wr_id is supposed to be valid if ib_poll_cq returned +ve, even when the completion is failed (ib_wc::status != IB_WC_SUCCESS), and this is the reason that there is no check for it in o2iblnd. If this truly happened, then kiblnd_complete will use random number from stack as pointer of tx/rx, which can explain these different crashes. I posted a debug patch for this (http://review.whamcloud.com/10912), it could be helpful if you can try with it, if the crash is replaced by the new assertion failure added by this patch, then we can verify my conjecture, and we only need to change the LASSERTF to a CWARN as solution. |
| Comment by Matt Ezell [ 01/Jul/14 ] |
|
Thanks for the analysis, Liang. Since 'wc' is initialized on the stack and reused in the loop, wouldn't we expect valid-but-stale pointers (from the previous completion event) if the IB driver wasn't setting the wr_id? It seems like only the first iteration might have "random" data. Since we don't get this crash immediately after boot, I highly doubt it's during the first iteration, even with multiple kiblnd_scheduler threads running. |
| Comment by Liang Zhen (Inactive) [ 02/Jul/14 ] |
|
Hi Matt, o2iblnd will hash different NIDs to different scheduler threads, which means it is possible that a thread is scheduled only after adding some new peers (e.g. new routers). It may be difficult to try, but if possible, can this issue be reproduced when w/o those new routers? Anyway, I agree with you, most likely, it should be a stale pointer if this truly happened, it can actually explain what happened in I also added more debug information into my patch, it can also provide some helpful information even if my assumption is wrong. Another question is, you said roll back BIOS can avoid this issue? if so, it more like a low level stack bug to me. |
| Comment by Matt Ezell [ 03/Jul/14 ] |
|
When we went back to using just the original routers, the problem went away. We have not yet determined the difference between the old and new routers - BIOS version, IB card model, IB card firmware, etc are all the same. They boot the same NFSRoot image, so the OFED version and Kernel version are exactly the same. We are looking into the fabric to see if the new routers got assigned weird routes, possibly through older switches that only support 2K MTU (we've seen this before). We have a small pool of compute nodes set aside to test with just the new routers, and attempt to narrow down the specific problem. Next week we will try your debug patch, as long as this small testbed is still having problems. Also, this is using mlx4, not mlx5 as I stated above. Sorry for any confusion. Thanks again for your help, Liang. |
| Comment by James Nunez (Inactive) [ 23/Jul/14 ] |
|
Matt, Do you have new information from the debug patch or should we close this ticket since you believe that this issue was caused by hardware? Thanks, |
| Comment by Matt Ezell [ 23/Jul/14 ] |
|
At this point we believe it was a hardware issue. Let's close it. If we get a chance to try to debug patch and find anything useful, we'll report back here. Thanks. |