[LU-5271] NULL pointer dereference in kiblnd_tx_complete Created: 28/Jun/14  Updated: 23/Jul/14  Resolved: 23/Jul/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Matt Ezell Assignee: Liang Zhen (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

RHEL 6 2.6.32-358.23.2.el6.x86_64 Lustre 2.4.2 clients (with a couple patches). Mellanox OFED. Servers are 2.4.3 with patches. Routed environment.


Severity: 3
Rank (Obsolete): 14707

 Description   

We are having clients crash on one of our clusters. We took a maintenance to add additional routers to the cluster. The BIOS, BMC, and FCB firmware versions were upgraded.

Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: IP: [<ffffffffa0519711>] kiblnd_tx_complete+0x21/0x380 [ko2iblnd]
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: PGD 0 
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Oops: 0000 [#1] SMP 
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: last sysfs file: /sys/kernel/kexec_crash_loaded
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CPU 9 
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ipmi_devintf ko2iblnd(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) netconsole configfs nfsd exportfs autofs4 8021q garp stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables rdma_ucm(U) ib_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_en(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) mlx4_core(U) compat(U) dm_mirror dm_region_hash dm_log dm_mod uinput wmi acpi_pad dcdbas microcode ahci sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc igb dca ptp pps_core [last unloaded: scsi_wait_scan]
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: 
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Pid: 5159, comm: kiblnd_sd_02_02 Not tainted 2.6.32-358.23.2.el6.x86_64 #1 Dell Inc. PowerEdge C6220/03C9JJ
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RIP: 0010:[<ffffffffa0519711>]  [<ffffffffa0519711>] kiblnd_tx_complete+0x21/0x380 [ko2iblnd]
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RSP: 0018:ffff881064345df0  EFLAGS: 00010286
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RAX: 0000000000000000 RBX: ffff881055cdf600 RCX: 0000000000000000
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RDX: ffff881070ff6640 RSI: 0000000000000005 RDI: 0000000000000000
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RBP: ffff881064345e40 R08: 0000000000000001 R09: 0000000000000001
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: R10: 0000000000000000 R11: 0000000000000400 R12: ffff881070ff6640
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: R13: ffff881055cdf628 R14: 0000000000000001 R15: 0000000000000202
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: FS:  00007f93c032d700(0000) GS:ffff88089c420000(0000) knlGS:0000000000000000
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CR2: 0000000000000020 CR3: 0000000001a85000 CR4: 00000000000407e0
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Process kiblnd_sd_02_02 (pid: 5159, threadinfo ffff881064344000, task ffff881067eda080)
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Stack:
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: 0000000000000001 0000000000000082 ffff881064345e40 ffffffff81055ad3
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: <d> 0000000164345e40 ffff881055cdf600 ffff881070ff6640 ffff881055cdf628
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: <d> 0000000000000001 0000000000000202 ffff881064345e50 ffffffffa051b4ae
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Call Trace:
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b4ae>] kiblnd_complete+0x2e/0xe0 [ko2iblnd]
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b8a8>] kiblnd_scheduler+0x348/0x7f0 [ko2iblnd]
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b560>] ? kiblnd_scheduler+0x0/0x7f0 [ko2iblnd]
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b560>] ? kiblnd_scheduler+0x0/0x7f0 [ko2iblnd]
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b560>] ? kiblnd_scheduler+0x0/0x7f0 [ko2iblnd]
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Code: ff ff e9 f8 f8 ff ff 0f 1f 00 55 48 89 e5 48 83 ec 50 48 89 5d d8 4c 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 0f 1f 44 00 00 <66> 83 7f 20 00 48 8b 5f 18 49 89 fc 0f 8e e5 02 00 00 45 31 ed 
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RIP  [<ffffffffa0519711>] kiblnd_tx_complete+0x21/0x380 [ko2iblnd]
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RSP <ffff881064345df0>
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CR2: 0000000000000020
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: ---[ end trace cbe30c69da425fe8 ]---
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Kernel panic - not syncing: Fatal exception

Unfortunately we haven't gotten a kdump for this yet (we have an admin working on it). I tracked this back to the offending line (unfortunately this was on a different node that might have a slightly different Lustre version, so it's possible, but unlikely, that the symbols don't match up)

(gdb) disassemble kiblnd_tx_complete
Dump of assembler code for function kiblnd_tx_complete:
   0x0000000000013750 <+0>:     push   %rbp
   0x0000000000013751 <+1>:     mov    %rsp,%rbp 
   0x0000000000013754 <+4>:     sub    $0x50,%rsp
   0x0000000000013758 <+8>:     mov    %rbx,-0x28(%rbp)
   0x000000000001375c <+12>:    mov    %r12,-0x20(%rbp)
   0x0000000000013760 <+16>:    mov    %r13,-0x18(%rbp)
   0x0000000000013764 <+20>:    mov    %r14,-0x10(%rbp)
   0x0000000000013768 <+24>:    mov    %r15,-0x8(%rbp)
   0x000000000001376c <+28>:    callq  0x13771 <kiblnd_tx_complete+33>
   0x0000000000013771 <+33>:    cmpw   $0x0,0x20(%rdi)
   0x0000000000013776 <+38>:    mov    0x18(%rdi),%rbx
   0x000000000001377a <+42>:    mov    %rdi,%r12 
   0x000000000001377d <+45>:    jle    0x13a68 <kiblnd_tx_complete+792>
   0x0000000000013783 <+51>:    xor    %r13d,%r13d
(gdb) list *0x0000000000013771
0x13771 is in kiblnd_tx_complete (/usr/src/debug/lustre-2.4.3/lnet/klnds/o2iblnd/o2iblnd_cb.c:1003).
warning: Source file is more recent than executable.
998     {
999             int           failed = (status != IB_WC_SUCCESS);
1000            kib_conn_t   *conn = tx->tx_conn;
1001            int           idle;
1002
1003            LASSERT (tx->tx_sending > 0);
1004
1005            if (failed) {
1006                    if (conn->ibc_state == IBLND_CONN_ESTABLISHED)
1007                            CNETERR("Tx -> %s cookie "LPX64

'tx' appears to be NULL. It comes in as an argument to the function:

void kiblnd_tx_complete (kib_tx_t *tx, int status)

called as

kiblnd_tx_complete(kiblnd_wreqid2ptr(wc->wr_id), wc->status);

kiblnd_wreq2ptr is a simple wrapper that masks out the last 2 bits:

static inline void *
kiblnd_wreqid2ptr (__u64 wreqid)
{
        return (void *)(((unsigned long)wreqid) & ~IBLND_WID_MASK);
}

Somehow it seems that wc->wr_id is zero. I'm going to keep looking, but we'd appreciate it if someone with more familiarity with ko2iblnd and OFED could take a look.



 Comments   
Comment by Peter Jones [ 28/Jun/14 ]

Liang

Could you please comment?

Thanks

Peter

Comment by Matt Ezell [ 29/Jun/14 ]

I have uploaded a tarball containing a vmcore to /uploads/LU-5271/crash1.tar.gz

The node didn't have kernel-debuginfo installed, but I found the correct RPM in on our local mirror. I extracted the vmlinux file, but crash doesn't seem to want to open it. If you have any problems, let me know. Thanks.

Comment by Matt Ezell [ 29/Jun/14 ]

I had the wrong vmlinux file. I uploaded /uploads/LU-5271/crash1_corrected.tar.gz with the correct one.

Comment by Matt Ezell [ 29/Jun/14 ]

We also saw a single:

Jun 28 19:23:45 rhea262.ccs.ornl.gov kernel: LNetError: 5452:0:(o2iblnd_cb.c:3258:kiblnd_complete()) LBUG

due to the

switch (kiblnd_wreqid2type(wc->wr_id))

not matching and falling to the default case, which LBUGs.

Comment by Matt Ezell [ 29/Jun/14 ]
crash> xbt
#0
        mod NONE, name schedule, RIP 0xffffffff8150e172
        frame start 0xffff881047669d80, end 0xffff881047669dc8, *base 0xffffffffffffffff
xbt: cannot find debuginfo for module 'NONE'

#1
        mod ko2iblnd, name kiblnd_tx_complete, RIP 0xffffffffa0521711
        frame start 0xffff881047669dc8, end 0xffff881047669e48, *base 0xffff881047669e50

#2
        mod ko2iblnd, name kiblnd_complete, RIP 0xffffffffa05234ae
        frame start 0xffff881047669e48, end 0xffff881047669e58, *base 0xffff881047669f40
        XBT_RBX = ffff880825098200
        XBT_R12 = ffff880822cb6cc0
        XBT_R13 = ffff880825098228
        XBT_R14 = 1
        XBT_R15 = 202

#3
        mod ko2iblnd, name kiblnd_scheduler, RIP 0xffffffffa05238a8
        frame start 0xffff881047669e58, end 0xffff881047669f48, *base 0xffff880825f39be8
        sched = ffff880822cb6ca8
        &wait = ffff881047669eb8
        wait = 0 ...
        &wc = ffff881047669e78
        wc = 100000000 ...
        &name = ffff881047669ee8
        name = ffff881047669f48 ...
        did_something = 0
        &busy_loops = ffff881047669e74
        busy_loops = ffff8808 ...
        wc = ffff881047669e78
        num_entries = 1
        subsystem = 800
        mask = 200

I don't have a debug version of the Mellanox IB modules, but I pulled in the inkernel ib_core debug module. I don't expect ib_wc or many of the constants to be any different.

crash> struct ib_wc ffff881047669e78
struct ib_wc {
  wr_id = 4294967296, 
  status = IB_WC_SUCCESS, 
  opcode = IB_WC_SEND, 
  vendor_err = 5, 
  byte_len = 0, 
  qp = 0xb8000000f9, 
  ex = {
    imm_data = 1872179200, 
    invalidate_rkey = 1872179200
  }, 
  src_qp = 4294936592, 
  wc_flags = 0, 
  pkey_index = 53505, 
  slid = 0, 
  sl = 0 '\000', 
  dlid_path_bits = 0 '\000', 
  port_num = 0 '\000'
}
Comment by Liang Zhen (Inactive) [ 30/Jun/14 ]

Hi Matt, thanks for your information, this is strange, especially in the second crash, which hit this:
LNetError: 5452:0:(o2iblnd_cb.c:3258:kiblnd_complete()) LBUG

it means low bits of wr_id is 3, but ko2iblnd will never set low bits of wr_id to 3 even something wrong happened, so I suspect OFED returned invalid wr_id for unknown reason.

Also, In your previous comment:

  wr_id = 4294967296, 
  status = IB_WC_SUCCESS, 
  opcode = IB_WC_SEND, 
  vendor_err = 5, 
  byte_len = 0, 
  qp = 0xb8000000f9, 

A few findings from this structure:

  • vendor_err is 5, I can't find out what it means, is there a way to ask Mellanox engineer about what vendor_err==5 is?
  • wr_id is 4294967296, which is 0x100000000, it's definitely not a valid wr_id because o2iblnd is setting wr_id to address+opcode
  • qp is 0xb8000000f9, it's an invalid pointer as well, so I think this proved ib_wc is a failed because of vendor_err, or just corrupted.

A few more questions may be helpful:

  • Which version of Mellanox OFED and which driver (mlx4?) are you using?
  • When do this issue start to happen, e.g. after upgrading lustre, kernel, firmware or ofed?
  • These clients hit this problem randomly or under some specific workload?
Comment by Matt Ezell [ 30/Jun/14 ]

We'll get a question out to Mellanox about vendor error 5. If I had to guess from the source code, I'd say it was MLX5_CQE_SYNDROME_WR_FLUSH_ERR, but that should set the status to IB_WC_WR_FLUSH_ERR.

Not directly related to this issue, but I wonder, if the status is NOT IB_WC_SUCCESS, can we assume that the wr_id is valid? Is it safe to dereference tx in the LASSERT before checking 'failed'?

Comment by Matt Ezell [ 30/Jun/14 ]
  • MLNX_OFED_LINUX-2.0-3.0.0-rhel6.4-x86_64.tgz using the mlx5 driver
  • Kernel, Lustre, and OFED versions have not changed. We added new compute nodes and routers, so the LNET config changed (to add new routers). We also flashed the BIOS to make sure all nodes (new and old) had the same version. We are running a test today where we roll back the BIOS to rule that out
  • The test that has been hitting it so far is just an IOR
Comment by Matt Ezell [ 01/Jul/14 ]

According to Mellanox:

When status is IB_WC_SUCCESS, vendor_err doesn't contain any valuable value and as it is not set by driver.

Comment by Liang Zhen (Inactive) [ 01/Jul/14 ]

Hi Matt, thanks for getting this information.
Now I suspect ib_poll_cq returned +ve but it did not return a valid ib_wc, at least I can see this possibility from source code of mlx5/4:
ib_poll_cq->mlx5_ib_poll_cq->mlx5_poll_one

        opcode = cqe64->op_own >> 4;
        switch (opcode) {
        case MLX5_CQE_REQ:
                 ...
        case MLX5_CQE_RESP_WR_IMM:
                 ....
        case MLX5_CQE_RESP_ERR:
                 ...
        }

ib_wc:wr_id should always be set in this "switch", however, there is no "default" case in this switch, which means if opcode is an unknown value for some reasons (needs to confirm with Mellanox if it is possible), then ib_wc::wr_id will not be set at all. At the meanwhile, from description of ib_poll_cq, ib_wc::wr_id is supposed to be valid if ib_poll_cq returned +ve, even when the completion is failed (ib_wc::status != IB_WC_SUCCESS), and this is the reason that there is no check for it in o2iblnd. If this truly happened, then kiblnd_complete will use random number from stack as pointer of tx/rx, which can explain these different crashes.

I posted a debug patch for this (http://review.whamcloud.com/10912), it could be helpful if you can try with it, if the crash is replaced by the new assertion failure added by this patch, then we can verify my conjecture, and we only need to change the LASSERTF to a CWARN as solution.

Comment by Matt Ezell [ 01/Jul/14 ]

Thanks for the analysis, Liang.

Since 'wc' is initialized on the stack and reused in the loop, wouldn't we expect valid-but-stale pointers (from the previous completion event) if the IB driver wasn't setting the wr_id? It seems like only the first iteration might have "random" data. Since we don't get this crash immediately after boot, I highly doubt it's during the first iteration, even with multiple kiblnd_scheduler threads running.

Comment by Liang Zhen (Inactive) [ 02/Jul/14 ]

Hi Matt, o2iblnd will hash different NIDs to different scheduler threads, which means it is possible that a thread is scheduled only after adding some new peers (e.g. new routers). It may be difficult to try, but if possible, can this issue be reproduced when w/o those new routers?

Anyway, I agree with you, most likely, it should be a stale pointer if this truly happened, it can actually explain what happened in LU-519, which only showed a couple of times in last a few years.
When we say stale pointer, it actually can be invalid pointer because o2iblnd can dynamically increase/shrink TX pool, which means if this happened, we may have memory corruption and run into very uncertain situation.

I also added more debug information into my patch, it can also provide some helpful information even if my assumption is wrong.

Another question is, you said roll back BIOS can avoid this issue? if so, it more like a low level stack bug to me.

Comment by Matt Ezell [ 03/Jul/14 ]

When we went back to using just the original routers, the problem went away. We have not yet determined the difference between the old and new routers - BIOS version, IB card model, IB card firmware, etc are all the same. They boot the same NFSRoot image, so the OFED version and Kernel version are exactly the same. We are looking into the fabric to see if the new routers got assigned weird routes, possibly through older switches that only support 2K MTU (we've seen this before). We have a small pool of compute nodes set aside to test with just the new routers, and attempt to narrow down the specific problem. Next week we will try your debug patch, as long as this small testbed is still having problems.

Also, this is using mlx4, not mlx5 as I stated above. Sorry for any confusion.

Thanks again for your help, Liang.

Comment by James Nunez (Inactive) [ 23/Jul/14 ]

Matt,

Do you have new information from the debug patch or should we close this ticket since you believe that this issue was caused by hardware?

Thanks,
James

Comment by Matt Ezell [ 23/Jul/14 ]

At this point we believe it was a hardware issue. Let's close it.

If we get a chance to try to debug patch and find anything useful, we'll report back here.

Thanks.

Generated at Sat Feb 10 01:50:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.