Details
-
Bug
-
Resolution: Not a Bug
-
Major
-
None
-
Lustre 2.4.2
-
None
-
RHEL 6 2.6.32-358.23.2.el6.x86_64 Lustre 2.4.2 clients (with a couple patches). Mellanox OFED. Servers are 2.4.3 with patches. Routed environment.
-
3
-
14707
Description
We are having clients crash on one of our clusters. We took a maintenance to add additional routers to the cluster. The BIOS, BMC, and FCB firmware versions were upgraded.
Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: IP: [<ffffffffa0519711>] kiblnd_tx_complete+0x21/0x380 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: PGD 0 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Oops: 0000 [#1] SMP Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: last sysfs file: /sys/kernel/kexec_crash_loaded Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CPU 9 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ipmi_devintf ko2iblnd(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) netconsole configfs nfsd exportfs autofs4 8021q garp stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables rdma_ucm(U) ib_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_en(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) mlx4_core(U) compat(U) dm_mirror dm_region_hash dm_log dm_mod uinput wmi acpi_pad dcdbas microcode ahci sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma ipv6 nfs lockd fscache auth_rpcgss nfs_acl sunrpc igb dca ptp pps_core [last unloaded: scsi_wait_scan] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Pid: 5159, comm: kiblnd_sd_02_02 Not tainted 2.6.32-358.23.2.el6.x86_64 #1 Dell Inc. PowerEdge C6220/03C9JJ Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RIP: 0010:[<ffffffffa0519711>] [<ffffffffa0519711>] kiblnd_tx_complete+0x21/0x380 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RSP: 0018:ffff881064345df0 EFLAGS: 00010286 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RAX: 0000000000000000 RBX: ffff881055cdf600 RCX: 0000000000000000 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RDX: ffff881070ff6640 RSI: 0000000000000005 RDI: 0000000000000000 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RBP: ffff881064345e40 R08: 0000000000000001 R09: 0000000000000001 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: R10: 0000000000000000 R11: 0000000000000400 R12: ffff881070ff6640 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: R13: ffff881055cdf628 R14: 0000000000000001 R15: 0000000000000202 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: FS: 00007f93c032d700(0000) GS:ffff88089c420000(0000) knlGS:0000000000000000 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CR2: 0000000000000020 CR3: 0000000001a85000 CR4: 00000000000407e0 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Process kiblnd_sd_02_02 (pid: 5159, threadinfo ffff881064344000, task ffff881067eda080) Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Stack: Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: 0000000000000001 0000000000000082 ffff881064345e40 ffffffff81055ad3 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: <d> 0000000164345e40 ffff881055cdf600 ffff881070ff6640 ffff881055cdf628 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: <d> 0000000000000001 0000000000000202 ffff881064345e50 ffffffffa051b4ae Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Call Trace: Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b4ae>] kiblnd_complete+0x2e/0xe0 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b8a8>] kiblnd_scheduler+0x348/0x7f0 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff81063990>] ? default_wake_function+0x0/0x20 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b560>] ? kiblnd_scheduler+0x0/0x7f0 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b560>] ? kiblnd_scheduler+0x0/0x7f0 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffffa051b560>] ? kiblnd_scheduler+0x0/0x7f0 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Code: ff ff e9 f8 f8 ff ff 0f 1f 00 55 48 89 e5 48 83 ec 50 48 89 5d d8 4c 89 65 e0 4c 89 6d e8 4c 89 75 f0 4c 89 7d f8 0f 1f 44 00 00 <66> 83 7f 20 00 48 8b 5f 18 49 89 fc 0f 8e e5 02 00 00 45 31 ed Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RIP [<ffffffffa0519711>] kiblnd_tx_complete+0x21/0x380 [ko2iblnd] Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: RSP <ffff881064345df0> Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: CR2: 0000000000000020 Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: ---[ end trace cbe30c69da425fe8 ]--- Jun 27 08:22:58 rhea355.ccs.ornl.gov kernel: Kernel panic - not syncing: Fatal exception
Unfortunately we haven't gotten a kdump for this yet (we have an admin working on it). I tracked this back to the offending line (unfortunately this was on a different node that might have a slightly different Lustre version, so it's possible, but unlikely, that the symbols don't match up)
(gdb) disassemble kiblnd_tx_complete Dump of assembler code for function kiblnd_tx_complete: 0x0000000000013750 <+0>: push %rbp 0x0000000000013751 <+1>: mov %rsp,%rbp 0x0000000000013754 <+4>: sub $0x50,%rsp 0x0000000000013758 <+8>: mov %rbx,-0x28(%rbp) 0x000000000001375c <+12>: mov %r12,-0x20(%rbp) 0x0000000000013760 <+16>: mov %r13,-0x18(%rbp) 0x0000000000013764 <+20>: mov %r14,-0x10(%rbp) 0x0000000000013768 <+24>: mov %r15,-0x8(%rbp) 0x000000000001376c <+28>: callq 0x13771 <kiblnd_tx_complete+33> 0x0000000000013771 <+33>: cmpw $0x0,0x20(%rdi) 0x0000000000013776 <+38>: mov 0x18(%rdi),%rbx 0x000000000001377a <+42>: mov %rdi,%r12 0x000000000001377d <+45>: jle 0x13a68 <kiblnd_tx_complete+792> 0x0000000000013783 <+51>: xor %r13d,%r13d
(gdb) list *0x0000000000013771
0x13771 is in kiblnd_tx_complete (/usr/src/debug/lustre-2.4.3/lnet/klnds/o2iblnd/o2iblnd_cb.c:1003).
warning: Source file is more recent than executable.
998 {
999 int failed = (status != IB_WC_SUCCESS);
1000 kib_conn_t *conn = tx->tx_conn;
1001 int idle;
1002
1003 LASSERT (tx->tx_sending > 0);
1004
1005 if (failed) {
1006 if (conn->ibc_state == IBLND_CONN_ESTABLISHED)
1007 CNETERR("Tx -> %s cookie "LPX64
'tx' appears to be NULL. It comes in as an argument to the function:
void kiblnd_tx_complete (kib_tx_t *tx, int status)
called as
kiblnd_tx_complete(kiblnd_wreqid2ptr(wc->wr_id), wc->status);
kiblnd_wreq2ptr is a simple wrapper that masks out the last 2 bits:
static inline void * kiblnd_wreqid2ptr (__u64 wreqid) { return (void *)(((unsigned long)wreqid) & ~IBLND_WID_MASK); }
Somehow it seems that wc->wr_id is zero. I'm going to keep looking, but we'd appreciate it if someone with more familiarity with ko2iblnd and OFED could take a look.