[LU-3166] (o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG Created: 13/Apr/13  Updated: 18/Nov/13  Resolved: 26/Jul/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.5.0, Lustre 2.4.2

Type: Bug Priority: Critical
Reporter: Shuichi Ihara (Inactive) Assignee: Minh Diep
Resolution: Fixed Votes: 0
Labels: mn1
Environment:

OFED-3.5, CentOS6.3


Severity: 2
Rank (Obsolete): 7719

 Description   

A bonding configuration is set up with IPoIB on OFED-3.5 for an active/standby LNET configuration. ko2iblnd works well with bond0, but once the active slave interface is changed to another slave interface, the Lustre servers crash due to an LBUG in kiblnd_cm_callback(). This did not happen on OFED-1.5.x; it only happens on OFED-3.5.

Here is the reproducer.

# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: ib0 (primary_reselect always)
Currently Active Slave: ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 5000
Down Delay (ms): 0

Slave Interface: ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 80:00:00:48:fe:80
Slave queue ID: 0

Slave Interface: ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 80:00:00:49:fe:80
Slave queue ID: 0

Changing the active slave interface triggers the LBUG.

# ifenslave bond0 -c ib1

Message from syslogd@s15 at Apr 14 03:51:57 ...
 kernel:LNetError: 1627:0:(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG

Message from syslogd@s15 at Apr 14 03:51:57 ...
 kernel:Kernel panic - not syncing: LBUG

Here are the console messages and the backtrace from the crash dump.

# cat /var/crash/127.0.0.1-2013-04-14-03\:52\:04/vmcore-dmesg.txt 
--snip--
<6>bonding: bond0: making interface ib1 the new active one.
<6>RDMA CM addr change for ndev bond0 used by id ffff88044bc15400
<3>LNetError: 1627:0:(o2iblnd_cb.c:2830:kiblnd_cm_callback()) Unexpected event: 14, status: 0
<0>LNetError: 1627:0:(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG
<4>Pid: 1627, comm: rdma_cm
<4>
<4>Call Trace:
<4> [<ffffffffa06e4895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa06e4e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0b66bda>] kiblnd_cm_callback+0x9a/0x1140 [ko2iblnd]
<4> [<ffffffffa059da18>] cma_ndev_work_handler+0x48/0xa0 [rdma_cm]
<4> [<ffffffffa059d9d0>] ? cma_ndev_work_handler+0x0/0xa0 [rdma_cm]
<4> [<ffffffff8108b120>] worker_thread+0x170/0x2a0
<4> [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
<4> [<ffffffff81090626>] kthread+0x96/0xa0
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffff81090590>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 1627, comm: rdma_cm Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff814e9811>] ? panic+0xa0/0x168
<4> [<ffffffffa06e4eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0b66bda>] ? kiblnd_cm_callback+0x9a/0x1140 [ko2iblnd]
<4> [<ffffffffa059da18>] ? cma_ndev_work_handler+0x48/0xa0 [rdma_cm]
<4> [<ffffffffa059d9d0>] ? cma_ndev_work_handler+0x0/0xa0 [rdma_cm]
<4> [<ffffffff8108b120>] ? worker_thread+0x170/0x2a0
<4> [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
<4> [<ffffffff81090626>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
<4> [<ffffffff81090590>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

crash> bt
PID: 1627   TASK: ffff88046e13f500  CPU: 0   COMMAND: "rdma_cm"
 #0 [ffff880464155c08] machine_kexec at ffffffff81031f7b
 #1 [ffff880464155c68] crash_kexec at ffffffff810b8c22
 #2 [ffff880464155d38] panic at ffffffff814e9818
 #3 [ffff880464155db8] lbug_with_loc at ffffffffa06e4eeb [libcfs]
 #4 [ffff880464155dd8] kiblnd_cm_callback at ffffffffa0b66bda [ko2iblnd]
 #5 [ffff880464155e08] cma_ndev_work_handler at ffffffffa059da18 [rdma_cm]
 #6 [ffff880464155e38] worker_thread at ffffffff8108b120
 #7 [ffff880464155ee8] kthread at ffffffff81090626
 #8 [ffff880464155f48] kernel_thread at ffffffff8100c0ca


 Comments   
Comment by Peter Jones [ 13/Apr/13 ]

Ihara

We have just moved up to RHEL 6.4 on master and OFED 3.5 is not supported for that release. Is this a combination that you are planning to use in production at a customer site? Could the version of OFED in the RHEL distribution meet your needs?

Peter

Comment by Shuichi Ihara (Inactive) [ 13/Apr/13 ]

Peter,
OK..
Well, as far as I know, Mellanox is going to release its new OFED (called Mellanox OFED 2.0), which is based on OFED-3.x. There are several important improvements for SRP in this OFED.
LU-1468 helped to build Lustre with this new Mellanox OFED, but the server also crashed on this Mellanox OFED.

That's why I retested on plain OFED-3.5, to check that the problem is not specific to the new Mellanox OFED. We need to check whether Mellanox supports the RHEL6.4 kernel, but even today there are potentially bugs somewhere when we use a 3.x-based OFED.

Comment by Peter Jones [ 13/Apr/13 ]

ok Ihara. Minh can you please assist Ihara with this?

Comment by Minh Diep [ 13/Apr/13 ]

Hi Ihara,

Did you apply the LU-2975 patch to make OFED-3.5 work on RHEL6.3?

Comment by Liang Zhen (Inactive) [ 13/Apr/13 ]

kiblnd_cm_callback()) Unexpected event: 14, status: 0

Hmm... I checked the OFED source code; event 14 is RDMA_CM_EVENT_ADDR_CHANGE.
We have had code to handle this event for a very long time, unless o2iblnd is built against an old OFED version...

#ifdef HAVE_OFED_RDMA_CMEV_ADDRCHANGE
        case RDMA_CM_EVENT_ADDR_CHANGE:
                LCONSOLE_INFO("Physical link changed (eg hca/port)\n");
                return 0;
#endif

Comment by Shuichi Ihara (Inactive) [ 13/Apr/13 ]

Thanks Liang. A header file was missing when these RDMA events were checked with OFED-3.5, so the checks were failing:

checking if OFED has ib_dma_map_single... yes
checking if OFED has RDMA_CM_EVENT_ADDR_CHANGE... no
checking if OFED has RDMA_CM_EVENT_TIMEWAIT_EXIT... no
checking if OFED has rdma_set_reuseaddr... no

I pushed a patch so that these checks compile correctly: http://review.whamcloud.com/6048

Comment by Shuichi Ihara (Inactive) [ 14/May/13 ]

At this moment, we don't have any OFED option for RHEL6.4 except the RHEL in-kernel OFED.
However, these patches will be needed since a new OFED has been released by Mellanox: MLNX_OFED_LINUX-2.0-2.0.5, which is based on OFED-3.x compat rdma headers.

Comment by Minh Diep [ 25/Jul/13 ]

This landed on Jun 21st. Can I close this, Ihara?

Comment by Shuichi Ihara (Inactive) [ 25/Jul/13 ]

Minh, Yes, please.

Comment by Minh Diep [ 07/Nov/13 ]

patch for b2_1 http://review.whamcloud.com/#/c/8207/
patch for b2_4 http://review.whamcloud.com/#/c/8205/
