Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.4.0
-
OFED-3.5, CentOS6.3
-
2
-
7719
Description
A bonding configuration is set up with IPoIB on OFED-3.5 for an active/standby LNET configuration. ko2iblnd works well with bond0, but once the active slave interface is changed to another slave interface, the Lustre servers crash with an LBUG in kiblnd_cm_callback(). This did not happen on OFED-1.5.x; it only happens on OFED-3.5.
Here is the reproducer.
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: ib0 (primary_reselect always)
Currently Active Slave: ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 5000
Down Delay (ms): 0

Slave Interface: ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 80:00:00:48:fe:80
Slave queue ID: 0

Slave Interface: ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 80:00:00:49:fe:80
Slave queue ID: 0
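When repeating the reproducer, it can help to confirm which slave is active before and after the failover. The following is only a minimal sketch, not part of the original report: the helper name `bonding_summary` is hypothetical, and it assumes the status text uses the usual one-field-per-line layout of /proc/net/bonding/bond0.

```python
import re


def bonding_summary(text):
    """Return (active_slave, [slave interfaces]) parsed from bonding status text.

    Hypothetical helper: assumes one "Key: value" field per line, as the
    bonding driver normally prints in /proc/net/bonding/<bond>.
    """
    # "Currently Active Slave: ib0" -> "ib0" (None if the field is absent)
    active = re.search(r"^Currently Active Slave:\s*(\S+)", text, re.MULTILINE)
    # Every "Slave Interface: <name>" line names one enslaved interface
    slaves = re.findall(r"^Slave Interface:\s*(\S+)", text, re.MULTILINE)
    return (active.group(1) if active else None, slaves)
```

On a node with the bond configured, the function would be called on the contents of /proc/net/bonding/bond0; for the state shown above it should report ib0 as active, with ib0 and ib1 as slaves.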
Changing the active slave interface triggers the LBUG.
# ifenslave bond0 -c ib1

Message from syslogd@s15 at Apr 14 03:51:57 ...
 kernel:LNetError: 1627:0:(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG

Message from syslogd@s15 at Apr 14 03:51:57 ...
 kernel:Kernel panic - not syncing: LBUG
Here are the console messages and the backtrace from the crash dump.
# cat /var/crash/127.0.0.1-2013-04-14-03\:52\:04/vmcore-dmesg.txt
--snip--
<6>bonding: bond0: making interface ib1 the new active one.
<6>RDMA CM addr change for ndev bond0 used by id ffff88044bc15400
<3>LNetError: 1627:0:(o2iblnd_cb.c:2830:kiblnd_cm_callback()) Unexpected event: 14, status: 0
<0>LNetError: 1627:0:(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG
<4>Pid: 1627, comm: rdma_cm
<4>
<4>Call Trace:
<4> [<ffffffffa06e4895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa06e4e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0b66bda>] kiblnd_cm_callback+0x9a/0x1140 [ko2iblnd]
<4> [<ffffffffa059da18>] cma_ndev_work_handler+0x48/0xa0 [rdma_cm]
<4> [<ffffffffa059d9d0>] ? cma_ndev_work_handler+0x0/0xa0 [rdma_cm]
<4> [<ffffffff8108b120>] worker_thread+0x170/0x2a0
<4> [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
<4> [<ffffffff81090626>] kthread+0x96/0xa0
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffff81090590>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 1627, comm: rdma_cm Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff814e9811>] ? panic+0xa0/0x168
<4> [<ffffffffa06e4eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0b66bda>] ? kiblnd_cm_callback+0x9a/0x1140 [ko2iblnd]
<4> [<ffffffffa059da18>] ? cma_ndev_work_handler+0x48/0xa0 [rdma_cm]
<4> [<ffffffffa059d9d0>] ? cma_ndev_work_handler+0x0/0xa0 [rdma_cm]
<4> [<ffffffff8108b120>] ? worker_thread+0x170/0x2a0
<4> [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
<4> [<ffffffff81090626>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
<4> [<ffffffff81090590>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

crash> bt
PID: 1627  TASK: ffff88046e13f500  CPU: 0  COMMAND: "rdma_cm"
 #0 [ffff880464155c08] machine_kexec at ffffffff81031f7b
 #1 [ffff880464155c68] crash_kexec at ffffffff810b8c22
 #2 [ffff880464155d38] panic at ffffffff814e9818
 #3 [ffff880464155db8] lbug_with_loc at ffffffffa06e4eeb [libcfs]
 #4 [ffff880464155dd8] kiblnd_cm_callback at ffffffffa0b66bda [ko2iblnd]
 #5 [ffff880464155e08] cma_ndev_work_handler at ffffffffa059da18 [rdma_cm]
 #6 [ffff880464155e38] worker_thread at ffffffff8108b120
 #7 [ffff880464155ee8] kthread at ffffffff81090626
 #8 [ffff880464155f48] kernel_thread at ffffffff8100c0ca
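The "Unexpected event: 14" reported by kiblnd_cm_callback() can be decoded against the kernel's enum rdma_cm_event_type. The sketch below is illustrative only: the table mirrors the enum order in the mainline include/rdma/rdma_cm.h of that era, which is assumed (not verified here) to match the OFED-3.5 headers, and the helper name `event_name` is hypothetical. Under that assumption, event 14 is RDMA_CM_EVENT_ADDR_CHANGE, which agrees with the "RDMA CM addr change for ndev bond0" message logged just before the LBUG.

```python
# Mirror of enum rdma_cm_event_type (mainline include/rdma/rdma_cm.h order).
# Assumption: the OFED-3.5 headers use the same numbering.
RDMA_CM_EVENTS = [
    "RDMA_CM_EVENT_ADDR_RESOLVED",     # 0
    "RDMA_CM_EVENT_ADDR_ERROR",        # 1
    "RDMA_CM_EVENT_ROUTE_RESOLVED",    # 2
    "RDMA_CM_EVENT_ROUTE_ERROR",       # 3
    "RDMA_CM_EVENT_CONNECT_REQUEST",   # 4
    "RDMA_CM_EVENT_CONNECT_RESPONSE",  # 5
    "RDMA_CM_EVENT_CONNECT_ERROR",     # 6
    "RDMA_CM_EVENT_UNREACHABLE",       # 7
    "RDMA_CM_EVENT_REJECTED",          # 8
    "RDMA_CM_EVENT_ESTABLISHED",       # 9
    "RDMA_CM_EVENT_DISCONNECTED",      # 10
    "RDMA_CM_EVENT_DEVICE_REMOVAL",    # 11
    "RDMA_CM_EVENT_MULTICAST_JOIN",    # 12
    "RDMA_CM_EVENT_MULTICAST_ERROR",   # 13
    "RDMA_CM_EVENT_ADDR_CHANGE",       # 14
    "RDMA_CM_EVENT_TIMEWAIT_EXIT",     # 15
]


def event_name(code):
    """Map a numeric RDMA CM event code to its enum name (hypothetical helper)."""
    if 0 <= code < len(RDMA_CM_EVENTS):
        return RDMA_CM_EVENTS[code]
    return "unknown event %d" % code


print(event_name(14))
```

This suggests that the callback reaches its catch-all "unexpected event" path because it has no case for the address-change event raised on bonding failover; how the event should actually be handled is not stated in this report.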