[LU-3166] (o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG Created: 13/Apr/13 Updated: 18/Nov/13 Resolved: 26/Jul/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.5.0, Lustre 2.4.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Minh Diep |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn1 |
| Environment: | OFED-3.5, CentOS6.3 |
| Severity: | 2 |
| Rank (Obsolete): | 7719 |
| Description |
|
A bonding configuration is set up with IPoIB on OFED-3.5 for an active/standby LNET configuration. ko2iblnd with bond0 works well, but once the active slave interface is changed to another slave interface, the Lustre servers crash with a kiblnd_cm_callback() LBUG. This did not happen on OFED-1.5.x; it happens only on OFED-3.5. Here is the reproducer.

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: ib0 (primary_reselect always)
Currently Active Slave: ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 5000
Down Delay (ms): 0

Slave Interface: ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 80:00:00:48:fe:80
Slave queue ID: 0

Slave Interface: ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 80:00:00:49:fe:80
Slave queue ID: 0

Changing the active slave interface triggers the LBUG.

# ifenslave bond0 -c ib1
Message from syslogd@s15 at Apr 14 03:51:57 ...
 kernel:LNetError: 1627:0:(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG
Message from syslogd@s15 at Apr 14 03:51:57 ...
 kernel:Kernel panic - not syncing: LBUG

Here are the console messages and the backtrace from the crash dump.

# cat /var/crash/127.0.0.1-2013-04-14-03\:52\:04/vmcore-dmesg.txt
--snip--
<6>bonding: bond0: making interface ib1 the new active one.
<6>RDMA CM addr change for ndev bond0 used by id ffff88044bc15400
<3>LNetError: 1627:0:(o2iblnd_cb.c:2830:kiblnd_cm_callback()) Unexpected event: 14, status: 0
<0>LNetError: 1627:0:(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG
<4>Pid: 1627, comm: rdma_cm
<4>
<4>Call Trace:
<4> [<ffffffffa06e4895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa06e4e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0b66bda>] kiblnd_cm_callback+0x9a/0x1140 [ko2iblnd]
<4> [<ffffffffa059da18>] cma_ndev_work_handler+0x48/0xa0 [rdma_cm]
<4> [<ffffffffa059d9d0>] ? cma_ndev_work_handler+0x0/0xa0 [rdma_cm]
<4> [<ffffffff8108b120>] worker_thread+0x170/0x2a0
<4> [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
<4> [<ffffffff81090626>] kthread+0x96/0xa0
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffff81090590>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 1627, comm: rdma_cm Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff814e9811>] ? panic+0xa0/0x168
<4> [<ffffffffa06e4eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0b66bda>] ? kiblnd_cm_callback+0x9a/0x1140 [ko2iblnd]
<4> [<ffffffffa059da18>] ? cma_ndev_work_handler+0x48/0xa0 [rdma_cm]
<4> [<ffffffffa059d9d0>] ? cma_ndev_work_handler+0x0/0xa0 [rdma_cm]
<4> [<ffffffff8108b120>] ? worker_thread+0x170/0x2a0
<4> [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
<4> [<ffffffff81090626>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
<4> [<ffffffff81090590>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

crash> bt
PID: 1627 TASK: ffff88046e13f500 CPU: 0 COMMAND: "rdma_cm"
 #0 [ffff880464155c08] machine_kexec at ffffffff81031f7b
 #1 [ffff880464155c68] crash_kexec at ffffffff810b8c22
 #2 [ffff880464155d38] panic at ffffffff814e9818
 #3 [ffff880464155db8] lbug_with_loc at ffffffffa06e4eeb [libcfs]
 #4 [ffff880464155dd8] kiblnd_cm_callback at ffffffffa0b66bda [ko2iblnd]
 #5 [ffff880464155e08] cma_ndev_work_handler at ffffffffa059da18 [rdma_cm]
 #6 [ffff880464155e38] worker_thread at ffffffff8108b120
 #7 [ffff880464155ee8] kthread at ffffffff81090626
 #8 [ffff880464155f48] kernel_thread at ffffffff8100c0ca |
| Comments |
| Comment by Peter Jones [ 13/Apr/13 ] |
|
Ihara, We have just moved up to RHEL 6.4 on master, and OFED 3.5 is not supported for that release. Is this a combination that you are planning to use in production at a customer site? Could the version of OFED in the RHEL distribution meet your needs? Peter |
| Comment by Shuichi Ihara (Inactive) [ 13/Apr/13 ] |
|
Peter, That's why I retested on OFED-3.5 against Mellanox's new OFED to generalize the problem. We need to check whether Mellanox supports the RHEL 6.4 kernel, but even today there are potentially bugs somewhere when we use a 3.x-based OFED. |
| Comment by Peter Jones [ 13/Apr/13 ] |
|
OK, Ihara. Minh, can you please assist Ihara with this? |
| Comment by Minh Diep [ 13/Apr/13 ] |
|
Hi Ihara, Did you apply |
| Comment by Liang Zhen (Inactive) [ 13/Apr/13 ] |
|
kiblnd_cm_callback()) Unexpected event: 14, status: 0

Hmm... I checked the OFED source code; event 14 is RDMA_CM_EVENT_ADDR_CHANGE.

#ifdef HAVE_OFED_RDMA_CMEV_ADDRCHANGE
        case RDMA_CM_EVENT_ADDR_CHANGE:
                LCONSOLE_INFO("Physical link changed (eg hca/port)\n");
                return 0;
#endif
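
To illustrate why the guard matters, here is a minimal userspace sketch, not the ko2iblnd code. It assumes a simplified event switch in which every name prefixed with `sketch_` or `SKETCH_` is hypothetical; only the value 14 for the address-change event and the macro name HAVE_OFED_RDMA_CMEV_ADDRCHANGE come from this ticket. When the macro is not defined at build time, the address-change case is compiled out and event 14 falls through to the default branch, which in the real callback is where the "Unexpected event" message and the LBUG appear in the log.

```c
#include <stdio.h>

/* Hypothetical stand-ins for two rdma_cm event numbers. Only the value 14
 * for the address-change event is taken from the ticket. */
enum sketch_cm_event {
    SKETCH_EV_ESTABLISHED = 9,   /* assumed value, for illustration only */
    SKETCH_EV_ADDR_CHANGE = 14   /* "Unexpected event: 14" in the log */
};

#define SKETCH_HANDLED     0
#define SKETCH_UNEXPECTED -1     /* stand-in for the LBUG path */

/* Simplified model of an event switch with a configure-time guard. */
int sketch_cm_callback(int event)
{
    switch (event) {
    case SKETCH_EV_ESTABLISHED:
        return SKETCH_HANDLED;

#ifdef HAVE_OFED_RDMA_CMEV_ADDRCHANGE
    /* Present only if the configure check found the event. */
    case SKETCH_EV_ADDR_CHANGE:
        printf("Physical link changed (eg hca/port)\n");
        return SKETCH_HANDLED;
#endif

    default:
        /* In this sketch we only report; the log in this ticket shows
         * an LBUG immediately after the equivalent message. */
        fprintf(stderr, "Unexpected event: %d\n", event);
        return SKETCH_UNEXPECTED;
    }
}
```

Built as is, `sketch_cm_callback(14)` takes the default branch. Building with `-DHAVE_OFED_RDMA_CMEV_ADDRCHANGE` enables the guarded case, so the same call is handled and returns 0.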
|
| Comment by Shuichi Ihara (Inactive) [ 13/Apr/13 ] |
|
Thanks Liang. A header file was missing when these RDMA events were checked against OFED-3.5, so the checks were failing:

checking if OFED has ib_dma_map_single... yes
checking if OFED has RDMA_CM_EVENT_ADDR_CHANGE... no
checking if OFED has RDMA_CM_EVENT_TIMEWAIT_EXIT... no
checking if OFED has rdma_set_reuseaddr... no

I pushed a patch so that these checks compile correctly. http://review.whamcloud.com/6048 |
| Comment by Shuichi Ihara (Inactive) [ 14/May/13 ] |
|
At this moment, we don't have any OFED option for RHEL 6.4 except the RHEL in-kernel OFED. |
| Comment by Minh Diep [ 25/Jul/13 ] |
|
This landed on June 21st. Can I close this, Ihara? |
| Comment by Shuichi Ihara (Inactive) [ 25/Jul/13 ] |
|
Minh, Yes, please. |
| Comment by Minh Diep [ 07/Nov/13 ] |
|
patch for b2_1 http://review.whamcloud.com/#/c/8207/ |