[LU-3166] (o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.5.0, Lustre 2.4.2
    • Affects Version/s: Lustre 2.4.0
    • Environment: OFED-3.5, CentOS6.3
    • 2
    • 7719

    Description

      A bonding configuration is set up with IPoIB on OFED-3.5 for an active/standby LNET configuration. ko2iblnd with bond0 works well, but once the active slave interface is changed to another slave interface, the Lustre servers crash with a kiblnd_cm_callback() LBUG. This does not happen on OFED-1.5.x, only on OFED-3.5.

      Here is the reproducer.

      # cat /proc/net/bonding/bond0 
      Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
      
      Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
      Primary Slave: ib0 (primary_reselect always)
      Currently Active Slave: ib0
      MII Status: up
      MII Polling Interval (ms): 100
      Up Delay (ms): 5000
      Down Delay (ms): 0
      
      Slave Interface: ib0
      MII Status: up
      Speed: Unknown
      Duplex: Unknown
      Link Failure Count: 0
      Permanent HW addr: 80:00:00:48:fe:80
      Slave queue ID: 0
      
      Slave Interface: ib1
      MII Status: up
      Speed: Unknown
      Duplex: Unknown
      Link Failure Count: 0
      Permanent HW addr: 80:00:00:49:fe:80
      Slave queue ID: 0
      

      Changing the active slave interface triggers the LBUG.

      # ifenslave bond0 -c ib1
      
      Message from syslogd@s15 at Apr 14 03:51:57 ...
       kernel:LNetError: 1627:0:(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG
      
      Message from syslogd@s15 at Apr 14 03:51:57 ...
       kernel:Kernel panic - not syncing: LBUG
      

      Here are the console messages and the backtrace from the crash dump.

      # cat /var/crash/127.0.0.1-2013-04-14-03\:52\:04/vmcore-dmesg.txt 
      --snip--
      <6>bonding: bond0: making interface ib1 the new active one.
      <6>RDMA CM addr change for ndev bond0 used by id ffff88044bc15400
      <3>LNetError: 1627:0:(o2iblnd_cb.c:2830:kiblnd_cm_callback()) Unexpected event: 14, status: 0
      <0>LNetError: 1627:0:(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG
      <4>Pid: 1627, comm: rdma_cm
      <4>
      <4>Call Trace:
      <4> [<ffffffffa06e4895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa06e4e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0b66bda>] kiblnd_cm_callback+0x9a/0x1140 [ko2iblnd]
      <4> [<ffffffffa059da18>] cma_ndev_work_handler+0x48/0xa0 [rdma_cm]
      <4> [<ffffffffa059d9d0>] ? cma_ndev_work_handler+0x0/0xa0 [rdma_cm]
      <4> [<ffffffff8108b120>] worker_thread+0x170/0x2a0
      <4> [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
      <4> [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
      <4> [<ffffffff81090626>] kthread+0x96/0xa0
      <4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
      <4> [<ffffffff81090590>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 1627, comm: rdma_cm Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff814e9811>] ? panic+0xa0/0x168
      <4> [<ffffffffa06e4eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0b66bda>] ? kiblnd_cm_callback+0x9a/0x1140 [ko2iblnd]
      <4> [<ffffffffa059da18>] ? cma_ndev_work_handler+0x48/0xa0 [rdma_cm]
      <4> [<ffffffffa059d9d0>] ? cma_ndev_work_handler+0x0/0xa0 [rdma_cm]
      <4> [<ffffffff8108b120>] ? worker_thread+0x170/0x2a0
      <4> [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
      <4> [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
      <4> [<ffffffff81090626>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      <4> [<ffffffff81090590>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      
      crash> bt
      PID: 1627   TASK: ffff88046e13f500  CPU: 0   COMMAND: "rdma_cm"
       #0 [ffff880464155c08] machine_kexec at ffffffff81031f7b
       #1 [ffff880464155c68] crash_kexec at ffffffff810b8c22
       #2 [ffff880464155d38] panic at ffffffff814e9818
       #3 [ffff880464155db8] lbug_with_loc at ffffffffa06e4eeb [libcfs]
       #4 [ffff880464155dd8] kiblnd_cm_callback at ffffffffa0b66bda [ko2iblnd]
       #5 [ffff880464155e08] cma_ndev_work_handler at ffffffffa059da18 [rdma_cm]
       #6 [ffff880464155e38] worker_thread at ffffffff8108b120
       #7 [ffff880464155ee8] kthread at ffffffff81090626
       #8 [ffff880464155f48] kernel_thread at ffffffff8100c0ca
      

        Activity

          mdiep Minh Diep added a comment -

          Patch for b2_1: http://review.whamcloud.com/#/c/8207/
          Patch for b2_4: http://review.whamcloud.com/#/c/8205/

          ihara Shuichi Ihara (Inactive) added a comment -

          Minh, yes, please.
          mdiep Minh Diep added a comment -

          This landed on June 21st. Can I close this, Ihara?

          ihara Shuichi Ihara (Inactive) added a comment -

          At the moment, we have no OFED option for RHEL6.4 other than the RHEL in-kernel OFED.
          However, these patches will be needed because Mellanox has released a new OFED: MLNX_OFED_LINUX-2.0-2.0.5, which is based on the OFED-3.x compat rdma headers.

          ihara Shuichi Ihara (Inactive) added a comment -

          Thanks, Liang. The configure checks for these RDMA events were missing a header file when run against OFED-3.5, so they failed:

          checking if OFED has ib_dma_map_single... yes
          checking if OFED has RDMA_CM_EVENT_ADDR_CHANGE... no
          checking if OFED has RDMA_CM_EVENT_TIMEWAIT_EXIT... no
          checking if OFED has rdma_set_reuseaddr... no

          I pushed a patch so that these checks compile correctly: http://review.whamcloud.com/6048

          liang Liang Zhen (Inactive) added a comment -

          kiblnd_cm_callback()) Unexpected event: 14, status: 0

          Hmm... I checked the OFED source code; event 14 is RDMA_CM_EVENT_ADDR_CHANGE.
          We have had code to handle this event for a very long time, unless o2iblnd is built against an old OFED version...

          #ifdef HAVE_OFED_RDMA_CMEV_ADDRCHANGE
                  case RDMA_CM_EVENT_ADDR_CHANGE:
                          LCONSOLE_INFO("Physical link changed (eg hca/port)\n");
                          return 0;
          #endif

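
The guarded case quoted above suggests why the event fell through to the LBUG: if the build-time check leaves HAVE_OFED_RDMA_CMEV_ADDRCHANGE undefined, the case label is never compiled and event 14 lands in the default branch. The following is a minimal user-space sketch of that behavior, not the actual o2iblnd code. The names sketch_cm_callback, SKETCH_EVENT_ADDR_CHANGE, SKETCH_HANDLED and SKETCH_UNEXPECTED are hypothetical and exist only for this illustration; the value 14 and the message string are taken from the ticket.

```c
#include <stdio.h>

/* Stand-in for the value reported in the ticket:
 * event 14 is RDMA_CM_EVENT_ADDR_CHANGE. */
enum sketch_event { SKETCH_EVENT_ADDR_CHANGE = 14 };

/* Hypothetical return codes used only by this sketch. */
#define SKETCH_HANDLED     0
#define SKETCH_UNEXPECTED -1  /* where the real callback would hit LBUG */

/* Remove this define to mimic a build whose configure check said "no". */
#define HAVE_OFED_RDMA_CMEV_ADDRCHANGE 1

static int sketch_cm_callback(int event)
{
        switch (event) {
#ifdef HAVE_OFED_RDMA_CMEV_ADDRCHANGE
        case SKETCH_EVENT_ADDR_CHANGE:
                /* Compiled only when the feature macro is defined. */
                printf("Physical link changed (eg hca/port)\n");
                return SKETCH_HANDLED;
#endif
        default:
                /* Without the macro, event 14 ends up here. */
                fprintf(stderr, "Unexpected event: %d\n", event);
                return SKETCH_UNEXPECTED;
        }
}
```

With the define in place, calling sketch_cm_callback(14) returns SKETCH_HANDLED. With the define removed, the same call takes the default branch, which corresponds to the "Unexpected event: 14" line in the console log.
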
          mdiep Minh Diep added a comment -

          Hi Ihara,

          Did you apply the LU-2975 patch to make OFED-3.5 work on RHEL 6.3?

          pjones Peter Jones added a comment -

          OK, Ihara. Minh, can you please assist Ihara with this?

          ihara Shuichi Ihara (Inactive) added a comment - - edited

          Peter,
          OK.
          As far as I know, Mellanox is going to release its new OFED (called Mellanox OFED 2.0), which is based on OFED-3.x. There are several important improvements for SRP in this OFED.
          LU-1468 helped to build Lustre with this new Mellanox OFED, but the server crashed on this Mellanox OFED as well.

          That's why I retested on OFED-3.5 rather than the new Mellanox OFED, to generalize the problem. We need to check whether Mellanox supports the RHEL6.4 kernel, but even today there are potentially bugs somewhere when we use a 3.x-based OFED.

          pjones Peter Jones added a comment -

          Ihara

          We have just moved up to RHEL 6.4 on master and OFED 3.5 is not supported for that release. Is this a combination that you are planning to use in production at a customer site? Could the version of OFED in the RHEL distribution meet your needs?

          Peter


          People

            mdiep Minh Diep
            ihara Shuichi Ihara (Inactive)