Lustre / LU-3166

(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.5.0, Lustre 2.4.2
    • Affects Version/s: Lustre 2.4.0
    • Environment: OFED-3.5, CentOS6.3
    • Severity: 2
    • Rank (Obsolete): 7719

    Description

      A bonding interface (bond0) is set up over IPoIB on OFED-3.5 in active-backup mode to provide an active/standby LNet configuration. ko2iblnd over bond0 works well, but once the active slave is switched to the other slave interface, the Lustre servers crash with an LBUG in kiblnd_cm_callback(). This did not happen with OFED-1.5.x; it happens only with OFED-3.5.

      Here is a reproducer.

      # cat /proc/net/bonding/bond0 
      Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
      
      Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
      Primary Slave: ib0 (primary_reselect always)
      Currently Active Slave: ib0
      MII Status: up
      MII Polling Interval (ms): 100
      Up Delay (ms): 5000
      Down Delay (ms): 0
      
      Slave Interface: ib0
      MII Status: up
      Speed: Unknown
      Duplex: Unknown
      Link Failure Count: 0
      Permanent HW addr: 80:00:00:48:fe:80
      Slave queue ID: 0
      
      Slave Interface: ib1
      MII Status: up
      Speed: Unknown
      Duplex: Unknown
      Link Failure Count: 0
      Permanent HW addr: 80:00:00:49:fe:80
      Slave queue ID: 0
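
When scripting this reproducer it can help to confirm which slave is active before and after the switch. The sketch below is a hypothetical userspace helper (the names bond_active_slave() and read_text_file() are illustrative only, not part of Lustre or the bonding driver). It assumes the /proc/net/bonding/bond0 text has the format shown above and extracts the value of the "Currently Active Slave:" line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy the value of the "Currently Active Slave:"
 * line from /proc/net/bonding/<bond> text into 'out'.
 * Returns 0 on success, -1 if the line is absent, empty, or too long
 * for 'out'.
 */
int bond_active_slave(const char *text, char *out, size_t outlen)
{
    static const char key[] = "Currently Active Slave: ";
    const char *p;
    size_t len;

    assert(text != NULL && out != NULL);
    p = strstr(text, key);
    if (p == NULL)
        return -1;
    p += sizeof(key) - 1;          /* skip past the key itself */
    len = strcspn(p, "\r\n");      /* value runs to end of line */
    if (len == 0 || len >= outlen)
        return -1;
    memcpy(out, p, len);
    out[len] = '\0';
    return 0;
}

/*
 * Read a small text file (e.g. a procfs entry) into 'buf' and
 * NUL-terminate it. Returns the number of bytes read, or -1.
 */
long read_text_file(const char *path, char *buf, size_t buflen)
{
    FILE *f;
    size_t n;

    if (buflen == 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    n = fread(buf, 1, buflen - 1, f);
    fclose(f);
    buf[n] = '\0';
    return (long)n;
}
```

For example, reading /proc/net/bonding/bond0 with read_text_file() and passing the buffer to bond_active_slave() should yield ib0 before `ifenslave bond0 -c ib1` and ib1 afterwards, assuming the switch takes effect before the node panics.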
      

      Changing the active slave interface triggers the LBUG:

      # ifenslave bond0 -c ib1
      
      Message from syslogd@s15 at Apr 14 03:51:57 ...
       kernel:LNetError: 1627:0:(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG
      
      Message from syslogd@s15 at Apr 14 03:51:57 ...
       kernel:Kernel panic - not syncing: LBUG
      

      Here are the console messages and the backtrace from the crash dump.

      # cat /var/crash/127.0.0.1-2013-04-14-03\:52\:04/vmcore-dmesg.txt 
      --snip--
      <6>bonding: bond0: making interface ib1 the new active one.
      <6>RDMA CM addr change for ndev bond0 used by id ffff88044bc15400
      <3>LNetError: 1627:0:(o2iblnd_cb.c:2830:kiblnd_cm_callback()) Unexpected event: 14, status: 0
      <0>LNetError: 1627:0:(o2iblnd_cb.c:2831:kiblnd_cm_callback()) LBUG
      <4>Pid: 1627, comm: rdma_cm
      <4>
      <4>Call Trace:
      <4> [<ffffffffa06e4895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa06e4e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0b66bda>] kiblnd_cm_callback+0x9a/0x1140 [ko2iblnd]
      <4> [<ffffffffa059da18>] cma_ndev_work_handler+0x48/0xa0 [rdma_cm]
      <4> [<ffffffffa059d9d0>] ? cma_ndev_work_handler+0x0/0xa0 [rdma_cm]
      <4> [<ffffffff8108b120>] worker_thread+0x170/0x2a0
      <4> [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
      <4> [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
      <4> [<ffffffff81090626>] kthread+0x96/0xa0
      <4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
      <4> [<ffffffff81090590>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 1627, comm: rdma_cm Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff814e9811>] ? panic+0xa0/0x168
      <4> [<ffffffffa06e4eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0b66bda>] ? kiblnd_cm_callback+0x9a/0x1140 [ko2iblnd]
      <4> [<ffffffffa059da18>] ? cma_ndev_work_handler+0x48/0xa0 [rdma_cm]
      <4> [<ffffffffa059d9d0>] ? cma_ndev_work_handler+0x0/0xa0 [rdma_cm]
      <4> [<ffffffff8108b120>] ? worker_thread+0x170/0x2a0
      <4> [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
      <4> [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
      <4> [<ffffffff81090626>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      <4> [<ffffffff81090590>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      
      crash> bt
      PID: 1627   TASK: ffff88046e13f500  CPU: 0   COMMAND: "rdma_cm"
       #0 [ffff880464155c08] machine_kexec at ffffffff81031f7b
       #1 [ffff880464155c68] crash_kexec at ffffffff810b8c22
       #2 [ffff880464155d38] panic at ffffffff814e9818
       #3 [ffff880464155db8] lbug_with_loc at ffffffffa06e4eeb [libcfs]
       #4 [ffff880464155dd8] kiblnd_cm_callback at ffffffffa0b66bda [ko2iblnd]
       #5 [ffff880464155e08] cma_ndev_work_handler at ffffffffa059da18 [rdma_cm]
       #6 [ffff880464155e38] worker_thread at ffffffff8108b120
       #7 [ffff880464155ee8] kthread at ffffffff81090626
       #8 [ffff880464155f48] kernel_thread at ffffffff8100c0ca
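
The console log pins down the trigger: right after the bonding driver makes ib1 active, rdma_cm reports an address change for bond0 and kiblnd_cm_callback() receives event 14, which it treats as unexpected before hitting LBUG. In enum rdma_cm_event_type, 14 is RDMA_CM_EVENT_ADDR_CHANGE, delivered here from cma_ndev_work_handler() as the backtrace shows. The C sketch below is a simplified userspace model, not ko2iblnd code. It mirrors the event numbering locally and shows how an event without an explicit case falls into the fatal default branch, and how logging and ignoring ADDR_CHANGE would avoid the LBUG. The names sk_dispatch(), handled_mask and tolerate_addr_change are hypothetical, and the assumption that ignoring the event is sufficient for ko2iblnd is not confirmed by this report.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Local mirror of enum rdma_cm_event_type (<rdma/rdma_cm.h>), redefined
 * so this sketch builds without OFED headers. SK_ADDR_CHANGE == 14.
 */
enum sk_cm_event {
    SK_ADDR_RESOLVED, SK_ADDR_ERROR, SK_ROUTE_RESOLVED, SK_ROUTE_ERROR,
    SK_CONNECT_REQUEST, SK_CONNECT_RESPONSE, SK_CONNECT_ERROR,
    SK_UNREACHABLE, SK_REJECTED, SK_ESTABLISHED, SK_DISCONNECTED,
    SK_DEVICE_REMOVAL, SK_MULTICAST_JOIN, SK_MULTICAST_ERROR,
    SK_ADDR_CHANGE, SK_TIMEWAIT_EXIT,
    SK_EVENT_COUNT
};

static const char *const sk_event_names[SK_EVENT_COUNT] = {
    "ADDR_RESOLVED", "ADDR_ERROR", "ROUTE_RESOLVED", "ROUTE_ERROR",
    "CONNECT_REQUEST", "CONNECT_RESPONSE", "CONNECT_ERROR",
    "UNREACHABLE", "REJECTED", "ESTABLISHED", "DISCONNECTED",
    "DEVICE_REMOVAL", "MULTICAST_JOIN", "MULTICAST_ERROR",
    "ADDR_CHANGE", "TIMEWAIT_EXIT"
};

/* Decode the number printed in "Unexpected event: N". */
const char *sk_event_name(int ev)
{
    if (ev < 0 || ev >= SK_EVENT_COUNT)
        return "UNKNOWN";
    return sk_event_names[ev];
}

#define SK_BIT(ev) (1ul << (ev))

enum sk_outcome { SK_HANDLED, SK_IGNORED, SK_FATAL };

/*
 * Simplified model of a consumer's rdma_cm event handler. 'handled_mask'
 * (hypothetical) holds one bit per event that has an explicit case.
 * Anything else reaches the default branch, which in the crashed build
 * logged "Unexpected event" and called LBUG(); here that is SK_FATAL.
 * With 'tolerate_addr_change' set, ADDR_CHANGE is logged and ignored.
 */
enum sk_outcome sk_dispatch(int ev, unsigned long handled_mask,
                            int tolerate_addr_change)
{
    assert(SK_EVENT_COUNT <= 32);  /* each event needs a bit in the mask */

    if (ev >= 0 && ev < SK_EVENT_COUNT && (handled_mask & SK_BIT(ev)))
        return SK_HANDLED;
    if (ev == SK_ADDR_CHANGE && tolerate_addr_change) {
        printf("%s on bound netdev; ignoring\n", sk_event_name(ev));
        return SK_IGNORED;
    }
    printf("Unexpected event: %d (%s)\n", ev, sk_event_name(ev));
    return SK_FATAL;
}
```

Keeping the default branch fatal for genuinely unknown events while tolerating ADDR_CHANGE preserves the original defensive check; the change actually merged for this ticket may differ from this model.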
      

          People

            Assignee: Minh Diep (mdiep)
            Reporter: Shuichi Ihara (ihara, Inactive)
            Votes: 0
            Watchers: 6
