LU-6132: Unable to unload IB drivers with Lustre loaded


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Environment: RHEL 6.5 with MLNX_OFED 2.3 and ConnectX3/ConnectX3 Pro/ConnectIB HW (but I'm guessing it is reproducible with any OS and any OFED/upstream kernel).
    • 4
    • 17090

      Unloading the IB drivers while Lustre is loaded produces a hung task message, and the driver unload stays stuck forever.

      Steps to reproduce:
      1) Mount a Lustre filesystem from the server.
      2) On the server, run /etc/init.d/openibd stop
      3) The openibd script hangs.
      4) After 120 seconds, the following message appears in dmesg:

      LNetError: 131-3: Received notification of device removal
      Please shutdown LNET to allow this to proceed
      INFO: task modprobe:2837 blocked for more than 120 seconds.
      Not tainted 2.6.32_431.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      modprobe D 0000000000000000 0 2837 2777 0x00000000
      ffff88011c649bf8 0000000000000082 00000000ffffffff 00000000ffffffff
      ffff88011c649c38 ffffffff81060b13 ffff88011c649c78 00000000811a591f
      ffff8800cc23f058 ffff88011c649fd8 000000000000fbc8 ffff8800cc23f058
      Call Trace:
      [<ffffffff81060b13>] ? perf_event_task_sched_out+0x33/0x70
      [<ffffffff8105a570>] ? __dequeue_entity+0x30/0x50
      [<ffffffff81528c25>] schedule_timeout+0x215/0x2e0
      [<ffffffff81527d80>] ? thread_return+0x4e/0x76e
      [<ffffffff815288a3>] wait_for_common+0x123/0x180
      [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
      [<ffffffff810686da>] ? __cond_resched+0x2a/0x40
      [<ffffffff815289bd>] wait_for_completion+0x1d/0x20
      [<ffffffffa03170be>] cma_remove_one+0x18e/0x210 [rdma_cm]
      [<ffffffffa021f5ff>] ib_unregister_device+0x4f/0x100 [ib_core]
      [<ffffffffa0257aa6>] mlx4_ib_remove+0xc6/0x300 [mlx4_ib]
      [<ffffffffa0167881>] mlx4_remove_device+0x71/0x90 [mlx4_core]
      [<ffffffffa01679b3>] mlx4_unregister_interface+0x43/0x80 [mlx4_core]
      [<ffffffffa026f891>] __exit_compat+0x15/0x69 [mlx4_ib]
      [<ffffffff810b9454>] sys_delete_module+0x194/0x260
      [<ffffffff8152d8ce>] ? do_page_fault+0x3e/0xa0
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

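      The cause is that ko2iblnd does not handle the device removal event; it only logs the LNetError above and leaves its connections open, so cma_remove_one waits forever. It should probably handle device removal the same way as a disconnected event.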
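
A minimal user-space sketch of the suggested behaviour follows. It is not ko2iblnd code: every name in it (sketch_dev, sketch_conn, sketch_cm_callback and the SKETCH_EVENT_* values) is hypothetical and only models the idea. The assumption, based on the call trace above, is that device removal cannot finish while connections still hold references on the device, so the callback closes the connection on a removal event exactly as it does on a disconnect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for connection-manager event codes. */
enum sketch_cm_event {
    SKETCH_EVENT_ESTABLISHED,
    SKETCH_EVENT_DISCONNECTED,
    SKETCH_EVENT_DEVICE_REMOVAL,
};

/* Hypothetical device: unload may proceed only with no open connections. */
struct sketch_dev {
    int active_conns;
};

/* Hypothetical connection holding one reference on its device. */
struct sketch_conn {
    struct sketch_dev *dev;
    bool closed;
};

void sketch_conn_init(struct sketch_conn *conn, struct sketch_dev *dev)
{
    assert(conn != NULL && dev != NULL);
    conn->dev = dev;
    conn->closed = false;
    dev->active_conns++;            /* take a reference on the device */
}

/* Close the connection once and drop its device reference. */
void sketch_close_conn(struct sketch_conn *conn)
{
    assert(conn != NULL);
    if (!conn->closed) {
        conn->closed = true;
        conn->dev->active_conns--;
    }
}

/* True when nothing keeps the device pinned any more. */
bool sketch_dev_can_unload(const struct sketch_dev *dev)
{
    assert(dev != NULL);
    return dev->active_conns == 0;
}

/* Event callback: device removal shares the disconnect path. */
int sketch_cm_callback(struct sketch_conn *conn, enum sketch_cm_event event)
{
    switch (event) {
    case SKETCH_EVENT_ESTABLISHED:
        return 0;                   /* nothing to tear down */
    case SKETCH_EVENT_DEVICE_REMOVAL:
    case SKETCH_EVENT_DISCONNECTED:
        sketch_close_conn(conn);    /* same teardown for both events */
        return 0;
    }
    return 0;
}
```

In this model, a callback that ignored SKETCH_EVENT_DEVICE_REMOVAL would leave active_conns above zero, which corresponds to the wait_for_completion hang in the trace. A real fix would also have to deal with in-flight transfers and connection state, which the sketch leaves out.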

            Assignee: Dmitry Eremin (Inactive)
            Reporter: Yan Burman