Lustre / LU-6132

Unable to unload IB drivers with Lustre loaded

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.7.0
    • RHEL 6.5 with MLNX_OFED 2.3 and ConnectX3/ConnectX3 Pro/ConnectIB HW (but I'm guessing it is reproducible with any OS and any OFED/upstream kernel).
    • 4
    • 17090

    Description

      Unloading the IB drivers while Lustre is loaded produces a hung-task message, and the driver unload never completes.

      Steps to reproduce:
      1) Have a Lustre mount to the server.
      2) On the server, run /etc/init.d/openibd stop
      3) The openibd script hangs.
      4) After 120 seconds, the following message appears in dmesg:

      LNetError: 131-3: Received notification of device removal
      Please shutdown LNET to allow this to proceed
      INFO: task modprobe:2837 blocked for more than 120 seconds.
      Not tainted 2.6.32_431.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      modprobe D 0000000000000000 0 2837 2777 0x00000000
      ffff88011c649bf8 0000000000000082 00000000ffffffff 00000000ffffffff
      ffff88011c649c38 ffffffff81060b13 ffff88011c649c78 00000000811a591f
      ffff8800cc23f058 ffff88011c649fd8 000000000000fbc8 ffff8800cc23f058
      Call Trace:
      [<ffffffff81060b13>] ? perf_event_task_sched_out+0x33/0x70
      [<ffffffff8105a570>] ? __dequeue_entity+0x30/0x50
      [<ffffffff81528c25>] schedule_timeout+0x215/0x2e0
      [<ffffffff81527d80>] ? thread_return+0x4e/0x76e
      [<ffffffff815288a3>] wait_for_common+0x123/0x180
      [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
      [<ffffffff810686da>] ? __cond_resched+0x2a/0x40
      [<ffffffff815289bd>] wait_for_completion+0x1d/0x20
      [<ffffffffa03170be>] cma_remove_one+0x18e/0x210 [rdma_cm]
      [<ffffffffa021f5ff>] ib_unregister_device+0x4f/0x100 [ib_core]
      [<ffffffffa0257aa6>] mlx4_ib_remove+0xc6/0x300 [mlx4_ib]
      [<ffffffffa0167881>] mlx4_remove_device+0x71/0x90 [mlx4_core]
      [<ffffffffa01679b3>] mlx4_unregister_interface+0x43/0x80 [mlx4_core]
      [<ffffffffa026f891>] __exit_compat+0x15/0x69 [mlx4_ib]
      [<ffffffff810b9454>] sys_delete_module+0x194/0x260
      [<ffffffff8152d8ce>] ? do_page_fault+0x3e/0xa0
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

      The cause is that ko2iblnd does not handle the device removal event, so modprobe stays blocked in cma_remove_one (see the call trace above). It should probably handle device removal the same way as a disconnected event.
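
The suggested fix can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not ko2iblnd or rdma_cm code: the event codes, `struct conn`, `struct device`, `handle_event` and `remove_device` are all hypothetical names invented for the illustration. It assumes each open connection holds one reference on the device and that removal can only finish once every reference is gone, which is roughly what the wait in cma_remove_one suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes, loosely modeled on connection-manager events. */
enum cm_event { EV_DISCONNECTED, EV_DEVICE_REMOVAL };

/* Hypothetical connection state; not the real ko2iblnd structures. */
struct conn {
    int closed; /* 1 once the connection has been torn down */
};

/* Hypothetical device; refs counts connections still holding it. */
struct device {
    int refs;
};

/* Tear down a connection once and drop its device reference. */
static void close_conn(struct conn *c, struct device *dev)
{
    assert(c != NULL && dev != NULL);
    if (!c->closed) {
        c->closed = 1;
        dev->refs--;
    }
}

/*
 * Event handler. With removal_as_disconnect == 0, a removal event is
 * ignored (the behaviour described in this report). With
 * removal_as_disconnect != 0, it follows the disconnect path instead.
 */
static void handle_event(struct conn *c, struct device *dev,
                         enum cm_event ev, int removal_as_disconnect)
{
    switch (ev) {
    case EV_DISCONNECTED:
        close_conn(c, dev);
        break;
    case EV_DEVICE_REMOVAL:
        if (removal_as_disconnect)
            close_conn(c, dev);
        break;
    }
}

/*
 * Deliver a removal event to every connection and return how many
 * references remain. A non-zero result stands in for the case where
 * the real remover would keep waiting.
 */
static int remove_device(struct device *dev, struct conn *conns, size_t n,
                         int removal_as_disconnect)
{
    for (size_t i = 0; i < n; i++)
        handle_event(&conns[i], dev, EV_DEVICE_REMOVAL,
                     removal_as_disconnect);
    return dev->refs;
}
```

In this model, ignoring the removal event leaves every reference in place, so the remover could never proceed. Routing it through the disconnect path releases them. Whether the real fix is this simple depends on ko2iblnd internals that this report does not describe.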

            People

              dmiter Dmitry Eremin (Inactive)
              yanb Yan Burman