Details
Bug
Resolution: Unresolved
Critical
None
Lustre 2.7.0
RHEL 6.5 with MLNX_OFED 2.3 and ConnectX3/ConnectX3 Pro/ConnectIB HW (but I'm guessing it is reproducible with any OS and any OFED/upstream kernel).
4
17090
Description
Unloading the IB drivers produces a hung-task message, and the driver unload stays stuck forever.
Steps to reproduce:
1) Have a Lustre mount to the server
2) On the server, run /etc/init.d/openibd stop
3) The openibd script hangs
4) After 120 seconds, the following message appears in dmesg:
LNetError: 131-3: Received notification of device removal
Please shutdown LNET to allow this to proceed
INFO: task modprobe:2837 blocked for more than 120 seconds.
Not tainted 2.6.32_431.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobe D 0000000000000000 0 2837 2777 0x00000000
ffff88011c649bf8 0000000000000082 00000000ffffffff 00000000ffffffff
ffff88011c649c38 ffffffff81060b13 ffff88011c649c78 00000000811a591f
ffff8800cc23f058 ffff88011c649fd8 000000000000fbc8 ffff8800cc23f058
Call Trace:
[<ffffffff81060b13>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff8105a570>] ? __dequeue_entity+0x30/0x50
[<ffffffff81528c25>] schedule_timeout+0x215/0x2e0
[<ffffffff81527d80>] ? thread_return+0x4e/0x76e
[<ffffffff815288a3>] wait_for_common+0x123/0x180
[<ffffffff81065df0>] ? default_wake_function+0x0/0x20
[<ffffffff810686da>] ? __cond_resched+0x2a/0x40
[<ffffffff815289bd>] wait_for_completion+0x1d/0x20
[<ffffffffa03170be>] cma_remove_one+0x18e/0x210 [rdma_cm]
[<ffffffffa021f5ff>] ib_unregister_device+0x4f/0x100 [ib_core]
[<ffffffffa0257aa6>] mlx4_ib_remove+0xc6/0x300 [mlx4_ib]
[<ffffffffa0167881>] mlx4_remove_device+0x71/0x90 [mlx4_core]
[<ffffffffa01679b3>] mlx4_unregister_interface+0x43/0x80 [mlx4_core]
[<ffffffffa026f891>] __exit_compat+0x15/0x69 [mlx4_ib]
[<ffffffff810b9454>] sys_delete_module+0x194/0x260
[<ffffffff8152d8ce>] ? do_page_fault+0x3e/0xa0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
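The cause is that ko2iblnd does not handle device removal: it only logs the notification above and leaves its connections open, so cma_remove_one waits forever. It should probably handle device removal the same way as a disconnected event.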
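To illustrate the suggestion, here is a minimal user-space sketch of an event callback that routes device removal through the same teardown path as a disconnect. It is not ko2iblnd code: the `sketch_*` names, the event enum, and the connection struct are all hypothetical stand-ins, and the real fix would have to release whatever resources the driver actually holds on the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the connection-manager events of interest. */
enum sketch_cm_event {
    SKETCH_EVENT_ESTABLISHED,
    SKETCH_EVENT_DISCONNECTED,
    SKETCH_EVENT_DEVICE_REMOVAL,
};

/* Hypothetical connection state; not the real ko2iblnd structure. */
struct sketch_conn {
    bool connected;   /* traffic may flow on this connection */
    bool id_released; /* connection no longer pins the device */
};

/* Shared teardown path used by both disconnect and device removal. */
static void sketch_close_conn(struct sketch_conn *conn)
{
    conn->connected = false;
    conn->id_released = true;
}

/*
 * Simulated event callback.
 * Returns 1 if the connection was torn down, 0 otherwise.
 */
static int sketch_cm_callback(struct sketch_conn *conn,
                              enum sketch_cm_event ev)
{
    assert(conn != NULL);

    switch (ev) {
    case SKETCH_EVENT_ESTABLISHED:
        conn->connected = true;
        return 0;
    case SKETCH_EVENT_DEVICE_REMOVAL: /* treated exactly like a disconnect */
    case SKETCH_EVENT_DISCONNECTED:
        sketch_close_conn(conn);
        return 1;
    }
    return 0;
}
```

In this sketch, delivering `SKETCH_EVENT_DEVICE_REMOVAL` to an established connection leaves it closed and released, which is the state the unloading driver would need to see before it can stop waiting.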