Details
Type: Bug
Resolution: Unresolved
Priority: Major
Environment:
lustre-2.12.7_2.llnl-2.ch6.x86_64
zfs-0.7.11-9.8llnl.ch6.x86_64
3.10.0-1160.45.1.1chaos.ch6.x86_64
Severity: 3
Description
LNet issues (see LU-15234 and LU-14026) cause clients and Lustre servers to report in their console logs that they lost their connection to the MGS.
We are working on solving the LNet issues, but this may also be revealing error-path issues that should be fixed.
MDT0, which usually runs on the same server as the MGS, is one of the targets that reports a lost connection. (The MGS and MDT0 are separate devices, stored in distinct datasets and started and stopped separately.)
MGC172.19.3.98@o2ib600: Connection to MGS (at 0@lo) was lost
Attempting to shut down the MDT hangs, with this stack reported by the watchdog:
schedule_preempt_disabled+0x39/0x90
__mutex_lock_slowpath+0x10f/0x250
mutex_lock+0x32/0x42
mgc_process_config+0x21a/0x1420 [mgc]
obd_process_config.constprop.14+0x75/0x210 [obdclass]
? lprocfs_counter_add+0xf9/0x160 [obdclass]
lustre_end_log+0x1ff/0x550 [obdclass]
server_put_super+0x82e/0xd00 [obdclass]
generic_shutdown_super+0x6d/0x110
kill_anon_super+0x12/0x20
lustre_kill_super+0x32/0x50 [obdclass]
deactivate_locked_super+0x4e/0x70
deactivate_super+0x46/0x60
cleanup_mnt+0x3f/0x80
__cleanup_mnt+0x12/0x20
task_work_run+0xbb/0xf0
do_notify_resume+0xa5/0xc0
int_signal+0x12/0x17
The server was crashed and a dump collected. The stacks for the umount process and the ll_cfg_requeue process both have pointers to the "ls1-mdtir" config_llog_data structure; I believe cld->cld_lock is held by ll_cfg_requeue and umount is waiting on it.
PID: 4504 TASK: ffff8e8c9edc8000 CPU: 24 COMMAND: "ll_cfg_requeue"
 #0 [ffff8e8ac474f970] __schedule at ffffffff9d3b6788
 #1 [ffff8e8ac474f9d8] schedule at ffffffff9d3b6ce9
 #2 [ffff8e8ac474f9e8] schedule_timeout at ffffffff9d3b4528
 #3 [ffff8e8ac474fa98] ldlm_completion_ast at ffffffffc14ac650 [ptlrpc]
 #4 [ffff8e8ac474fb40] ldlm_cli_enqueue_fini at ffffffffc14ae83f [ptlrpc]
 #5 [ffff8e8ac474fbf0] ldlm_cli_enqueue at ffffffffc14b10d1 [ptlrpc]
 #6 [ffff8e8ac474fca8] mgc_enqueue at ffffffffc0fb94cf [mgc]
 #7 [ffff8e8ac474fd70] mgc_process_log at ffffffffc0fbf393 [mgc]
 #8 [ffff8e8ac474fe30] mgc_requeue_thread at ffffffffc0fc1b10 [mgc]
 #9 [ffff8e8ac474fec8] kthread at ffffffff9cccb221
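To make the suspected hang concrete, here is a toy model in Python. It is not Lustre code, and it only illustrates my hypothesis rather than confirming it. It assumes the requeue thread takes cld->cld_lock and then blocks waiting for a lock grant from the MGS that never arrives, while umount tries to take the same mutex. All names (simulate_hang, cld_lock, grant) are made up for the illustration.

```python
import threading


def simulate_hang(wait_timeout=0.2):
    """Return True if the 'umount' side gets the mutex within wait_timeout."""
    cld_lock = threading.Lock()    # stands in for cld->cld_lock (hypothetical model)
    grant = threading.Event()      # stands in for the lock grant from the MGS
    holding = threading.Event()    # set once the requeue thread owns cld_lock

    def requeue():
        # Models the suspected behaviour: take the mutex, then block on the enqueue.
        with cld_lock:
            holding.set()
            grant.wait()           # never satisfied while the connection is lost

    worker = threading.Thread(target=requeue, name="ll_cfg_requeue")
    worker.start()
    holding.wait()                 # make sure the mutex is really held first

    # Models the umount path; the real mutex_lock() has no timeout and hangs.
    acquired = cld_lock.acquire(timeout=wait_timeout)
    if acquired:
        cld_lock.release()

    grant.set()                    # deliver the grant only so the demo can exit
    worker.join()
    return acquired


if __name__ == "__main__":
    print(simulate_hang())         # False: umount never gets the mutex
```

In the model, the timeout on the umount side exists only so the script terminates. If the hypothesis is right, the real fix would be in the error path: either the enqueue gives up once the connection is known to be lost, or the mutex is not held across the wait.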
I can provide console logs and the crash dump. I do not have Lustre debug logs.