LU-15453: MDT shutdown hangs on mutex_lock, possibly cld_lock


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: None
    • Environment: lustre-2.12.7_2.llnl-2.ch6.x86_64
      zfs-0.7.11-9.8llnl.ch6.x86_64
      3.10.0-1160.45.1.1chaos.ch6.x86_64
    • Severity: 3

    Description

      LNet issues (see LU-15234 and LU-14026) cause clients and Lustre servers to report, via their console logs, that they have lost the connection to the MGS.

      We are working on solving the LNet issues, but the lost connections may also be revealing error-path issues that should be fixed.

      MDT0, which usually runs on the same server as the MGS, is one of the targets that reports a lost connection (the MDT and the MGS are separate devices, stored in distinct datasets, and started/stopped separately):

      MGC172.19.3.98@o2ib600: Connection to MGS (at 0@lo) was lost 

      Attempting to shut down the MDT hangs, with this stack reported by the watchdog:

       schedule_preempt_disabled+0x39/0x90
       __mutex_lock_slowpath+0x10f/0x250
       mutex_lock+0x32/0x42
       mgc_process_config+0x21a/0x1420 [mgc]
       obd_process_config.constprop.14+0x75/0x210 [obdclass]
       ? lprocfs_counter_add+0xf9/0x160 [obdclass]
       lustre_end_log+0x1ff/0x550 [obdclass]
       server_put_super+0x82e/0xd00 [obdclass]
       generic_shutdown_super+0x6d/0x110
       kill_anon_super+0x12/0x20
       lustre_kill_super+0x32/0x50 [obdclass]
       deactivate_locked_super+0x4e/0x70
       deactivate_super+0x46/0x60
       cleanup_mnt+0x3f/0x80
       __cleanup_mnt+0x12/0x20
       task_work_run+0xbb/0xf0
       do_notify_resume+0xa5/0xc0
       int_signal+0x12/0x17
      

      The server was crashed and a dump collected.  The stacks for the umount process and the ll_cfg_requeue process both have pointers to the "ls1-mdtir" config_llog_data structure; I believe cld->cld_lock is held by ll_cfg_requeue and umount is waiting on it.

      PID: 4504   TASK: ffff8e8c9edc8000  CPU: 24  COMMAND: "ll_cfg_requeue"
       #0 [ffff8e8ac474f970] __schedule at ffffffff9d3b6788
       #1 [ffff8e8ac474f9d8] schedule at ffffffff9d3b6ce9
       #2 [ffff8e8ac474f9e8] schedule_timeout at ffffffff9d3b4528
       #3 [ffff8e8ac474fa98] ldlm_completion_ast at ffffffffc14ac650 [ptlrpc]
       #4 [ffff8e8ac474fb40] ldlm_cli_enqueue_fini at ffffffffc14ae83f [ptlrpc]
       #5 [ffff8e8ac474fbf0] ldlm_cli_enqueue at ffffffffc14b10d1 [ptlrpc]
       #6 [ffff8e8ac474fca8] mgc_enqueue at ffffffffc0fb94cf [mgc]
       #7 [ffff8e8ac474fd70] mgc_process_log at ffffffffc0fbf393 [mgc]
       #8 [ffff8e8ac474fe30] mgc_requeue_thread at ffffffffc0fc1b10 [mgc]
       #9 [ffff8e8ac474fec8] kthread at ffffffff9cccb221
      

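      If that is the case, the dependency is straightforward: ll_cfg_requeue holds cld->cld_lock across an LDLM enqueue that can never complete because the MGS is unreachable, while umount needs the same mutex to tear down the config log. The toy program below is only a userspace sketch of that interaction under the hypothesis above (requeue_thread, umount_thread, and the pthread cld_lock here are illustrative names, not Lustre code):

      /*
       * Userspace model of the suspected deadlock.  "requeue" stands in for
       * ll_cfg_requeue, which is believed to take cld->cld_lock in
       * mgc_process_log() and then wait forever for an LDLM reply from the
       * unreachable MGS; "umount" stands in for the lustre_end_log ->
       * mgc_process_config path, which needs the same lock.
       *
       * Build: gcc -pthread cld_lock_model.c -o cld_lock_model
       */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t cld_lock = PTHREAD_MUTEX_INITIALIZER;

      static void *requeue_thread(void *arg)
      {
              (void)arg;
              pthread_mutex_lock(&cld_lock);   /* mgc_process_log() takes cld_lock */
              printf("requeue: holding cld_lock, waiting for MGS reply\n");
              pause();                         /* ldlm_completion_ast() never returns */
              pthread_mutex_unlock(&cld_lock);
              return NULL;
      }

      static void *umount_thread(void *arg)
      {
              (void)arg;
              sleep(1);                        /* let the requeue thread win the lock */
              printf("umount: lustre_end_log -> mgc_process_config, taking cld_lock\n");
              pthread_mutex_lock(&cld_lock);   /* hangs, like __mutex_lock_slowpath above */
              printf("umount: config log cleanup done\n");   /* never printed */
              pthread_mutex_unlock(&cld_lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t requeue, umount;

              pthread_create(&requeue, NULL, requeue_thread, NULL);
              pthread_create(&umount, NULL, umount_thread, NULL);
              pthread_join(umount, NULL);      /* blocks forever, like the hung umount */
              return 0;
      }

      Run it and the second thread sits in pthread_mutex_lock() indefinitely, which is the same shape as the umount stack above.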
      I can provide console logs and the crash dump.  I do not have Lustre debug logs.
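
      If it helps, something like the following crash(8) session (with the mgc module debuginfo loaded) should confirm or refute the cld_lock theory; the config_llog_data address is a placeholder that would come from the mgc_process_log / mgc_process_config frames, and struct mutex only carries an owner pointer when the kernel is built with CONFIG_MUTEX_SPIN_ON_OWNER, which I believe these RHEL 7 based kernels are:

       crash> foreach bt                        # all task stacks, as in foreach.bt.txt
       crash> bt 4504                           # ll_cfg_requeue, the suspected lock holder
       crash> struct config_llog_data.cld_logname,cld_lock <cld address>
       crash> struct mutex.owner <address of cld_lock>   # should point back at PID 4504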

      Attachments

        1. bt.a.txt (51 kB)
        2. foreach.bt.txt (571 kB)

            People

              Assignee: Mikhail Pershin (tappro)
              Reporter: Olaf Faaland (ofaaland)
              Votes: 0
              Watchers: 7
