Lustre / LU-9688

Stuck MDT in lod_qos_prep_create


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Affects Version: Lustre 2.9.0
    • Environment: 3.10.0-514.10.2.el7_lustre.x86_64, lustre-2.9.0_srcc6-1.el7.centos.x86_64

    Description

      Our MDT was stuck or barely usable twice in a row lately; the second time we took a crash dump, which shows several threads blocked in lod_qos_prep_create...

      PID: 291558  TASK: ffff88203c7b2f10  CPU: 9   COMMAND: "mdt01_030"
       #0 [ffff881a157f7588] __schedule at ffffffff8168b6a5
       #1 [ffff881a157f75f0] schedule at ffffffff8168bcf9
       #2 [ffff881a157f7600] rwsem_down_write_failed at ffffffff8168d4a5
       #3 [ffff881a157f7688] call_rwsem_down_write_failed at ffffffff81327067
       #4 [ffff881a157f76d0] down_write at ffffffff8168aebd
       #5 [ffff881a157f76e8] lod_qos_prep_create at ffffffffa124031c [lod]
       #6 [ffff881a157f77a8] lod_declare_striped_object at ffffffffa1239a8c [lod]
       #7 [ffff881a157f77f0] lod_declare_object_create at ffffffffa123b0f1 [lod]
       #8 [ffff881a157f7838] mdd_declare_object_create_internal at ffffffffa129d21f [mdd]
       #9 [ffff881a157f7880] mdd_declare_create at ffffffffa1294133 [mdd]
      #10 [ffff881a157f78f0] mdd_create at ffffffffa1295689 [mdd]
      #11 [ffff881a157f79e8] mdt_reint_open at ffffffffa1176f05 [mdt]
      #12 [ffff881a157f7ad8] mdt_reint_rec at ffffffffa116c4a0 [mdt]
      #13 [ffff881a157f7b00] mdt_reint_internal at ffffffffa114edc2 [mdt]
      #14 [ffff881a157f7b38] mdt_intent_reint at ffffffffa114f322 [mdt]
      #15 [ffff881a157f7b78] mdt_intent_policy at ffffffffa1159b9c [mdt]
      #16 [ffff881a157f7bd0] ldlm_lock_enqueue at ffffffffa0b461e7 [ptlrpc]
      #17 [ffff881a157f7c28] ldlm_handle_enqueue0 at ffffffffa0b6f3a3 [ptlrpc]
      #18 [ffff881a157f7cb8] tgt_enqueue at ffffffffa0befe12 [ptlrpc]
      #19 [ffff881a157f7cd8] tgt_request_handle at ffffffffa0bf4275 [ptlrpc]
      #20 [ffff881a157f7d20] ptlrpc_server_handle_request at ffffffffa0ba01fb [ptlrpc]
      #21 [ffff881a157f7de8] ptlrpc_main at ffffffffa0ba42b0 [ptlrpc]
      #22 [ffff881a157f7ec8] kthread at ffffffff810b06ff
      #23 [ffff881a157f7f50] ret_from_fork at ffffffff81696b98
      
      
      
      

      The disk array (from Dell) backing the MDT doesn't report any issue, and the load was not particularly high. In the crash dump, `kmem -i` reports 76 GB of free memory (60% of TOTAL MEM).

      Attaching the output of `foreach bt`; maybe somebody will have a clue.
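      For anyone digging into the attached dump, a couple of hypothetical one-liners for triaging the `foreach bt` output (the real file is oak-md1-s1.foreach_bt.txt from the attachments; a tiny inline sample is generated here so the commands run as-is):

      ```shell
      # Sketch only: in practice point BT at the attached oak-md1-s1.foreach_bt.txt.
      # A minimal two-stack sample is created below so the pipeline is self-contained.
      BT=sample_foreach_bt.txt
      cat > "$BT" <<'EOF'
      PID: 291558  TASK: ffff88203c7b2f10  CPU: 9   COMMAND: "mdt01_030"
       #2 [ffff881a157f7600] rwsem_down_write_failed at ffffffff8168d4a5
       #5 [ffff881a157f76e8] lod_qos_prep_create at ffffffffa124031c [lod]
      PID: 291559  TASK: ffff88203c7b3000  CPU: 3   COMMAND: "mdt01_031"
       #2 [ffff881a157f8600] rwsem_down_write_failed at ffffffff8168d4a5
       #5 [ffff881a157f86e8] lod_qos_prep_create at ffffffffa124031d [lod]
      EOF

      # How many threads are blocked taking a rw_semaphore for write?
      grep -c 'rwsem_down_write_failed' "$BT"    # -> 2 for this sample

      # Which service threads are parked in lod_qos_prep_create?
      # (-B pulls in the PID/COMMAND line above each matching frame)
      grep -B 2 'lod_qos_prep_create' "$BT" | grep -o 'COMMAND: "[^"]*"' | sort -u
      ```

      Against the real dump the `-B` context would need to be wider (the COMMAND line sits several frames above frame #5), but the same pattern gives a quick count of writers queued on the semaphore.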

       

      Each time, failing over the MDT resumed operations, but recovery was somewhat long, with a few evictions.

      Lustre: oak-MDT0000: Recovery over after 13:39, of 1144 clients 1134 recovered and 10 were evicted.
      
      

      Thanks!
      Stephane

      Attachments

        1. oak-io1-s1.lustre.log (1.33 MB)
        2. oak-io1-s2.lustre.log (44 kB)
        3. oak-md1-s1.foreach_bt.txt (429 kB)
        4. oak-md1-s1.lustre.log (274 kB)

          People

            Assignee: Niu Yawei (Inactive)
            Reporter: Stephane Thiell
