Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2419

mdt threads stuck in ldlm_expired_completion_wait

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.1.2
    • Lustre 2.1.2-3chaos (github.com/chaos/lustre)
    • 3
    • 5736

    Description

      One of our production MDS is in trouble, causing application hangs. It looks like CPU usage is low, but the node has mdt threads hanging for 800+ seconds before timeout. It is frequently printing backtraces like so:

      2012-11-30 16:32:02 Lustre: Service thread pid 4557 was inactive for 808.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging pur
      poses:
      2012-11-30 16:32:02 Lustre: Skipped 4 previous similar messages
      2012-11-30 16:32:02 Pid: 4557, comm: mdt_294
      2012-11-30 16:32:02 
      2012-11-30 16:32:02 Call Trace:
      2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa04913f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
      2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa048854e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2012-11-30 16:32:02  [<ffffffffa071fe6a>] ldlm_completion_ast+0x4da/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
      2012-11-30 16:32:02  [<ffffffffa071f706>] ldlm_cli_enqueue_local+0x1e6/0x470 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c5ae5f>] mdt_object_lock+0x28f/0x980 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
      2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c5b871>] mdt_object_find_lock+0x61/0x100 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c70fe2>] mdt_md_create+0x102/0x5a0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa03af96c>] ? lprocfs_counter_add+0x11c/0x190 [lvfs]
      2012-11-30 16:32:02  [<ffffffffa0c71598>] mdt_reint_create+0x118/0x5e0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c6f2d0>] mdt_reint_rec+0x40/0xb0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0740eb4>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c6a0c8>] mdt_reint_internal+0x4f8/0x770 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c6a384>] mdt_reint+0x44/0xc0 [mdt]
      2012-11-30 16:32:03  [<ffffffffa0c5e79d>] mdt_handle_common+0x73d/0x12c0 [mdt]
      2012-11-30 16:32:03  [<ffffffffa0740cc4>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa0c5f3f5>] mdt_regular_handle+0x15/0x20 [mdt]
      2012-11-30 16:32:03  [<ffffffffa074cd64>] ptlrpc_main+0xd24/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffff8100c14a>] child_rip+0xa/0x20
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      See attached file console.momus-mds1.txt for more of the console log, including backtraces from the processes on the system.

      Attachments

        1. console.momus-mds1.txt
          1.16 MB
          Christopher Morrone

        Issue Links

          Activity

            People

              bzzz Alex Zhuravlev
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: