Lustre / LU-2419

mdt threads stuck in ldlm_expired_completion_wait

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.1.2
    • Fix Version/s: None
    • Labels:
    • Environment:
      Lustre 2.1.2-3chaos (github.com/chaos/lustre)
    • Severity:
      3
    • Rank (Obsolete):
      5736

      Description

      One of our production MDS nodes is in trouble, causing application hangs. CPU usage appears low, but the node has mdt threads hanging for 800+ seconds before timing out. It frequently prints backtraces like this:

      2012-11-30 16:32:02 Lustre: Service thread pid 4557 was inactive for 808.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2012-11-30 16:32:02 Lustre: Skipped 4 previous similar messages
      2012-11-30 16:32:02 Pid: 4557, comm: mdt_294
      2012-11-30 16:32:02 
      2012-11-30 16:32:02 Call Trace:
      2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa04913f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
      2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa048854e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2012-11-30 16:32:02  [<ffffffffa071fe6a>] ldlm_completion_ast+0x4da/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
      2012-11-30 16:32:02  [<ffffffffa071f706>] ldlm_cli_enqueue_local+0x1e6/0x470 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c5ae5f>] mdt_object_lock+0x28f/0x980 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
      2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c5b871>] mdt_object_find_lock+0x61/0x100 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c70fe2>] mdt_md_create+0x102/0x5a0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa03af96c>] ? lprocfs_counter_add+0x11c/0x190 [lvfs]
      2012-11-30 16:32:02  [<ffffffffa0c71598>] mdt_reint_create+0x118/0x5e0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c6f2d0>] mdt_reint_rec+0x40/0xb0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0740eb4>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c6a0c8>] mdt_reint_internal+0x4f8/0x770 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c6a384>] mdt_reint+0x44/0xc0 [mdt]
      2012-11-30 16:32:03  [<ffffffffa0c5e79d>] mdt_handle_common+0x73d/0x12c0 [mdt]
      2012-11-30 16:32:03  [<ffffffffa0740cc4>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa0c5f3f5>] mdt_regular_handle+0x15/0x20 [mdt]
      2012-11-30 16:32:03  [<ffffffffa074cd64>] ptlrpc_main+0xd24/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffff8100c14a>] child_rip+0xa/0x20
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      See attached file console.momus-mds1.txt for more of the console log, including backtraces from the processes on the system.
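
      When going through a long console log such as the attached one, it can help to list which service threads the watchdog reported and for how long. The following is a minimal sketch, assuming the log lines follow the "Service thread pid ... was inactive for ...s" format quoted above. The function name find_inactive_threads and its min_seconds parameter are hypothetical and are not part of Lustre.

```python
import re

# Matches the watchdog message quoted in the description, e.g.
# "Lustre: Service thread pid 4557 was inactive for 808.00s."
INACTIVE_RE = re.compile(
    r"Service thread pid (?P<pid>\d+) was inactive for (?P<secs>\d+(?:\.\d+)?)s"
)


def find_inactive_threads(lines, min_seconds=0.0):
    """Return (pid, seconds) for each watchdog message at or above min_seconds.

    Hypothetical helper for log triage; it only reads text and makes no
    assumptions about Lustre beyond the message format shown above.
    """
    hits = []
    for line in lines:
        match = INACTIVE_RE.search(line)
        if match is None:
            continue
        seconds = float(match.group("secs"))
        if seconds >= min_seconds:
            hits.append((int(match.group("pid")), seconds))
    return hits


# Sample lines copied from the console output in this report.
sample = [
    "2012-11-30 16:32:02 Lustre: Service thread pid 4557 was inactive for 808.00s.",
    "2012-11-30 16:32:02 Lustre: Skipped 4 previous similar messages",
    "2012-11-30 16:32:02 Pid: 4557, comm: mdt_294",
]

print(find_inactive_threads(sample))
```

      To scan a whole file, pass the open file object instead of the sample list, for example find_inactive_threads(open("console.momus-mds1.txt"), min_seconds=600).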


              People

              • Assignee:
                bzzz Alex Zhuravlev
              • Reporter:
                morrone Christopher Morrone
              • Votes:
                0
              • Watchers:
                4
