Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2419

mdt threads stuck in ldlm_expired_completion_wait

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Cannot Reproduce
    • Icon: Major Major
    • None
    • Lustre 2.1.2
    • Lustre 2.1.2-3chaos (github.com/chaos/lustre)
    • 3
    • 5736

      One of our production MDS is in trouble, causing application hangs. It looks like CPU usage is low, but the node has mdt threads hanging for 800+ seconds before timeout. It is frequently printing backtraces like so:

      2012-11-30 16:32:02 Lustre: Service thread pid 4557 was inactive for 808.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging pur
      poses:
      2012-11-30 16:32:02 Lustre: Skipped 4 previous similar messages
      2012-11-30 16:32:02 Pid: 4557, comm: mdt_294
      2012-11-30 16:32:02 
      2012-11-30 16:32:02 Call Trace:
      2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa04913f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
      2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa048854e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2012-11-30 16:32:02  [<ffffffffa071fe6a>] ldlm_completion_ast+0x4da/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
      2012-11-30 16:32:02  [<ffffffffa071f706>] ldlm_cli_enqueue_local+0x1e6/0x470 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c5ae5f>] mdt_object_lock+0x28f/0x980 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
      2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c5b871>] mdt_object_find_lock+0x61/0x100 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c70fe2>] mdt_md_create+0x102/0x5a0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa03af96c>] ? lprocfs_counter_add+0x11c/0x190 [lvfs]
      2012-11-30 16:32:02  [<ffffffffa0c71598>] mdt_reint_create+0x118/0x5e0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c6f2d0>] mdt_reint_rec+0x40/0xb0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0740eb4>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c6a0c8>] mdt_reint_internal+0x4f8/0x770 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c6a384>] mdt_reint+0x44/0xc0 [mdt]
      2012-11-30 16:32:03  [<ffffffffa0c5e79d>] mdt_handle_common+0x73d/0x12c0 [mdt]
      2012-11-30 16:32:03  [<ffffffffa0740cc4>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa0c5f3f5>] mdt_regular_handle+0x15/0x20 [mdt]
      2012-11-30 16:32:03  [<ffffffffa074cd64>] ptlrpc_main+0xd24/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffff8100c14a>] child_rip+0xa/0x20
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      See attached file console.momus-mds1.txt for more of the console log, including backtraces from the processes on the system.

            bzzz Alex Zhuravlev
            morrone Christopher Morrone (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: