mdt threads stuck in ldlm_expired_completion_wait

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.1.2
    • Lustre 2.1.2-3chaos (github.com/chaos/lustre)
    • 3
    • 5736

    Description

      One of our production MDS nodes is in trouble, causing application hangs. CPU usage looks low, but the node has mdt threads hanging for 800+ seconds before timing out. It is frequently printing backtraces like this:

      2012-11-30 16:32:02 Lustre: Service thread pid 4557 was inactive for 808.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2012-11-30 16:32:02 Lustre: Skipped 4 previous similar messages
      2012-11-30 16:32:02 Pid: 4557, comm: mdt_294
      2012-11-30 16:32:02 
      2012-11-30 16:32:02 Call Trace:
      2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa04913f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
      2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa048854e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2012-11-30 16:32:02  [<ffffffffa071fe6a>] ldlm_completion_ast+0x4da/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
      2012-11-30 16:32:02  [<ffffffffa071f706>] ldlm_cli_enqueue_local+0x1e6/0x470 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c5ae5f>] mdt_object_lock+0x28f/0x980 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
      2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c5b871>] mdt_object_find_lock+0x61/0x100 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c70fe2>] mdt_md_create+0x102/0x5a0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa03af96c>] ? lprocfs_counter_add+0x11c/0x190 [lvfs]
      2012-11-30 16:32:02  [<ffffffffa0c71598>] mdt_reint_create+0x118/0x5e0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c6f2d0>] mdt_reint_rec+0x40/0xb0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0740eb4>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c6a0c8>] mdt_reint_internal+0x4f8/0x770 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c6a384>] mdt_reint+0x44/0xc0 [mdt]
      2012-11-30 16:32:03  [<ffffffffa0c5e79d>] mdt_handle_common+0x73d/0x12c0 [mdt]
      2012-11-30 16:32:03  [<ffffffffa0740cc4>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa0c5f3f5>] mdt_regular_handle+0x15/0x20 [mdt]
      2012-11-30 16:32:03  [<ffffffffa074cd64>] ptlrpc_main+0xd24/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffff8100c14a>] child_rip+0xa/0x20
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      See attached file console.momus-mds1.txt for more of the console log, including backtraces from the processes on the system.
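
      For triaging a console log like the one above, a small script can list which service threads the watchdog reported and which stack frames are worth reading first. The following is a minimal sketch, not a Lustre tool: the function names `inactive_threads` and `reliable_frames` are hypothetical, the regular expressions are assumed from the message format shown in this report, and the treatment of frames marked `?` as unconfirmed follows the usual Linux kernel stack-dump convention.

```python
import re

# Assumed format, taken from the watchdog message in this report:
# "Service thread pid 4557 was inactive for 808.00s."
INACTIVE_RE = re.compile(
    r"Service thread pid (?P<pid>\d+) was inactive for (?P<secs>\d+(?:\.\d+)?)s"
)

# Assumed frame format: "[<address>] [? ]symbol+0xoff/0xsize [module]".
# A leading "?" marks an address found on the stack that the unwinder
# could not confirm as part of the call chain.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (?P<maybe>\? )?(?P<sym>[\w.]+)\+0x")


def inactive_threads(lines):
    """Return a (pid, seconds) tuple for every watchdog message found."""
    found = []
    for line in lines:
        m = INACTIVE_RE.search(line)
        if m:
            found.append((int(m["pid"]), float(m["secs"])))
    return found


def reliable_frames(lines):
    """Return symbol names of frames that are not marked with '?'."""
    symbols = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m and not m["maybe"]:
            symbols.append(m["sym"])
    return symbols
```

      Run over the trace above, the confirmed frames read from `cfs_waitq_wait` and `ldlm_completion_ast` up through `mdt_object_lock` and `mdt_md_create`, which matches the summary of this issue: the thread is waiting for a local lock enqueue to complete during a create.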

          Activity

            [LU-2419] mdt threads stuck in ldlm_expired_completion_wait
            bzzz Alex Zhuravlev made changes -
            Resolution New: Cannot Reproduce [ 5 ]
            Status Original: Open [ 1 ] New: Closed [ 6 ]
            pjones Peter Jones made changes -
            End date New: 26/Jun/14
            Start date New: 30/Nov/12
            morrone Christopher Morrone (Inactive) made changes -
            Labels New: llnl
            utopiabound Nathaniel Clark made changes -
            Link New: This issue is related to LU-2944 [ LU-2944 ]
            utopiabound Nathaniel Clark made changes -
            Link New: This issue is duplicated by LU-4572 [ LU-4572 ]
            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Alex Zhuravlev [ bzzz ]
            morrone Christopher Morrone (Inactive) made changes -
            Summary Original: mdt threads stuck in New: mdt threads stuck in ldlm_expired_completion_wait
            morrone Christopher Morrone (Inactive) made changes -
            Summary Original: MDS New: mdt threads stuck in
            morrone Christopher Morrone (Inactive) created issue -

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 4
