Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2419

mdt threads stuck in ldlm_expired_completion_wait

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.1.2
    • Lustre 2.1.2-3chaos (github.com/chaos/lustre)
    • 3
    • 5736

    Description

      One of our production MDS is in trouble, causing application hangs. It looks like CPU usage is low, but the node has mdt threads hanging for 800+ seconds before timeout. It is frequently printing backtraces like so:

      2012-11-30 16:32:02 Lustre: Service thread pid 4557 was inactive for 808.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging pur
      poses:
      2012-11-30 16:32:02 Lustre: Skipped 4 previous similar messages
      2012-11-30 16:32:02 Pid: 4557, comm: mdt_294
      2012-11-30 16:32:02 
      2012-11-30 16:32:02 Call Trace:
      2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa04913f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
      2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa048854e>] cfs_waitq_wait+0xe/0x10 [libcfs]
      2012-11-30 16:32:02  [<ffffffffa071fe6a>] ldlm_completion_ast+0x4da/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
      2012-11-30 16:32:02  [<ffffffffa071f706>] ldlm_cli_enqueue_local+0x1e6/0x470 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c5ae5f>] mdt_object_lock+0x28f/0x980 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
      2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c5b871>] mdt_object_find_lock+0x61/0x100 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c70fe2>] mdt_md_create+0x102/0x5a0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa03af96c>] ? lprocfs_counter_add+0x11c/0x190 [lvfs]
      2012-11-30 16:32:02  [<ffffffffa0c71598>] mdt_reint_create+0x118/0x5e0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c6f2d0>] mdt_reint_rec+0x40/0xb0 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0740eb4>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
      2012-11-30 16:32:02  [<ffffffffa0c6a0c8>] mdt_reint_internal+0x4f8/0x770 [mdt]
      2012-11-30 16:32:02  [<ffffffffa0c6a384>] mdt_reint+0x44/0xc0 [mdt]
      2012-11-30 16:32:03  [<ffffffffa0c5e79d>] mdt_handle_common+0x73d/0x12c0 [mdt]
      2012-11-30 16:32:03  [<ffffffffa0740cc4>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa0c5f3f5>] mdt_regular_handle+0x15/0x20 [mdt]
      2012-11-30 16:32:03  [<ffffffffa074cd64>] ptlrpc_main+0xd24/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffff8100c14a>] child_rip+0xa/0x20
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
      2012-11-30 16:32:03  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      See attached file console.momus-mds1.txt for more of the console log, including backtraces from the processes on the system.

      Attachments

        Issue Links

          Activity

            [LU-2419] mdt threads stuck in ldlm_expired_completion_wait

            2012-11-30 14:38:46 LustreError: 4187:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1417917631054119 has mismatched opc: new 101 old 0
            2012-11-30 14:38:46 Lustre: All locks stolen from rs ffff880129196000 x1417917631054119.t281940978801 o0 NID 172.16.65.148@tcp
            2012-11-30 14:38:47 LustreError: 4303:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1417917631054855 has mismatched opc: new 101 old 0
            2012-11-30 14:38:47 Lustre: All locks stolen from rs ffff8801567d8000 x1417917631054855.t281940981086 o0 NID 172.16.65.148@tcp

            this does not look normal: resents from the same client (172.16.65.148@tcp) within 1s ? given mdc semaphore is serializing requests on the client side.

            the backtraces were result of ldlm lock enqueues taking too long.

            bzzz Alex Zhuravlev added a comment - 2012-11-30 14:38:46 LustreError: 4187:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1417917631054119 has mismatched opc: new 101 old 0 2012-11-30 14:38:46 Lustre: All locks stolen from rs ffff880129196000 x1417917631054119.t281940978801 o0 NID 172.16.65.148@tcp 2012-11-30 14:38:47 LustreError: 4303:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1417917631054855 has mismatched opc: new 101 old 0 2012-11-30 14:38:47 Lustre: All locks stolen from rs ffff8801567d8000 x1417917631054855.t281940981086 o0 NID 172.16.65.148@tcp this does not look normal: resents from the same client (172.16.65.148@tcp) within 1s ? given mdc semaphore is serializing requests on the client side. the backtraces were result of ldlm lock enqueues taking too long.
            pjones Peter Jones added a comment -

            Alex could you please assess this one?

            pjones Peter Jones added a comment - Alex could you please assess this one?

            People

              bzzz Alex Zhuravlev
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: