Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Major
- Labels: None
- Affects Version/s: Lustre 2.1.2
- Environment: Lustre 2.1.2-3chaos (github.com/chaos/lustre)
- Severity: 3
- Rank: 5736
Description
One of our production MDS nodes is in trouble, causing application hangs. CPU usage on the node is low, but its mdt threads are hanging for 800+ seconds before timing out. It is frequently printing backtraces like this:
2012-11-30 16:32:02 Lustre: Service thread pid 4557 was inactive for 808.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2012-11-30 16:32:02 Lustre: Skipped 4 previous similar messages
2012-11-30 16:32:02 Pid: 4557, comm: mdt_294
2012-11-30 16:32:02
2012-11-30 16:32:02 Call Trace:
2012-11-30 16:32:02 [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa04913f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
2012-11-30 16:32:02 [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa048854e>] cfs_waitq_wait+0xe/0x10 [libcfs]
2012-11-30 16:32:02 [<ffffffffa071fe6a>] ldlm_completion_ast+0x4da/0x690 [ptlrpc]
2012-11-30 16:32:02 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
2012-11-30 16:32:02 [<ffffffffa071f706>] ldlm_cli_enqueue_local+0x1e6/0x470 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c5ae5f>] mdt_object_lock+0x28f/0x980 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
2012-11-30 16:32:02 [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa0c5b871>] mdt_object_find_lock+0x61/0x100 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c70fe2>] mdt_md_create+0x102/0x5a0 [mdt]
2012-11-30 16:32:02 [<ffffffffa03af96c>] ? lprocfs_counter_add+0x11c/0x190 [lvfs]
2012-11-30 16:32:02 [<ffffffffa0c71598>] mdt_reint_create+0x118/0x5e0 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c6f2d0>] mdt_reint_rec+0x40/0xb0 [mdt]
2012-11-30 16:32:02 [<ffffffffa0740eb4>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa0c6a0c8>] mdt_reint_internal+0x4f8/0x770 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c6a384>] mdt_reint+0x44/0xc0 [mdt]
2012-11-30 16:32:03 [<ffffffffa0c5e79d>] mdt_handle_common+0x73d/0x12c0 [mdt]
2012-11-30 16:32:03 [<ffffffffa0740cc4>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
2012-11-30 16:32:03 [<ffffffffa0c5f3f5>] mdt_regular_handle+0x15/0x20 [mdt]
2012-11-30 16:32:03 [<ffffffffa074cd64>] ptlrpc_main+0xd24/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-11-30 16:32:03 [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffff8100c140>] ? child_rip+0x0/0x20
See attached file console.momus-mds1.txt for more of the console log, including backtraces from the processes on the system.
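For context, the "Service thread pid ... was inactive" lines come from a watchdog that notices a request-handling thread has made no progress for a long time and dumps its stack; in the trace above the thread is parked in ldlm_completion_ast() waiting for a lock grant. The following is only a minimal userspace sketch of that watchdog pattern (hypothetical names, a 5 s threshold instead of 808 s), not Lustre code:

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define WATCHDOG_TIMEOUT 5   /* seconds; the real watchdog fired at 808 s here */

struct worker {
    pthread_mutex_t lock;
    pthread_cond_t  granted;     /* signalled when the "lock" is finally granted */
    int             done;
    time_t          last_active; /* last time the worker made progress */
};

static struct worker w = {
    .lock    = PTHREAD_MUTEX_INITIALIZER,
    .granted = PTHREAD_COND_INITIALIZER,
};

/* Worker: stands in for an mdt_NNN thread blocked waiting for a lock grant
 * that never arrives. */
static void *worker_fn(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&w.lock);
    while (!w.done)
        pthread_cond_wait(&w.granted, &w.lock);
    pthread_mutex_unlock(&w.lock);
    return NULL;
}

/* Watchdog: stands in for the mechanism that prints
 * "Service thread pid N was inactive for X seconds" and dumps the stack. */
static void *watchdog_fn(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(1);
        pthread_mutex_lock(&w.lock);
        long idle = (long)(time(NULL) - w.last_active);
        pthread_mutex_unlock(&w.lock);
        if (idle >= WATCHDOG_TIMEOUT) {
            printf("watchdog: worker inactive for %lds, would dump stack here\n", idle);
            return NULL;
        }
    }
}

int main(void)
{
    pthread_t worker, watchdog;

    w.last_active = time(NULL);
    pthread_create(&worker, NULL, worker_fn, NULL);
    pthread_create(&watchdog, NULL, watchdog_fn, NULL);
    pthread_join(watchdog, NULL);   /* fires because the grant never comes */

    /* Signal the worker so the example exits cleanly. */
    pthread_mutex_lock(&w.lock);
    w.done = 1;
    pthread_cond_signal(&w.granted);
    pthread_mutex_unlock(&w.lock);
    pthread_join(worker, NULL);
    return 0;
}

The console log also shows repeated "mismatched opc" messages: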
2012-11-30 14:38:46 LustreError: 4187:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1417917631054119 has mismatched opc: new 101 old 0
2012-11-30 14:38:46 Lustre: All locks stolen from rs ffff880129196000 x1417917631054119.t281940978801 o0 NID 172.16.65.148@tcp
2012-11-30 14:38:47 LustreError: 4303:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1417917631054855 has mismatched opc: new 101 old 0
2012-11-30 14:38:47 Lustre: All locks stolen from rs ffff8801567d8000 x1417917631054855.t281940981086 o0 NID 172.16.65.148@tcp
This does not look normal: why are there resends from the same client (172.16.65.148@tcp) within one second, given that the MDC semaphore serializes requests on the client side?
The backtraces above were the result of LDLM lock enqueues taking too long.
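Judging from the message text, mdt_steal_ack_locks() looks up the reply state saved for the resent request's xid and warns when that saved state's opcode ("old 0" here) does not match the opcode of the resend ("new 101"). Below is a hypothetical, much-simplified sketch of that xid-lookup-plus-opcode check, not the actual Lustre code; the struct and function names are invented for illustration:

#include <stdint.h>
#include <stdio.h>

/* A saved reply state, reduced to the two fields the check needs. */
struct reply_state {
    uint64_t xid;   /* transfer id of the original request */
    uint32_t opc;   /* opcode the saved reply was generated for */
};

/* Find the reply state saved for a resent request by xid and warn when its
 * opcode disagrees with the opcode of the resend. */
static struct reply_state *find_resend_reply(struct reply_state *saved, int nr,
                                             uint64_t xid, uint32_t new_opc)
{
    for (int i = 0; i < nr; i++) {
        if (saved[i].xid != xid)
            continue;
        if (saved[i].opc != new_opc)
            fprintf(stderr,
                    "Resent req xid %llu has mismatched opc: new %u old %u\n",
                    (unsigned long long)xid, new_opc, saved[i].opc);
        return &saved[i];
    }
    return NULL;
}

int main(void)
{
    /* Reproduce the situation from the log: the saved state carries opc 0. */
    struct reply_state saved[] = {
        { .xid = 1417917631054119ULL, .opc = 0 },
    };

    find_resend_reply(saved, 1, 1417917631054119ULL, 101 /* "new" opc from the log */);
    return 0;
}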