Details
Bug
Resolution: Cannot Reproduce
Major
None
Lustre 2.1.2
Lustre 2.1.2-3chaos (github.com/chaos/lustre)
3
5736
Description
One of our production MDS nodes is in trouble and is causing application hangs. CPU usage appears low, but mdt service threads on the node hang for 800+ seconds before timing out. The node frequently prints backtraces like this:
2012-11-30 16:32:02 Lustre: Service thread pid 4557 was inactive for 808.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2012-11-30 16:32:02 Lustre: Skipped 4 previous similar messages
2012-11-30 16:32:02 Pid: 4557, comm: mdt_294
2012-11-30 16:32:02
2012-11-30 16:32:02 Call Trace:
2012-11-30 16:32:02 [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa04913f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
2012-11-30 16:32:02 [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa048854e>] cfs_waitq_wait+0xe/0x10 [libcfs]
2012-11-30 16:32:02 [<ffffffffa071fe6a>] ldlm_completion_ast+0x4da/0x690 [ptlrpc]
2012-11-30 16:32:02 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
2012-11-30 16:32:02 [<ffffffffa071f706>] ldlm_cli_enqueue_local+0x1e6/0x470 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c5ae5f>] mdt_object_lock+0x28f/0x980 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
2012-11-30 16:32:02 [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa0c5b871>] mdt_object_find_lock+0x61/0x100 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c70fe2>] mdt_md_create+0x102/0x5a0 [mdt]
2012-11-30 16:32:02 [<ffffffffa03af96c>] ? lprocfs_counter_add+0x11c/0x190 [lvfs]
2012-11-30 16:32:02 [<ffffffffa0c71598>] mdt_reint_create+0x118/0x5e0 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c6f2d0>] mdt_reint_rec+0x40/0xb0 [mdt]
2012-11-30 16:32:02 [<ffffffffa0740eb4>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa0c6a0c8>] mdt_reint_internal+0x4f8/0x770 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c6a384>] mdt_reint+0x44/0xc0 [mdt]
2012-11-30 16:32:03 [<ffffffffa0c5e79d>] mdt_handle_common+0x73d/0x12c0 [mdt]
2012-11-30 16:32:03 [<ffffffffa0740cc4>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
2012-11-30 16:32:03 [<ffffffffa0c5f3f5>] mdt_regular_handle+0x15/0x20 [mdt]
2012-11-30 16:32:03 [<ffffffffa074cd64>] ptlrpc_main+0xd24/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-11-30 16:32:03 [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffff8100c140>] ? child_rip+0x0/0x20
See attached file console.momus-mds1.txt for more of the console log, including backtraces from the processes on the system.
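For anyone triaging the attached console log, a rough sketch like the following could help count how many distinct service threads reported as inactive and for how long. It assumes only the watchdog message format quoted above ("Service thread pid N was inactive for X s"). The function name and the script invocation are hypothetical and not part of Lustre or any tooling mentioned in this ticket.

```python
import re
import sys
from collections import defaultdict

# Matches the watchdog message format quoted in the description above.
INACTIVE_RE = re.compile(
    r"Service thread pid (?P<pid>\d+) was inactive for (?P<secs>\d+(?:\.\d+)?)s"
)


def summarize_inactive_threads(lines):
    """Return {pid: [inactive_seconds, ...]} for each watchdog message found."""
    stalls = defaultdict(list)
    for line in lines:
        match = INACTIVE_RE.search(line)
        if match:
            stalls[int(match.group("pid"))].append(float(match.group("secs")))
    return dict(stalls)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python inactive_threads.py console.momus-mds1.txt
    with open(sys.argv[1], errors="replace") as log:
        report = summarize_inactive_threads(log)
    for pid, secs in sorted(report.items()):
        print(f"pid {pid}: {len(secs)} report(s), longest {max(secs):.2f}s")
```

Note that the console also prints "Skipped N previous similar messages", so counts obtained this way are a lower bound on the number of stalls rather than an exact total.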