[LU-2419] mdt threads stuck in ldlm_expired_completion_wait Created: 30/Nov/12 Updated: 16/Apr/20 Resolved: 16/Apr/20
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Alex Zhuravlev |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl |
| Environment: | Lustre 2.1.2-3chaos (github.com/chaos/lustre) |
| Attachments: | console.momus-mds1.txt |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 5736 |
| Description |
One of our production MDS nodes is in trouble, causing application hangs. CPU usage is low, but the node has mdt threads hanging for 800+ seconds before timing out. It is frequently printing backtraces like so:

2012-11-30 16:32:02 Lustre: Service thread pid 4557 was inactive for 808.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2012-11-30 16:32:02 Lustre: Skipped 4 previous similar messages
2012-11-30 16:32:02 Pid: 4557, comm: mdt_294
2012-11-30 16:32:02
2012-11-30 16:32:02 Call Trace:
2012-11-30 16:32:02 [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa04913f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
2012-11-30 16:32:02 [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa048854e>] cfs_waitq_wait+0xe/0x10 [libcfs]
2012-11-30 16:32:02 [<ffffffffa071fe6a>] ldlm_completion_ast+0x4da/0x690 [ptlrpc]
2012-11-30 16:32:02 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
2012-11-30 16:32:02 [<ffffffffa071f706>] ldlm_cli_enqueue_local+0x1e6/0x470 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c5ae5f>] mdt_object_lock+0x28f/0x980 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
2012-11-30 16:32:02 [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa0c5b871>] mdt_object_find_lock+0x61/0x100 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c70fe2>] mdt_md_create+0x102/0x5a0 [mdt]
2012-11-30 16:32:02 [<ffffffffa03af96c>] ? lprocfs_counter_add+0x11c/0x190 [lvfs]
2012-11-30 16:32:02 [<ffffffffa0c71598>] mdt_reint_create+0x118/0x5e0 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c6f2d0>] mdt_reint_rec+0x40/0xb0 [mdt]
2012-11-30 16:32:02 [<ffffffffa0740eb4>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
2012-11-30 16:32:02 [<ffffffffa0c6a0c8>] mdt_reint_internal+0x4f8/0x770 [mdt]
2012-11-30 16:32:02 [<ffffffffa0c6a384>] mdt_reint+0x44/0xc0 [mdt]
2012-11-30 16:32:03 [<ffffffffa0c5e79d>] mdt_handle_common+0x73d/0x12c0 [mdt]
2012-11-30 16:32:03 [<ffffffffa0740cc4>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
2012-11-30 16:32:03 [<ffffffffa0c5f3f5>] mdt_regular_handle+0x15/0x20 [mdt]
2012-11-30 16:32:03 [<ffffffffa074cd64>] ptlrpc_main+0xd24/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-11-30 16:32:03 [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03 [<ffffffff8100c140>] ? child_rip+0x0/0x20

See the attached file console.momus-mds1.txt for more of the console log, including backtraces from the processes on the system.
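For readers unfamiliar with this code path: the trace shows the thread blocked in ldlm_completion_ast() waiting for the enqueue started by ldlm_cli_enqueue_local() to be granted; each time the wait times out, ldlm_expired_completion_wait() appears to log the stall and the thread goes back to waiting. Below is a minimal userspace sketch of that wait-loop pattern, using pthreads; it is an illustration only, not Lustre source, and all demo_* names are hypothetical:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

struct demo_lock {
    pthread_mutex_t mtx;
    pthread_cond_t  granted_cv;
    bool            granted;
};

/* Stand-in for an expired-wait handler: it only reports the stall,
 * analogous to the role ldlm_expired_completion_wait() plays here. */
static void demo_expired_completion_wait(void)
{
    fprintf(stderr, "lock enqueue still not granted; continuing to wait\n");
}

/* Stand-in for the completion AST: block until the lock is granted,
 * reporting every timeout_sec seconds that we are still stuck.  The
 * loop never gives up, which is why a thread in this state can sit
 * here for 800+ seconds. */
static int demo_completion_ast(struct demo_lock *lk, int timeout_sec)
{
    pthread_mutex_lock(&lk->mtx);
    while (!lk->granted) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += timeout_sec;
        if (pthread_cond_timedwait(&lk->granted_cv, &lk->mtx,
                                   &deadline) != 0 && !lk->granted)
            demo_expired_completion_wait();   /* timed out, still blocked */
    }
    pthread_mutex_unlock(&lk->mtx);
    return 0;
}

/* Simulate a peer that takes a while to release/grant the lock. */
static void *demo_granter(void *arg)
{
    struct demo_lock *lk = arg;
    sleep(3);
    pthread_mutex_lock(&lk->mtx);
    lk->granted = true;
    pthread_cond_signal(&lk->granted_cv);
    pthread_mutex_unlock(&lk->mtx);
    return NULL;
}

int main(void)
{
    struct demo_lock lk = { PTHREAD_MUTEX_INITIALIZER,
                            PTHREAD_COND_INITIALIZER, false };
    pthread_t t;
    pthread_create(&t, NULL, demo_granter, &lk);
    demo_completion_ast(&lk, 1);   /* logs a few times, then returns */
    pthread_join(t, NULL);
    return 0;
}
```

If the grant never arrives (e.g. the lock holder is stuck or unresponsive), the wait loop above never exits, matching the "inactive for 808.00s" watchdog warnings in the console log.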
| Comments |
| Comment by Peter Jones [ 01/Dec/12 ] |
Alex, could you please assess this one?
| Comment by Alex Zhuravlev [ 06/Dec/12 ] |
2012-11-30 14:38:46 LustreError: 4187:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1417917631054119 has mismatched opc: new 101 old 0

This does not look normal: resends from the same client (172.16.65.148@tcp) within 1s, given that the mdc semaphore serializes requests on the client side? The backtraces were the result of ldlm lock enqueues taking too long.
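For context, the message comes from the resent-request handling on the MDT: a resend reuses its original xid, so the saved-reply state the server finds under that xid should carry the same opcode as the resend. A minimal sketch of that consistency check follows; it assumes the shape of the logic rather than quoting mdt_recovery.c, the demo_* names are hypothetical, and opcode 101 is LDLM_ENQUEUE on the Lustre wire protocol:

```c
#include <stdint.h>
#include <stdio.h>

#define DEMO_LDLM_ENQUEUE 101U   /* LDLM_ENQUEUE wire opcode */

/* Hypothetical saved-reply record, keyed by the xid of the request
 * the reply answered. */
struct demo_saved_reply {
    uint64_t xid;
    uint32_t opc;   /* opcode recorded when the reply was saved */
};

/* Returns 0 if the saved state plausibly matches the resend, nonzero
 * (with a log line in the style of the console message) on a mismatch.
 * "new 101 old 0" would mean the resend is an LDLM_ENQUEUE but the
 * slot found under its xid recorded opcode 0, i.e. state that never
 * belonged to this request. */
static int demo_check_resent(const struct demo_saved_reply *old,
                             uint64_t new_xid, uint32_t new_opc)
{
    if (old->xid == new_xid && old->opc != new_opc) {
        fprintf(stderr,
                "Resent req xid %llu has mismatched opc: new %u old %u\n",
                (unsigned long long)new_xid, new_opc, old->opc);
        return -1;   /* treat as a bogus resend, do not reuse the reply */
    }
    return 0;
}

int main(void)
{
    /* Reproduce the reported case: same xid, mismatched opcode. */
    struct demo_saved_reply old = { 1417917631054119ULL, 0 };
    return demo_check_resent(&old, 1417917631054119ULL,
                             DEMO_LDLM_ENQUEUE) ? 1 : 0;
}
```

Under that reading, the oddity Alex flags is not the check itself but that a serialized client should not be producing back-to-back resends that land on stale or empty reply state within one second.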