[LU-2419] mdt threads stuck in ldlm_expired_completion_wait Created: 30/Nov/12  Updated: 16/Apr/20  Resolved: 16/Apr/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Alex Zhuravlev
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

Lustre 2.1.2-3chaos (github.com/chaos/lustre)


Attachments: Text File console.momus-mds1.txt    
Issue Links:
Duplicate
is duplicated by LU-4572 hung mdt threads Resolved
Related
is related to LU-2944 Client evictions - watchdog timeouts ... Resolved
Severity: 3
Rank (Obsolete): 5736

 Description   

One of our production MDS nodes is in trouble, causing application hangs. CPU usage on the node is low, but mdt service threads are hanging for 800+ seconds before timing out. The console is frequently printing backtraces like the following:

2012-11-30 16:32:02 Lustre: Service thread pid 4557 was inactive for 808.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2012-11-30 16:32:02 Lustre: Skipped 4 previous similar messages
2012-11-30 16:32:02 Pid: 4557, comm: mdt_294
2012-11-30 16:32:02 
2012-11-30 16:32:02 Call Trace:
2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
2012-11-30 16:32:02  [<ffffffffa04913f1>] ? libcfs_debug_vmsg1+0x41/0x50 [libcfs]
2012-11-30 16:32:02  [<ffffffffa071c590>] ? ldlm_expired_completion_wait+0x0/0x270 [ptlrpc]
2012-11-30 16:32:02  [<ffffffffa048854e>] cfs_waitq_wait+0xe/0x10 [libcfs]
2012-11-30 16:32:02  [<ffffffffa071fe6a>] ldlm_completion_ast+0x4da/0x690 [ptlrpc]
2012-11-30 16:32:02  [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
2012-11-30 16:32:02  [<ffffffffa071f706>] ldlm_cli_enqueue_local+0x1e6/0x470 [ptlrpc]
2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
2012-11-30 16:32:02  [<ffffffffa0c5ae5f>] mdt_object_lock+0x28f/0x980 [mdt]
2012-11-30 16:32:02  [<ffffffffa0c59180>] ? mdt_blocking_ast+0x0/0x230 [mdt]
2012-11-30 16:32:02  [<ffffffffa071f990>] ? ldlm_completion_ast+0x0/0x690 [ptlrpc]
2012-11-30 16:32:02  [<ffffffffa0c5b871>] mdt_object_find_lock+0x61/0x100 [mdt]
2012-11-30 16:32:02  [<ffffffffa0c70fe2>] mdt_md_create+0x102/0x5a0 [mdt]
2012-11-30 16:32:02  [<ffffffffa03af96c>] ? lprocfs_counter_add+0x11c/0x190 [lvfs]
2012-11-30 16:32:02  [<ffffffffa0c71598>] mdt_reint_create+0x118/0x5e0 [mdt]
2012-11-30 16:32:02  [<ffffffffa0c6f2d0>] mdt_reint_rec+0x40/0xb0 [mdt]
2012-11-30 16:32:02  [<ffffffffa0740eb4>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
2012-11-30 16:32:02  [<ffffffffa0c6a0c8>] mdt_reint_internal+0x4f8/0x770 [mdt]
2012-11-30 16:32:02  [<ffffffffa0c6a384>] mdt_reint+0x44/0xc0 [mdt]
2012-11-30 16:32:03  [<ffffffffa0c5e79d>] mdt_handle_common+0x73d/0x12c0 [mdt]
2012-11-30 16:32:03  [<ffffffffa0740cc4>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
2012-11-30 16:32:03  [<ffffffffa0c5f3f5>] mdt_regular_handle+0x15/0x20 [mdt]
2012-11-30 16:32:03  [<ffffffffa074cd64>] ptlrpc_main+0xd24/0x1740 [ptlrpc]
2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03  [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03  [<ffffffffa074c040>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
2012-11-30 16:32:03  [<ffffffff8100c140>] ? child_rip+0x0/0x20

See attached file console.momus-mds1.txt for more of the console log, including backtraces from the processes on the system.
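
For context on the watchdog message above, the following is a minimal, self-contained sketch (plain pthreads, not Lustre's actual ptlrpc/ldlm code; all names are illustrative) of the pattern: the service thread sleeps waiting for a lock grant that never arrives, so it consumes no CPU, while a separate watchdog notices the inactivity and prints a "might be hung, or it might only be slow" style warning.

#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define WATCHDOG_TIMEOUT_SEC 3   /* Lustre's real timeout is hundreds of seconds */

static pthread_mutex_t lock_mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  lock_cond = PTHREAD_COND_INITIALIZER;
static bool   lock_granted  = false;  /* never set: simulates a stuck lock enqueue */
static time_t last_activity;          /* last time the service thread made progress */

/* "Service thread": blocks until the lock is granted, consuming no CPU. */
static void *service_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_mtx);
    while (!lock_granted)
        pthread_cond_wait(&lock_cond, &lock_mtx);
    pthread_mutex_unlock(&lock_mtx);
    return NULL;
}

/* "Watchdog": checks once per second how long the service thread has been idle. */
static void *watchdog_thread(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(1);
        long idle = (long)(time(NULL) - last_activity);  /* racy read, fine for a demo */
        if (idle >= WATCHDOG_TIMEOUT_SEC) {
            fprintf(stderr, "service thread inactive for %lds; "
                            "it might be hung, or it might only be slow\n", idle);
            return NULL;
        }
    }
}

int main(void)
{
    pthread_t svc, wd;

    last_activity = time(NULL);
    pthread_create(&svc, NULL, service_thread, NULL);
    pthread_create(&wd, NULL, watchdog_thread, NULL);
    pthread_join(wd, NULL);   /* exit once the watchdog has reported the hang */
    return 0;
}

Build with "gcc -pthread"; after about three seconds it prints the inactivity warning, mirroring how the mdt threads in the trace above sit idle in ldlm_completion_ast while the watchdog dumps their stacks.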



 Comments   
Comment by Peter Jones [ 01/Dec/12 ]

Alex, could you please assess this one?

Comment by Alex Zhuravlev [ 06/Dec/12 ]

2012-11-30 14:38:46 LustreError: 4187:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1417917631054119 has mismatched opc: new 101 old 0
2012-11-30 14:38:46 Lustre: All locks stolen from rs ffff880129196000 x1417917631054119.t281940978801 o0 NID 172.16.65.148@tcp
2012-11-30 14:38:47 LustreError: 4303:0:(mdt_recovery.c:1011:mdt_steal_ack_locks()) Resent req xid 1417917631054855 has mismatched opc: new 101 old 0
2012-11-30 14:38:47 Lustre: All locks stolen from rs ffff8801567d8000 x1417917631054855.t281940981086 o0 NID 172.16.65.148@tcp

This does not look normal: why are there resends from the same client (172.16.65.148@tcp) within one second, given that the mdc semaphore serializes requests on the client side?

The backtraces are the result of ldlm lock enqueues taking too long.
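
To make the client-side serialization point concrete, here is a minimal userspace sketch (plain pthreads, not the real mdc code, which uses a semaphore inside the client; names are illustrative): each metadata-modifying request holds the lock across the full send/wait-for-reply cycle, so a single client's requests and resends are naturally spaced by at least the server's reply latency, which is why two resends within one second look suspicious.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t mdc_rpc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for one metadata-modifying request/reply cycle. */
static void send_modifying_rpc(int xid)
{
    pthread_mutex_lock(&mdc_rpc_lock);   /* held across send and wait-for-reply */
    printf("sending request xid=%d\n", xid);
    sleep(1);                            /* stand-in for the server's reply latency */
    printf("reply for xid=%d\n", xid);
    pthread_mutex_unlock(&mdc_rpc_lock);
}

static void *worker(void *arg)
{
    send_modifying_rpc(*(int *)arg);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int a = 1, b = 2;

    /* Two concurrent callers still go strictly one after the other. */
    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}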
