Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Severity: 3
- Environment: zfs-0.7, kernel-3.10.0-1160.4.1.1chaos.ch6.x86_64, lustre-2.12.5_10.llnl-3.ch6.x86_64
Description
At some time on the morning of Jan 27th, we got reports of directory listings ("ls") hanging, where the directories were on MDT5 and MDT7.
The console logs of MDT5 and MDT7 both reported repeated watchdog dumps, all with very similar stacks. The first one on MDT7 appeared on Thu Jan 21 12:01:33 2021:
# From zinc8 dmesg log
LNet: Service thread pid 25241 was inactive for 200.13s. The thread might be hung, or it might only be slow and will resume later. D
Pid: 25241, comm: mdt01_049 3.10.0-1160.4.1.1chaos.ch6.x86_64 #1 SMP Fri Oct 9 17:56:20 PDT 2020
Call Trace:
[<ffffffffc141a460>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
[<ffffffffc141be2f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
[<ffffffffc141ef3e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
[<ffffffffc1982342>] osp_md_object_lock+0x162/0x2d0 [osp]
[<ffffffffc1895194>] lod_object_lock+0xf4/0x780 [lod]
[<ffffffffc1916ace>] mdd_object_lock+0x3e/0xe0 [mdd]
[<ffffffffc17ae681>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
[<ffffffffc17aec1a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[<ffffffffc17c407e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
[<ffffffffc17c6400>] mdt_reint_rename+0x2c0/0x2900 [mdt]
[<ffffffffc17cf113>] mdt_reint_rec+0x83/0x210 [mdt]
[<ffffffffc17ab303>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[<ffffffffc17b6b37>] mdt_reint+0x67/0x140 [mdt]
[<ffffffffc14b8b1a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[<ffffffffc145d80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[<ffffffffc1461bfd>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
[<ffffffffadecafc1>] kthread+0xd1/0xe0
[<ffffffffae5c1ff7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
LustreError: dumping log to /tmp/lustre-log.1611259694.25241
The remaining dumps took a different path within ldlm_cli_enqueue:
ptlrpc_set_wait+0x4d8/0x800 [ptlrpc]
ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
ldlm_cli_enqueue+0x3d2/0x920 [ptlrpc]
osp_md_object_lock+0x162/0x2d0 [osp]
lod_object_lock+0xf4/0x780 [lod]
mdd_object_lock+0x3e/0xe0 [mdd]
mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
mdt_remote_object_lock+0x2a/0x30 [mdt]
mdt_rename_lock+0xbe/0x4b0 [mdt]
mdt_reint_rename+0x2c0/0x2900 [mdt]
mdt_reint_rec+0x83/0x210 [mdt]
mdt_reint_internal+0x6e3/0xaf0 [mdt]
mdt_reint+0x67/0x140 [mdt]
tgt_request_handle+0xada/0x1570 [ptlrpc]
ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
ptlrpc_main+0xc4d/0x2280 [ptlrpc]
kthread+0xd1/0xe0
ret_from_fork_nospec_end+0x0/0x39
0xffffffffffffffff
I can provide MDT7 debug logs and an ldlm namespace dump, and I have a core dump from MDT5, as well as dmesg logs for both MDTs.
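For reference, a minimal sketch of how such dumps are typically collected on the affected MDS node (e.g. zinc8) is below; the ldlm.dump_namespaces parameter name is an assumption based on standard lctl tooling and may differ between Lustre releases.

# Write LDLM namespace/lock state into the kernel debug buffer (assumed parameter name)
lctl set_param ldlm.dump_namespaces=1
# Flush the kernel debug buffer to a file
lctl dk /tmp/ldlm-namespaces.$(hostname).log
# Save the console ring buffer, which contains the watchdog stack dumps
dmesg > /tmp/dmesg.$(hostname).log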
Attachments
Issue Links
- is related to: LU-14378 ldlm_resource_complain()) MGC172.19.3.1@o2ib600: namespace resource [0x68736c:0x2:0x0].0x0 (ffff972b9abea0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. (Closed)