Details
- Type: Bug
- Resolution: Unresolved
- Priority: Blocker
- Fix Version/s: None
- Affects Version/s: Lustre 2.12.0
- Labels: None
- Environment: CentOS 7.6, Lustre 2.12.0, 3.10.0-957.1.3.el7_lustre.x86_64
- Severity: 3
Description
We upgraded our cluster Sherlock to Lustre 2.12 and put Fir (Lustre 2.12 servers) into production yesterday, but this morning the filesystem is unusable due to a very high load of ldlm threads on the MDS servers.
I can see plenty of these on the MDS:
[Wed Feb 6 09:19:25 2019][1695265.110496] LNet: Service thread pid 35530 was inactive for 350.85s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Wed Feb 6 09:19:25 2019][1695265.127609] Pid: 35530, comm: mdt02_024 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[Wed Feb 6 09:19:25 2019][1695265.137522] Call Trace:
[Wed Feb 6 09:19:25 2019][1695265.140184] [<ffffffffc0e3a0bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[Wed Feb 6 09:19:25 2019][1695265.147339] [<ffffffffc0e3adcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
[Wed Feb 6 09:19:25 2019][1695265.154723] [<ffffffffc15164ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[Wed Feb 6 09:19:25 2019][1695265.161731] [<ffffffffc1516b30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
[Wed Feb 6 09:19:25 2019][1695265.168906] [<ffffffffc1517d1a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
[Wed Feb 6 09:19:25 2019][1695265.176002] [<ffffffffc151fbb5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[Wed Feb 6 09:19:25 2019][1695265.182750] [<ffffffffc151ca18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
[Wed Feb 6 09:19:25 2019][1695265.189403] [<ffffffffc0e20ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[Wed Feb 6 09:19:25 2019][1695265.196349] [<ffffffffc0e498a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[Wed Feb 6 09:19:25 2019][1695265.203643] [<ffffffffc0ed0302>] tgt_enqueue+0x62/0x210 [ptlrpc]
[Wed Feb 6 09:19:25 2019][1695265.210001] [<ffffffffc0ed735a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[Wed Feb 6 09:19:25 2019][1695265.217133] [<ffffffffc0e7b92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[Wed Feb 6 09:19:25 2019][1695265.225045] [<ffffffffc0e7f25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[Wed Feb 6 09:19:25 2019][1695265.231563] [<ffffffffadcc1c31>] kthread+0xd1/0xe0
[Wed Feb 6 09:19:25 2019][1695265.236657] [<ffffffffae374c24>] ret_from_fork_nospec_begin+0xe/0x21
[Wed Feb 6 09:19:25 2019][1695265.243306] [<ffffffffffffffff>] 0xffffffffffffffff
[Wed Feb 6 09:19:25 2019][1695265.248518] LustreError: dumping log to /tmp/lustre-log.1549473603.35530
On Fir, we have two Lustre 2.12 MDS servers, fir-md1-s1 and fir-md1-s2, each with 2 MDTs. I dumped the current tasks using sysrq to the console and I'm attaching the full console logs for both MDS servers. The servers don't crash, but the clients become unusable/blocked; from time to time we can still access the filesystem. Any help would be appreciated. Thanks!
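For reference, a task dump like the one attached can be triggered through the kernel SysRq interface (writing 't' to /proc/sysrq-trigger asks the kernel to dump all task stacks to the console). Below is a minimal sketch, assuming root on the MDS and that SysRq is enabled; the dump_tasks helper name is only illustrative:

# Minimal sketch: ask the kernel to dump all task stacks to the console/dmesg,
# equivalent to "echo t > /proc/sysrq-trigger". Requires root and SysRq enabled.
def dump_tasks():
    # 't' is the SysRq command that dumps current tasks and their stack traces
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("t")

if __name__ == "__main__":
    dump_tasks()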
Stephane
Attachments
Issue Links
- duplicates LU-11888 Unreachable client NID confusing Lustre 2.12 (Open)