  Lustre / LU-11936

High ldlm load, slow/unusable filesystem

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Affects Version/s: Lustre 2.12.0
    • Environment: CentOS 7.6, Lustre 2.12.0, 3.10.0-957.1.3.el7_lustre.x86_64

    Description

      We upgraded our cluster Sherlock to Lustre 2.12 and put Fir (our Lustre 2.12 servers) into production yesterday, but this morning the filesystem is unusable due to a very high load of ldlm threads on the MDS servers.

      I can see plenty of these on the MDS:

      [Wed Feb  6 09:19:25 2019][1695265.110496] LNet: Service thread pid 35530 was inactive for 350.85s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [Wed Feb  6 09:19:25 2019][1695265.127609] Pid: 35530, comm: mdt02_024 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [Wed Feb  6 09:19:25 2019][1695265.137522] Call Trace:
      [Wed Feb  6 09:19:25 2019][1695265.140184]  [<ffffffffc0e3a0bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.147339]  [<ffffffffc0e3adcc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.154723]  [<ffffffffc15164ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.161731]  [<ffffffffc1516b30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.168906]  [<ffffffffc1517d1a>] mdt_getattr_name_lock+0x90a/0x1c30 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.176002]  [<ffffffffc151fbb5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.182750]  [<ffffffffc151ca18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      [Wed Feb  6 09:19:25 2019][1695265.189403]  [<ffffffffc0e20ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.196349]  [<ffffffffc0e498a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.203643]  [<ffffffffc0ed0302>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.210001]  [<ffffffffc0ed735a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.217133]  [<ffffffffc0e7b92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.225045]  [<ffffffffc0e7f25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [Wed Feb  6 09:19:25 2019][1695265.231563]  [<ffffffffadcc1c31>] kthread+0xd1/0xe0
      [Wed Feb  6 09:19:25 2019][1695265.236657]  [<ffffffffae374c24>] ret_from_fork_nospec_begin+0xe/0x21
      [Wed Feb  6 09:19:25 2019][1695265.243306]  [<ffffffffffffffff>] 0xffffffffffffffff
      [Wed Feb  6 09:19:25 2019][1695265.248518] LustreError: dumping log to /tmp/lustre-log.1549473603.35530
      

      On Fir, we have two Lustre 2.12 MDS servers, fir-md1-s1 and fir-md1-s2, each with 2 MDTs. I dumped the current tasks using sysrq to the console and I'm attaching the full console log for both MDS servers. The servers don't crash, but they leave clients unusable/blocked; from time to time we can still access the filesystem. Any help would be appreciated. Thanks!
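
      For reference, a minimal sketch of how the task dump and the dumped debug log mentioned above can be collected and decoded (the log file name is the one from the message above; adjust paths as needed):

      # dump all task stack traces to the console/dmesg (the sysrq dump referred to above)
      echo t > /proc/sysrq-trigger

      # convert the binary Lustre debug log dumped by the MDS into readable text
      lctl debug_file /tmp/lustre-log.1549473603.35530 /tmp/lustre-log.1549473603.35530.txt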

      Stephane

      Attachments

        1. fir-md1-s1_20191022.log
          2.94 MB
        2. fir-md1-s1-console.log
          4.22 MB
        3. fir-md1-s2-console.log
          2.85 MB
        4. lustre-log.1549473603.35530.gz
          1.08 MB


          Activity

            [LU-11936] High ldlm load, slow/unusable filesystem
            jstroik Jesse Stroik added a comment -

            Hi Serguei,

            We're seeing a similar situation, and I can reproduce it and provide you with details. I added a comment to LU-11989 which includes a relevant snippet from our MDS logs: https://jira.whamcloud.com/browse/LU-11989?focusedCommentId=258916&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-258916

            In our example, we have four cluster nodes that initially connected to one of our 2.12 file systems using the wrong NID - one behind a NAT. The NIDs trapped behind the NAT are 10.23.x.x@tcp. 
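
            As a hedged illustration (exact parameter paths may vary by Lustre version), one way to spot clients that registered with an unexpected @tcp NID is to list the per-client export entries on the MDS:

            # list MDT export entries and look for clients that connected with a tcp NID
            lctl get_param "mdt.*.exports.*.uuid" | grep '@tcp'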

            Let me know if I can provide any other information.

            Jesse Stroik


            ssmirnov Serguei Smirnov added a comment -

            Hi Stephane,

            If the @tcp NID seen on the server is the result of a misconfiguration on any of the clients, then the reported behaviour when removing the primary NID without evicting beforehand may be expected, as the server will keep trying to rebuild its representation of the client.

            Would it be possible to get configuration details from the server/router/client (via lnetctl export)?

            It would make it easier to understand what is causing the issue and how ldlm is getting affected.
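
            For example, a minimal sketch of gathering that configuration on each node (the output file name is only illustrative):

            # dump the current LNet configuration (networks, peers, routes) as YAML
            lnetctl export > /tmp/lnet-config-$(hostname -s).yaml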

            Thanks,

            Serguei.


            sthiell Stephane Thiell added a comment -

            This is the task that was in an endless loop:

            [Tue Oct 22 06:46:06 2019][502582.891745] ldlm_bl_06      R  running task        0 40268      2 0x00000080
            [Tue Oct 22 06:46:06 2019][502582.898941] Call Trace:
            [Tue Oct 22 06:46:07 2019][502582.901568]  [<ffffffffc10018d9>] ? ptlrpc_expire_one_request+0xf9/0x520 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.909199]  [<ffffffffc1003da8>] ? ptlrpc_check_set.part.23+0x378/0x1df0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.916913]  [<ffffffffc100587b>] ? ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.923673]  [<ffffffffc1005dea>] ? ptlrpc_set_wait+0x4ea/0x790 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.930470]  [<ffffffffaf6d7c40>] ? wake_up_state+0x20/0x20
            [Tue Oct 22 06:46:07 2019][502582.936184]  [<ffffffffc0fc3055>] ? ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.943118]  [<ffffffffc0fc452f>] ? __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.950395]  [<ffffffffc0fc4783>] ? ldlm_reprocess_all+0x13/0x20 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.957331]  [<ffffffffc0fdc01e>] ? ldlm_cli_cancel_local+0x29e/0x3f0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.964708]  [<ffffffffc0fe1b67>] ? ldlm_cli_cancel+0x157/0x620 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.971541]  [<ffffffffaf81bd89>] ? ___slab_alloc+0x209/0x4f0
            [Tue Oct 22 06:46:07 2019][502582.977440]  [<ffffffffc0fe20d4>] ? ldlm_blocking_ast_nocheck+0xa4/0x310 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.985094]  [<ffffffffc0fe247a>] ? ldlm_blocking_ast+0x13a/0x170 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.992129]  [<ffffffffc0fbc02c>] ? lock_res_and_lock+0x2c/0x50 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502582.999006]  [<ffffffffc0fedcd8>] ? ldlm_handle_bl_callback+0xf8/0x4f0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.006463]  [<ffffffffc0fc00e0>] ? ldlm_lock_decref_internal+0x1a0/0xa30 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.014202]  [<ffffffffc0d693b9>] ? class_handle2object+0xb9/0x1c0 [obdclass]
            [Tue Oct 22 06:46:07 2019][502583.021483]  [<ffffffffc0fc0a70>] ? ldlm_lock_decref_and_cancel+0x80/0x150 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.029249]  [<ffffffffc1447835>] ? mgs_completion_ast_generic+0x125/0x200 [mgs]
            [Tue Oct 22 06:46:07 2019][502583.036765]  [<ffffffffc1447930>] ? mgs_completion_ast_barrier+0x20/0x20 [mgs]
            [Tue Oct 22 06:46:07 2019][502583.044126]  [<ffffffffc1447943>] ? mgs_completion_ast_ir+0x13/0x20 [mgs]
            [Tue Oct 22 06:46:07 2019][502583.051106]  [<ffffffffc0fbdc58>] ? ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.058392]  [<ffffffffc1005972>] ? ptlrpc_set_wait+0x72/0x790 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.065122]  [<ffffffffaf81e41d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
            [Tue Oct 22 06:46:07 2019][502583.072240]  [<ffffffffc0d68a69>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            [Tue Oct 22 06:46:07 2019][502583.079528]  [<ffffffffc0fbdbb0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.086915]  [<ffffffffc0ffc202>] ? ptlrpc_prep_set+0xd2/0x280 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.093672]  [<ffffffffc0fc3055>] ? ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.100649]  [<ffffffffc0fc452f>] ? __ldlm_reprocess_all+0x11f/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.107927]  [<ffffffffc0fc52a5>] ? ldlm_cancel_lock_for_export.isra.27+0x195/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.116509]  [<ffffffffc0fc54ac>] ? ldlm_cancel_locks_for_export_cb+0x3c/0x50 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.124535]  [<ffffffffc0a8efb0>] ? cfs_hash_for_each_relax+0x250/0x450 [libcfs]
            [Tue Oct 22 06:46:07 2019][502583.132112]  [<ffffffffc0fc5470>] ? ldlm_cancel_lock_for_export.isra.27+0x360/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.140690]  [<ffffffffc0fc5470>] ? ldlm_cancel_lock_for_export.isra.27+0x360/0x360 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.149260]  [<ffffffffc0a92510>] ? cfs_hash_for_each_empty+0x80/0x1d0 [libcfs]
            [Tue Oct 22 06:46:07 2019][502583.156714]  [<ffffffffc0fc57ba>] ? ldlm_export_cancel_locks+0xaa/0x180 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.164257]  [<ffffffffc0fee888>] ? ldlm_bl_thread_main+0x7b8/0xa40 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.171408]  [<ffffffffaf6d7c40>] ? wake_up_state+0x20/0x20
            [Tue Oct 22 06:46:07 2019][502583.177127]  [<ffffffffc0fee0d0>] ? ldlm_handle_bl_callback+0x4f0/0x4f0 [ptlrpc]
            [Tue Oct 22 06:46:07 2019][502583.184618]  [<ffffffffaf6c2e81>] ? kthread+0xd1/0xe0
            [Tue Oct 22 06:46:07 2019][502583.189804]  [<ffffffffaf6c2db0>] ? insert_kthread_work+0x40/0x40
            [Tue Oct 22 06:46:07 2019][502583.195988]  [<ffffffffafd77c24>] ? ret_from_fork_nospec_begin+0xe/0x21
            [Tue Oct 22 06:46:07 2019][502583.202696]  [<ffffffffaf6c2db0>] ? insert_kthread_work+0x40/0x40
            

            I'm attaching the full task dump from when this was happening as fir-md1-s1_20191022.log.

            We have restarted the MGS. Unmounting it made the server unresponsive.

            sthiell Stephane Thiell added a comment - edited

            This issue should probably be reopened because we have just seen it again with 2.12.3 RC1. After a client announced itself with a @tcp NID, it was very hard to remove it from Lustre. We had to evict the bogus tcp NID and use lnetctl peer del --prim_nid.

            And it didn't even work the first time:

            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.117.42@tcp
            peer:
                - primary nid: 10.10.117.42@tcp
                  Multi-Rail: True
                  peer ni:
                    - nid: 10.9.117.42@o2ib4
                      state: NA
                    - nid: 10.10.117.42@tcp
                      state: NA
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.117.42@tcp
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.117.42@tcp
            peer:
                - primary nid: 10.10.117.42@tcp
                  Multi-Rail: False
                  peer ni:
                    - nid: 10.10.117.42@tcp
                      state: NA
            

            See, it removed the good one (o2ib4)!
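
            If the correct entry gets dropped like this, it should be possible to re-create it by hand; a hedged sketch (whether this is even needed depends on whether LNet discovery re-creates the peer on its own):

            # re-add the peer with its correct o2ib4 primary NID
            lnetctl peer add --prim_nid 10.9.117.42@o2ib4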

             

            With a combination of eviction + peer deletion, it worked:

            [root@fir-md1-s1 fir-MDT0000]# cat exports/10.10.117.42@tcp/uuid
            2f0c7362-5db5-a6c6-08eb-fa16109cc0f9
            
            [root@fir-md1-s1 fir-MDT0000]# echo 2f0c7362-5db5-a6c6-08eb-fa16109cc0f9 > evict_client 
            
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.117.42@tcp
            
            
            [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.9.117.42@o2ib4
            peer:
                - primary nid: 10.9.117.42@o2ib4
                  Multi-Rail: True
                  peer ni:
                    - nid: 10.9.117.42@o2ib4
                      state: NA
            [root@fir-md1-s1 fir-MDT0000]# ls exports| grep tcp
            [root@fir-md1-s1 fir-MDT0000]# 
            

            This is a problem: we have IB-only routers and no tcp network/route defined on the servers, so I don't know why Lustre accepts it at all. The ldlm_bl thread at 100% is also still a problem:

            $ top
               PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
             40268 root      20   0       0      0      0 S 100.0  0.0 114:45.76 ldlm_bl_06
             16505 root      20   0  164284   4424   1548 R  22.2  0.0   0:00.04 top
             35331 root      20   0       0      0      0 S   5.6  0.0 100:37.82 kiblnd_sd_03_03
             35333 root      20   0       0      0      0 S   5.6  0.0   0:48.41 lnet_discovery
            

            I'm hesitant to dump the tasks right now, as the filesystem is heavily used.
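
            For what it's worth, the point about having no tcp network or route configured on the servers can be double-checked with lnetctl; a minimal sketch (run on the MDS):

            # show configured LNet networks (only o2ib is expected here)
            lnetctl net show

            # show configured LNet routes (no tcp remote network is expected here)
            lnetctl route show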

            pjones Peter Jones added a comment -

            ok - thanks Stephane


            sthiell Stephane Thiell added a comment -

            Yes, I confirm that once we got rid of all the bogus clients and restarted all the Lustre servers, everything went back to normal. The servers don't recover by themselves, so a full restart apparently is required. Other Lustre servers running 2.8 or 2.10 are not affected. Please feel free to mark this one as a duplicate of LU-11888. I also opened LU-11937 to find out why, in some cases, a NID with tcp0 is loaded on the clients.
            Thanks!
            Stephane

            pjones Peter Jones added a comment -

            Seems related to a ticket you already have


            sthiell Stephane Thiell added a comment -

            Ah! Just after submitting this ticket and attaching the files, I noticed that a possible cause could be that we once again have one client with a NID using @tcp, leading to unusable servers.

            For example, from lustre-log.1549473603.35530:

            @@@ Request sent has failed due to network error: [sent 1549473603/real 1549473603]  req@ffff912521032d00 x1623682053684704/t0(0) o106->fir-MDT0002@10.10.114.10@tcp:15/16 lens 296/280 e 0 to 1 dl 1549473614 ref 1 fl Rpc:eX/2/ffffffff rc -11/-1
            lib-move.c:lnet_handle_find_routed_path() no route to 10.10.114.10@tcp from <?>
            

            All our Lustre servers are exclusively IB. If this is the cause of the issue, this could be a duplicate of my previous ticket LU-11888.
            I'll investigate, get rid of this bad node, and then confirm whether that was the cause.
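
            A quick way to check which NIDs the suspect client is presenting (a minimal sketch; run on the client):

            # list the NIDs this node is using
            lctl list_nids

            # show the LNet networks configured on the node
            lnetctl net show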
            Best,
            Stephane


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 9
