Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6, Lustre 2.12.0 clients and servers, some clients with 2.12.0 + patch LU-11964
    • Severity: 3

    Description

      We are having more issues with a full 2.12 production setup on Sherlock and Fir. We sometimes notice a global filesystem hang, on all nodes, lasting at least 30 seconds and often more. The filesystem can run fine for 2 hours and then hang for a few minutes. This is impacting production, especially interactive jobs.

      These filesystem hangs could be related to compute nodes rebooting, and they coincide with messages like the following on the MDTs:

      [769459.092993] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784454/real 1550784454]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784461 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [769459.120452] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [769473.130314] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784468/real 1550784468]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784475 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      [769473.157759] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [769494.167799] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784489/real 1550784489]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784496 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      [769494.195248] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      

      I'm not 100% sure, but it looks like the filesystem comes back online once these messages stop on the MDTs. As far as I know, there are no relevant logs on the clients though...

      Please note that we're also in the process of fixing the locking issue described in LU-11964 by deploying a patched 2.12.0.
      Is this a known issue in 2.12? Any patch we could try, or other suggestions, would be welcome.
      Thanks,
      Stephane
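
      For reference, the dlmtrace/dk attachments below are Lustre kernel debug dumps; a minimal sketch of how such logs are typically captured on an MDS is shown here (the output path is illustrative, and the exact debug flags used for the attached logs are an assumption):

      # add LDLM tracing to the current kernel debug mask (assumption: defaults otherwise)
      lctl set_param debug=+dlmtrace
      # ...reproduce the hang...
      # dump the kernel debug buffer to a file for attachment
      lctl dk /tmp/fir-md1-s1_dk.log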

      Attachments

        1. fir-md1-s1_20190715.log
          2.67 MB
        2. fir-md1-s1_dk20190225.log.gz
          21.15 MB
        3. fir-md1-s1_dlmtrace_20190312.log.gz
          704 kB
        4. fir-md1-s1-20190228-1.log.gz
          6.10 MB
        5. fir-md1-s1-20190228-2.log.gz
          747 kB
        6. fir-md1-s1-20190508.log
          1.05 MB
        7. fir-md1-s1-kern-syslog-20190228.log
          598 kB
        8. fir-md1-s2_dlmtrace_20190312.log.gz
          11.62 MB
        9. fir-md1-s2-20190508.log
          573 kB
        10. fir-mdt-grafana-fs-hang_mdt1+3_20190304.png
          268 kB

        Issue Links

          Activity

            [LU-11989] Global filesystem hangs in 2.12
            adilger Andreas Dilger made changes -
            Resolution New: Cannot Reproduce [ 5 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            sthiell Stephane Thiell added a comment -

            Hi Olaf – No, we haven't seen such long hangs for quite a while now, which is a good thing. They were very frequent when using mkdir with DNEv2, but the issue seems to have been resolved. We're still using 2.12.3 + a few patches (server-side), but those have all landed in 2.12.4.
            ofaaland Olaf Faaland added a comment -

            Stephane, are you still seeing these long hangs?  Asking because we aren't yet using DNE2 but would like to (we originally intended to with Lustre 2.8, decided against it, and are now revisiting the idea).
            jstroik Jesse Stroik added a comment - - edited

            I've been observing something similar when we have clients that have mistakenly associated with the MDS using an NID behind a NAT. In this example, nodes came up with LNet bound to a private cluster interface instead of an InfiniBand interface.

            With a Lustre 2.10 client and a 2.12 server, we can reproduce this behavior easily. The filesystem hangs for all clients. A relevant snippet from our logs is included below - let me know if I can provide any additional information.

             

            Nov 26 22:04:36 apollo-mds kernel: [4167831.563633] Lustre: 62902:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1574805865/real 1574805865] req@ffff8bc8d0153a80 x1646935336777456/t0(0) o104->apollo-MDT0000@10.23.255.254@tcp:15/16 lens 296/224 e 0 to 1 dl 1574805876 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 
            Nov 26 22:04:36 apollo-mds kernel: [4167831.563643] Lustre: 62902:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 94 previous similar messages 
            Nov 26 22:07:12 apollo-mds kernel: [4167986.760003] Lustre: apollo-MDT0000: Connection restored to 2e47e1dd-58f1-ea9e-4af1-7b2ae2de0c79 (at 10.23.255.254@tcp) 
            Nov 26 22:07:12 apollo-mds kernel: [4167986.760011] Lustre: Skipped 22 previous similar messages 
            Nov 26 22:09:33 apollo-mds kernel: [4168128.561768] LustreError: 77047:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8bc6fcff4800 ns: mdt-apollo-MDT0000_UUID lock: ffff8bce53dcc800/0xe6373f3e00450a97 lrc: 3/0,0 mode: PR/PR res: [0x2000013a1:0x1:0x0].0x0 bits 0x13/0x0 rrc: 19 type: IBT flags: 0x50200400000020 nid: 10.23.255.218@tcp remote: 0x195afc50b177c9d2 expref: 6 pid: 77047 timeout: 0 lvb_type: 0 
            Nov 26 22:14:24 apollo-mds kernel: [4168419.322349] LustreError: 59085:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.23.255.218@tcp) failed to reply to blocking AST (req@ffff8bb79b703a80 x1646935336873728 status 0 rc -110), evict it ns: mdt-apollo-MDT0000_UUID lock: ffff8bc53a36e9c0/0xe6373f3e00514e3a lrc: 4/0,0 mode: PR/PR res: [0x20000c990:0x2302:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.23.255.218@tcp remote: 0x195afc50b177caff expref: 531 pid: 67366 timeout: 4168744 lvb_type: 0
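
            As a side note on the misconfiguration described above: a common way to keep clients from registering a NID on the wrong (NAT'd) interface is to pin LNet to the intended fabric in the module options. A minimal sketch, assuming the InfiniBand interface is ib0 (the interface name and file path are assumptions, not taken from this cluster):

            # /etc/modprobe.d/lustre.conf on the clients (sketch; ib0 is assumed)
            # restrict LNet to the IB interface so the client registers an o2ib NID
            # instead of an address on the private cluster network
            options lnet networks="o2ib(ib0)"

            After reloading the lnet module, "lctl list_nids" should show only the intended NID.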
            
            pjones Peter Jones made changes -
            Assignee Original: Patrick Farrell [ pfarrell ] New: Peter Jones [ pjones ]

            sthiell Stephane Thiell added a comment -

            This issue is still ongoing, and it still seems to happen when we use mkdir with DNEv2 (each time we hit it, there is a mkdir using DNEv2 running on sh-hn01 (10.9.0.1@o2ib4)).

            The full filesystem was blocked again today for more than 10 minutes while a mkdir was running there. There are backtraces like the following from when things started to recover:

            Jul 15 15:34:26 fir-md1-s1 kernel: LNet: Service thread pid 23077 was inactive for 200.25s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Jul 15 15:34:26 fir-md1-s1 kernel: LNet: Skipped 2 previous similar messages
            Jul 15 15:34:26 fir-md1-s1 kernel: Pid: 23077, comm: mdt02_042 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            Jul 15 15:34:26 fir-md1-s1 kernel: Call Trace:
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0d92e55>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0d93cbc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc14744bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1474b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1474f0c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc148c44c>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc148fd88>] mdt_reint_setattr+0x6c8/0x1340 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1491ee3>] mdt_reint_rec+0x83/0x210 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1470143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc147b4a7>] mdt_reint+0x67/0x140 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0e3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0dd4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0dd863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffff84ec1c31>] kthread+0xd1/0xe0
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffff85574c24>] ret_from_fork_nospec_begin+0xe/0x21
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1563230066.23077
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.8.9.9@o2ib6  ns: mdt-fir-MDT0000_UUID lock: ffff8f31c0f9e540/0x5d9ee640b61b6f03 lrc: 3/0,0 mode: PR/PR res: [0x2000297f6:0x882:0x0].0x0 bits 0x5b/0x0 rrc: 8 type: IBT flags: 0x60200400000020 nid: 10.8.9.9@o2ib6 remote: 0x243de7f26418e7fc expref: 16849 pid: 25680 timeout: 2345126 lvb_type: 0
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 10 previous similar messages
            Jul 15 15:34:36 fir-md1-s1 kernel: LNet: Service thread pid 23077 completed after 209.64s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
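
            The "system was overloaded" note in the last message above usually points at MDT service thread and timeout tuning. A small sketch of how those settings can be inspected on the MDS (the parameter names assume a standard 2.12 MDS; nothing here is specific to Fir):

            # number of MDT service threads currently running vs. the configured maximum
            lctl get_param mds.MDS.mdt.threads_started mds.MDS.mdt.threads_max
            # basic RPC timeout and adaptive-timeout settings
            lctl get_param timeout at_min at_max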
            

            I'm attaching the kernel logs from MDS `fir-md1-s1` as fir-md1-s1_20190715.log (this is the server hosting MDT0000, MDT0002 and the MGT).

            WARNING: these are the full kernel logs since the last MDS restart on June 18, 2019 (not too bad for 2.12!). Our event started on July 15, 2019 at about 15:20 in the log; the first user ticket arrived at 15:24.

            The mkdir that was blocked during this event was for the following directory, automatically striped on MDT0, and it eventually succeeded:

            [root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users/kmoulin
            lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
            

            As a reminder:

            [root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users
            lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
            mdtidx		 FID[seq:oid:ver]
                 0		 [0x200000400:0x5:0x0]		
                 1		 [0x240000402:0x5:0x0]		
                 2		 [0x2c0000400:0x5:0x0]		
                 3		 [0x280000401:0x5:0x0]	
            

            The good thing is that the MDS recovered by itself, but it took some time: at least 10 minutes.
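
            For context, a striped directory like /fir/users shown above would typically be created with something along these lines (a sketch only, not the exact command used on Fir; the stripe count and starting MDT index are taken from the getdirstripe output):

            # create a DNE2 directory striped across 4 MDTs, starting at MDT0000
            lfs mkdir -c 4 -i 0 /fir/users
            # equivalently: lfs setdirstripe -c 4 -i 0 /fir/users

            New subdirectories created underneath it (such as /fir/users/kmoulin above) are each placed on one of those MDTs according to the fnv_1a_64 hash of the name, and are not themselves striped unless requested.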

            Thanks,
            Stephane

             

            sthiell Stephane Thiell made changes -
            Attachment New: fir-md1-s1_20190715.log [ 33210 ]

            sthiell Stephane Thiell added a comment -

            We had the issue again this morning, while Kilian was creating a bunch of new accounts (mkdir) and also rebooting nodes. The mkdirs were done in /fir/users and /fir/groups, which are DNE2-enabled.

            Not everything was blocked, but find in /fir/users and /fir/groups was hanging on all compute nodes. However, I was able to browse /fir/users/sthiell/ without issue (DNE1 only).

             

            Attaching the full kernel logs from both MDS nodes as fir-md1-s1-20190508.log and fir-md1-s2-20190508.log. The issue happened around May 08 11:23 in the logs and is particularly visible in fir-md1-s2-20190508.log.

             

            After a few minutes, things went back to normal.

            sthiell Stephane Thiell made changes -
            Attachment New: fir-md1-s2-20190508.log [ 32549 ]
            sthiell Stephane Thiell made changes -
            Attachment New: fir-md1-s1-20190508.log [ 32548 ]

            People

              Assignee: pjones Peter Jones
              Reporter: sthiell Stephane Thiell
              Votes: 1
              Watchers: 12
