Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Blocker
- Affects Version: Lustre 2.12.0
- Environment: CentOS 7.6, Lustre 2.12.0 clients and servers, some clients with 2.12.0 + patch
- Issue Links: LU-11964
- Severity: 3
Description
We are having more issues with our full 2.12 production setup on Sherlock and Fir: we sometimes see a global filesystem hang, on all nodes, lasting at least 30 seconds and often more. The filesystem can run fine for two hours and then hang for a few minutes. This is impacting production, especially interactive jobs.
These filesystem hangs could be related to compute nodes rebooting, and they match messages like the following on the MDTs:
[769459.092993] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784454/real 1550784454] req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784461 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[769459.120452] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[769473.130314] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784468/real 1550784468] req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784475 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
[769473.157759] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[769494.167799] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784489/real 1550784489] req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784496 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
[769494.195248] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
I'm not 100% sure, but it seems that when these messages stop on the MDTs, the filesystem comes back online. There are no logs on the clients though, as far as I know...
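For reference, this is the kind of quick parsing we can do on a copy of the MDS kernel log to see which client NID the timed-out callbacks are targeting and over which time window they occur. It's only a sketch assuming the message format shown above; the file name is a placeholder.

```python
#!/usr/bin/env python3
# Sketch: count "Request sent has timed out" messages per RPC opcode and
# target NID, using the timestamps from the "[sent .../real ...]" field.
import re
from collections import Counter

# Matches the relevant part of lines like:
#   ... @@@ Request sent has timed out for slow reply: [sent 1550784454/real 1550784454]
#   req@... x.../t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 ...
PAT = re.compile(r"\[sent (\d+)/real \d+\].*?o(\d+)->\S+?@(\d+\.\d+\.\d+\.\d+@\w+)")

counts = Counter()
first_sent = {}
last_sent = {}

with open("mds-kernel.log") as logfile:   # placeholder file name for the MDS kernel log capture
    for line in logfile:
        if "Request sent has timed out" not in line:
            continue
        match = PAT.search(line)
        if not match:
            continue
        sent, opcode, nid = int(match.group(1)), match.group(2), match.group(3)
        key = (opcode, nid)
        counts[key] += 1
        first_sent.setdefault(key, sent)
        last_sent[key] = sent

for (opcode, nid), n in counts.most_common():
    window = last_sent[(opcode, nid)] - first_sent[(opcode, nid)]
    print(f"o{opcode} -> {nid}: {n} timeouts over {window} s")
```

This would at least tell us whether the o104 (lock callback) timeouts always point at the same client NID and how long each episode lasts.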
Please note that we're also in the process of fixing the locking issue described in LU-11964 by deploying a patched 2.12.0.
Is this a known issue in 2.12? Is there any patch available that we can try? Any suggestions would be welcome.
Thanks,
Stephane
This issue is still ongoing, and it still seems to happen when we use mkdir with DNEv2 (each time we hit this issue, there is a mkdir using DNEv2 running on sh-hn01 (10.9.0.1@o2ib4))...
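For context, a DNEv2 mkdir of the kind running on sh-hn01 boils down to something like the sketch below. The path and stripe count are placeholders, not our actual values, and it assumes the `lfs` tool from the Lustre client is available.

```python
#!/usr/bin/env python3
# Illustrative sketch of a DNEv2-style mkdir: create a directory striped
# across multiple MDTs, then display its layout. Path and stripe count are
# placeholders, not the actual values used on sh-hn01.
import subprocess

target = "/fir/users/newuser"   # placeholder directory

# "lfs mkdir -c 2" creates a striped directory spanning 2 MDTs (DNEv2);
# "lfs mkdir -i <index>" would instead place it on a single specific MDT.
subprocess.run(["lfs", "mkdir", "-c", "2", target], check=True)

# Show how the new directory ended up distributed across the MDTs.
subprocess.run(["lfs", "getdirstripe", target], check=True)
```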
The full filesystem was blocked again today for more than 10 minutes while a mkdir was running there. There were some backtraces like the following when things started to recover:
I'm attaching the MDS logs from `fir-md1-s1` as fir-md1-s1_20190715.log (the server hosting MDT0000, MDT0002, and the MGT).
WARNING: this is the full kernel log since the last MDS restart, June 18, 2019!!! Not too bad for 2.12!! Our event started on July 15, 2019 at about 15:20 in the log. The first user ticket arrived at 15:24.
The mkdir that was blocked during this event, which eventually succeeded, was on the following directory, automatically striped on MDT0:
As a reminder:
The good thing is that the MDS recovered by itself, but it took some time... at least 10 minutes.
Thanks,
Stephane