Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6, Lustre 2.12.0 clients and servers, some clients with 2.12.0 + patch LU-11964
    • Severity: 3

    Description

      We are having more issues with a full 2.12 production setup on Sherlock and Fir: we sometimes notice a global filesystem hang, on all nodes, lasting at least 30 seconds and often more. The filesystem can run fine for two hours and then hang for a few minutes. This is impacting production, especially interactive jobs.

      These filesystem hangs could be related to compute nodes rebooting; they seem to coincide with messages like the following on the MDTs:

      [769459.092993] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784454/real 1550784454]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784461 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [769459.120452] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [769473.130314] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784468/real 1550784468]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784475 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      [769473.157759] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [769494.167799] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784489/real 1550784489]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784496 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      [769494.195248] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      

      I'm not 100% sure, but it sounds like the filesystem comes back online when these messages stop on the MDTs. There are no logs on the clients though, as far as I know...
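      (For reference, next time this happens we could try to capture client-side evidence with the standard lctl debug facilities; a minimal sketch, where the debug flags and output path are just an example:)

      # on an affected client, while the hang is in progress
      lctl set_param debug=+rpctrace   # add RPC tracing to the debug mask
      lctl set_param debug=+dlmtrace   # add LDLM lock tracing
      lctl clear                       # empty the in-kernel debug buffer
      # ... wait for / reproduce the hang ...
      lctl dk /tmp/client-debug.$(hostname).log   # dump the debug buffer to a file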

      Please note that we're also in the process of fixing the locking issue described in LU-11964 by deploying a patched 2.12.0.
      Is this a known issue in 2.12? Any available patch we could try, or other suggestions, would be welcome.
      Thanks,
      Stephane

      Attachments

        1. fir-md1-s1_20190715.log
          2.67 MB
        2. fir-md1-s1_dk20190225.log.gz
          21.15 MB
        3. fir-md1-s1_dlmtrace_20190312.log.gz
          704 kB
        4. fir-md1-s1-20190228-1.log.gz
          6.10 MB
        5. fir-md1-s1-20190228-2.log.gz
          747 kB
        6. fir-md1-s1-20190508.log
          1.05 MB
        7. fir-md1-s1-kern-syslog-20190228.log
          598 kB
        8. fir-md1-s2_dlmtrace_20190312.log.gz
          11.62 MB
        9. fir-md1-s2-20190508.log
          573 kB
        10. fir-mdt-grafana-fs-hang_mdt1+3_20190304.png
          268 kB

        Issue Links

          Activity

            [LU-11989] Global filesystem hangs in 2.12

            sthiell Stephane Thiell added a comment -

            Hi Olaf – No, we haven't seen such long hangs for quite a while now, which is a good thing. They were very frequent when using mkdir with DNEv2, but the issue seems to have been resolved. We're still using 2.12.3 + a few patches (server-side), but they have all landed in 2.12.4.

            ofaaland Olaf Faaland added a comment -

            Stephane, are you still seeing these long hangs? Asking because we aren't yet using DNE2 but would like to (we originally intended to with Lustre 2.8 but decided against it, and are now revisiting the idea).

            jstroik Jesse Stroik added a comment - edited

            I've been observing something similar when we have clients that mistakenly associated with the MDS using a NID behind a NAT. In this example, nodes came up with LNet bound to a private cluster interface instead of an InfiniBand interface.

            With a Lustre 2.10 client and a 2.12 server, we can reproduce this behavior easily. The filesystem hangs for all clients. A relevant snippet from our logs is included below - let me know if I can provide any additional information.

             

            Nov 26 22:04:36 apollo-mds kernel: [4167831.563633] Lustre: 62902:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1574805865/real 1574805865] req@ffff8bc8d0153a80 x1646935336777456/t0(0) o104->apollo-MDT0000@10.23.255.254@tcp:15/16 lens 296/224 e 0 to 1 dl 1574805876 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 
            Nov 26 22:04:36 apollo-mds kernel: [4167831.563643] Lustre: 62902:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 94 previous similar messages 
            Nov 26 22:07:12 apollo-mds kernel: [4167986.760003] Lustre: apollo-MDT0000: Connection restored to 2e47e1dd-58f1-ea9e-4af1-7b2ae2de0c79 (at 10.23.255.254@tcp) 
            Nov 26 22:07:12 apollo-mds kernel: [4167986.760011] Lustre: Skipped 22 previous similar messages 
            Nov 26 22:09:33 apollo-mds kernel: [4168128.561768] LustreError: 77047:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8bc6fcff4800 ns: mdt-apollo-MDT0000_UUID lock: ffff8bce53dcc800/0xe6373f3e00450a97 lrc: 3/0,0 mode: PR/PR res: [0x2000013a1:0x1:0x0].0x0 bits 0x13/0x0 rrc: 19 type: IBT flags: 0x50200400000020 nid: 10.23.255.218@tcp remote: 0x195afc50b177c9d2 expref: 6 pid: 77047 timeout: 0 lvb_type: 0 
            Nov 26 22:14:24 apollo-mds kernel: [4168419.322349] LustreError: 59085:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.23.255.218@tcp) failed to reply to blocking AST (req@ffff8bb79b703a80 x1646935336873728 status 0 rc -110), evict it ns: mdt-apollo-MDT0000_UUID lock: ffff8bc53a36e9c0/0xe6373f3e00514e3a lrc: 4/0,0 mode: PR/PR res: [0x20000c990:0x2302:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.23.255.218@tcp remote: 0x195afc50b177caff expref: 531 pid: 67366 timeout: 4168744 lvb_type: 0
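
            (For what it's worth, one way to keep nodes from registering a NAT'd NID in the first place is to pin LNet to the intended interface via the lnet module options; a sketch only, the interface name ib0 is an assumption:)

            # /etc/modprobe.d/lnet.conf on the client:
            # bring up only the o2ib network on the IB interface, so the node
            # never registers a tcp NID behind the NAT with the servers
            options lnet networks="o2ib(ib0)"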
            

            sthiell Stephane Thiell added a comment -

            This issue is still ongoing, and it still seems to happen when we use mkdir with DNEv2 (each time we hit this issue, there is a mkdir using DNEv2 running on sh-hn01 (10.9.0.1@o2ib4)).

            The full filesystem was blocked again today for more than 10 minutes while a mkdir was running there. There are some backtraces like this from when things started to recover:

            Jul 15 15:34:26 fir-md1-s1 kernel: LNet: Service thread pid 23077 was inactive for 200.25s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Jul 15 15:34:26 fir-md1-s1 kernel: LNet: Skipped 2 previous similar messages
            Jul 15 15:34:26 fir-md1-s1 kernel: Pid: 23077, comm: mdt02_042 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            Jul 15 15:34:26 fir-md1-s1 kernel: Call Trace:
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0d92e55>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0d93cbc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc14744bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1474b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1474f0c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc148c44c>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc148fd88>] mdt_reint_setattr+0x6c8/0x1340 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1491ee3>] mdt_reint_rec+0x83/0x210 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1470143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc147b4a7>] mdt_reint+0x67/0x140 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0e3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0dd4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0dd863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffff84ec1c31>] kthread+0xd1/0xe0
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffff85574c24>] ret_from_fork_nospec_begin+0xe/0x21
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1563230066.23077
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.8.9.9@o2ib6  ns: mdt-fir-MDT0000_UUID lock: ffff8f31c0f9e540/0x5d9ee640b61b6f03 lrc: 3/0,0 mode: PR/PR res: [0x2000297f6:0x882:0x0].0x0 bits 0x5b/0x0 rrc: 8 type: IBT flags: 0x60200400000020 nid: 10.8.9.9@o2ib6 remote: 0x243de7f26418e7fc expref: 16849 pid: 25680 timeout: 2345126 lvb_type: 0
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 10 previous similar messages
            Jul 15 15:34:36 fir-md1-s1 kernel: LNet: Service thread pid 23077 completed after 209.64s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
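
            (If useful next time, we can also grab the MDS-side DLM lock counters while the hang is in progress; a rough sketch using standard lctl parameters, with a namespace glob matching what appears in the logs above:)

            # on fir-md1-s1, during the hang
            lctl get_param ldlm.namespaces.mdt-fir-MDT000*.lock_count       # granted locks per MDT namespace
            lctl get_param ldlm.namespaces.mdt-fir-MDT000*.resource_count   # resources currently holding locks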
            

            I'm attaching MDS logs from fir-md1-s1 as fir-md1-s1_20190715.log (the server hosting MDT0000, MDT0002 and the MGT).

            WARNING: this is the full kernel log since the last MDS restart on June 18, 2019! Not too bad for 2.12! Our event started on July 15, 2019 at about 15:20 in the log. The first user ticket arrived at 15:24.

            The mkdir that was blocked during this event (it eventually succeeded) was for the following directory, which was automatically placed on MDT0:

            [root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users/kmoulin
            lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
            

            As a reminder:

            [root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users
            lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
            mdtidx		 FID[seq:oid:ver]
                 0		 [0x200000400:0x5:0x0]		
                 1		 [0x240000402:0x5:0x0]		
                 2		 [0x2c0000400:0x5:0x0]		
                 3		 [0x280000401:0x5:0x0]	
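
            (For reference, this kind of DNEv2 striped directory is created with lfs mkdir / lfs setdirstripe; a sketch only, not necessarily the exact commands used when /fir was set up:)

            # create a directory striped across 4 MDTs, starting at MDT index 0
            lfs mkdir -c 4 -i 0 /fir/users
            # show its own striping and any default striping applied to new subdirectories
            lfs getdirstripe /fir/users
            lfs getdirstripe -D /fir/users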
            

            The good thing is that the MDS recovered by itself, but it took some time: at least 10 minutes.

            Thanks,
            Stephane

             


            sthiell Stephane Thiell added a comment -

            We still had the issue this morning, while Kilian was creating a bunch of new accounts (mkdir) and nodes were also being rebooted. The mkdirs were done in /fir/users and /fir/groups, which are DNE2 enabled.

            Not everything was blocked, but a find in /fir/users and /fir/groups was hanging on all compute nodes. However, I was able to browse /fir/users/sthiell/ without issue (DNE1 only).
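
            (Next time, a quick way to see from a client which MDTs are affected could be something like the sketch below; standard lfs commands, nothing exotic:)

            lfs df /fir                    # a hung MDT usually stalls or is missing here
            lfs check servers              # ping all MDC/OSC connections (run as root)
            lfs getdirstripe /fir/users    # confirm which MDTs back the affected directory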

             

            Attaching full kernel logs from both MDSes as fir-md1-s1-20190508.log and fir-md1-s2-20190508.log. The issue happened around May 08 11:23 in the logs and is particularly visible in fir-md1-s2-20190508.log.

             

            After a few minutes, things went back to normal.


            sthiell Stephane Thiell added a comment -

            Just wanted to follow up on this one, as today we noticed the same issue against regal, our old 2.8-based scratch filesystem mounted under /regal, after rebooting a few compute nodes. We had never seen this behavior before upgrading our clients to 2.12, so we strongly suspect a regression there. It would be odd for us to have missed this kind of behavior for several years (and it's not like we never reboot nodes). And there is no DNE, DoM, or PFL involved there.

            Server (2.8) logs:

            Apr 19 17:38:56 regal-mds1 kernel: Lustre: regal-MDT0000: haven't heard from client e07871c1-593b-18ef-bb42-04c1e04c4db9 (at 10.8.8.24@o2ib6) in 219 seconds. I think it's dead, and I am evicting it. exp ffff881bdd5ef000, cur 1555713536 expire 1555713386 last 1555713317
            Apr 19 17:38:56 regal-mds1 kernel: Lustre: Skipped 10 previous similar messages
            Apr 19 17:42:40 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713749/real 1555713749]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713760 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
            Apr 19 17:42:51 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713760/real 1555713760]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713771 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
            Apr 19 17:43:02 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713771/real 1555713771]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713782 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
            Apr 19 17:43:13 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713782/real 1555713782]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713793 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
            Apr 19 17:43:24 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713793/real 1555713793]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713804 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
            Apr 19 17:43:24 regal-mds1 kernel: LustreError: 34779:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client (nid 10.8.27.23@o2ib6) returned error from blocking AST (req status -107 rc -107), evict it ns: mdt-regal-MDT0000_UUID lock: ffff880223094d00/0x6c9a0352c4202513 lrc: 4/0,0 mode: PR/PR res: [0x2000772e5:0x8a:0x0].0x0 bits 0x13 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.8.27.23@o2ib6 remote: 0xd278f5eabd47f9e6 expref: 14 pid: 35256 timeout: 7453544476 lvb_type: 0
            Apr 19 17:43:24 regal-mds1 kernel: LustreError: 138-a: regal-MDT0000: A client on nid 10.8.27.23@o2ib6 was evicted due to a lock blocking callback time out: rc -107
             

            My colleague had the following command blocked during that time:

            [root@sh-hn01 regal]# mkdir /regal/.deleted/hasantos/

            But no logs on the client (2.12).

            Access to /regal was restored after a few minutes.
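
            (One mitigation we may try on our side is to make sure clients drop their Lustre state cleanly before a reboot, so the MDS isn't left waiting on blocking ASTs from a node that is already gone; a sketch of the shutdown step, assuming the mounts can actually be unmounted at that point:)

            # on a client scheduled for reboot
            umount -a -t lustre    # disconnect from MDTs/OSTs and cancel held locks
            lustre_rmmod           # unload the Lustre/LNet modules
            reboot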


            pfarrell Patrick Farrell (Inactive) added a comment -

            LU-12064 has been opened to look at possible adjustments to the adaptive timeout code. Note this is not a fix for this issue, just something that should hopefully reduce delays.


            sthiell Stephane Thiell added a comment -

            Thanks!

            Added the following to our /etc/modprobe.d/lustre.conf:

            # LU-11989
            options ptlrpc ldlm_enqueue_min=30
            options ptlrpc at_max=20
            

            And restarted the MGS and MDTs.

            $ clush -w@mds -b cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min 
            ---------------
            fir-md1-s[1-2] (2)
            ---------------
            30
            $ clush -w@mds -b cat /sys/module/ptlrpc/parameters/at_max
            ---------------
            fir-md1-s[1-2] (2)
            ---------------
            20
            
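            (As an aside, at_max can apparently also be set cluster-wide and persistently from the MGS with lctl conf_param, if we ever want to avoid the module-option route; a sketch, using the same value we are testing here. ldlm_enqueue_min appears to be a module parameter only, so that one stays in modprobe.d:)

            # on the MGS
            lctl conf_param fir.sys.at_max=20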

            People

              pjones Peter Jones
              sthiell Stephane Thiell
              Votes: 1
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: