[LU-11989] Global filesystem hangs in 2.12 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Blocker
Fix Version/s: None
Affects Version/s: Lustre 2.12.0
Labels:
None
Environment:
CentOS 7.6, Lustre 2.12.0 clients and servers, some clients with 2.12.0 + patch ~~LU-11964~~

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

We are having more issues with a full 2.12 production setup on Sherlock and Fir, we can notice sometimes a global filesystem hang, on all nodes, for at least 30 seconds, often more. The filesystem can run fine for 2 hours and then hang during a few minutes. This is impacting production, especially interactive jobs.

These filesystem hangs could be related to compute nodes rebooting and matching messages like the following on the MDTs:

[769459.092993] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784454/real 1550784454]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784461 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[769459.120452] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[769473.130314] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784468/real 1550784468]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784475 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
[769473.157759] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[769494.167799] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784489/real 1550784489]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784496 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
[769494.195248] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages

I'm not 100% sure but it sounds like when these messages stop on the MDTs, the filesystem comes back online. There is no log on the clients though, as far as I know...

Please note that we're also in the process of fixing the locking issue described in ~~LU-11964~~ by deploying a patched 2.12.0.
Is this a known issue in 2.12? Any patch available that we can try, or suggestions would be welcomed.
Thanks,
Stephane

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

fir-md1-s1_20190715.log
2.67 MB
15/Jul/19 10:42 PM
fir-md1-s1_dk20190225.log.gz
21.15 MB
25/Feb/19 5:31 PM
fir-md1-s1_dlmtrace_20190312.log.gz
704 kB
12/Mar/19 7:47 PM
fir-md1-s1-20190228-1.log.gz
6.10 MB
01/Mar/19 3:21 AM
fir-md1-s1-20190228-2.log.gz
747 kB
01/Mar/19 3:21 AM
fir-md1-s1-20190508.log
1.05 MB
08/May/19 6:48 PM
fir-md1-s1-kern-syslog-20190228.log
598 kB
01/Mar/19 3:21 AM
fir-md1-s2_dlmtrace_20190312.log.gz
11.62 MB
12/Mar/19 7:47 PM
fir-md1-s2-20190508.log
573 kB
08/May/19 6:48 PM
fir-mdt-grafana-fs-hang_mdt1+3_20190304.png
268 kB
05/Mar/19 12:07 AM

Issue Links

is related to

LU-12037 Possible DNE issue leading to hung filesystem

Resolved

LU-12064 Adaptive timeout at_min adjustment & granularity

Reopened

Activity

[LU-11989] Global filesystem hangs in 2.12

Stephane Thiell added a comment - 19/Apr/19 10:57 PM

Just wanted to follow-up on this one, as today we noticed the same issue against regal, our 2.8-based old scratch filesystem, mounted under /regal, after rebooting a few compute nodes. We have never seen this behavior before upgrading our clients to 2.12, so we strongly suspect some regression there. It would be weird that we missed that kind of behavior for several years (and it's not like we're not rebooting nodes ). And there is no DNE, DoM, PFL involved there.

Server (2.8) logs:

Apr 19 17:38:56 regal-mds1 kernel: Lustre: regal-MDT0000: haven't heard from client e07871c1-593b-18ef-bb42-04c1e04c4db9 (at 10.8.8.24@o2ib6) in 219 seconds. I think it's dead, and I am evicting it. exp ffff881bdd5ef000, cur 1555713536 expire 1555713386 last 1555713317
Apr 19 17:38:56 regal-mds1 kernel: Lustre: Skipped 10 previous similar messages
Apr 19 17:42:40 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713749/real 1555713749]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713760 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Apr 19 17:42:51 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713760/real 1555713760]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713771 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 19 17:43:02 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713771/real 1555713771]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713782 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 19 17:43:13 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713782/real 1555713782]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713793 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 19 17:43:24 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713793/real 1555713793]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713804 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 19 17:43:24 regal-mds1 kernel: LustreError: 34779:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client (nid 10.8.27.23@o2ib6) returned error from blocking AST (req status -107 rc -107), evict it ns: mdt-regal-MDT0000_UUID lock: ffff880223094d00/0x6c9a0352c4202513 lrc: 4/0,0 mode: PR/PR res: [0x2000772e5:0x8a:0x0].0x0 bits 0x13 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.8.27.23@o2ib6 remote: 0xd278f5eabd47f9e6 expref: 14 pid: 35256 timeout: 7453544476 lvb_type: 0
Apr 19 17:43:24 regal-mds1 kernel: LustreError: 138-a: regal-MDT0000: A client on nid 10.8.27.23@o2ib6 was evicted due to a lock blocking callback time out: rc -107

My colleague had that following command blocked during that time:

[root@sh-hn01 regal]# mkdir /regal/.deleted/hasantos/

But no logs on the client (2.12).

Access to /regal was restored after a few minutes.

Stephane Thiell added a comment - 19/Apr/19 10:57 PM Just wanted to follow-up on this one, as today we noticed the same issue against regal, our 2.8-based old scratch filesystem, mounted under /regal, after rebooting a few compute nodes. We have never seen this behavior before upgrading our clients to 2.12, so we strongly suspect some regression there. It would be weird that we missed that kind of behavior for several years (and it's not like we're not rebooting nodes ). And there is no DNE, DoM, PFL involved there. Server (2.8) logs: Apr 19 17:38:56 regal-mds1 kernel: Lustre: regal-MDT0000: haven't heard from client e07871c1-593b-18ef-bb42-04c1e04c4db9 (at 10.8.8.24@o2ib6) in 219 seconds. I think it's dead, and I am evicting it. exp ffff881bdd5ef000, cur 1555713536 expire 1555713386 last 1555713317 Apr 19 17:38:56 regal-mds1 kernel: Lustre: Skipped 10 previous similar messages Apr 19 17:42:40 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713749/real 1555713749] req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713760 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Apr 19 17:42:51 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713760/real 1555713760] req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713771 ref 1 fl Rpc:X/2/ffffffff rc 0/-1 Apr 19 17:43:02 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713771/real 1555713771] req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713782 ref 1 fl Rpc:X/2/ffffffff rc 0/-1 Apr 19 17:43:13 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713782/real 1555713782] req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713793 ref 1 fl Rpc:X/2/ffffffff rc 0/-1 Apr 19 17:43:24 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713793/real 1555713793] req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713804 ref 1 fl Rpc:X/2/ffffffff rc 0/-1 Apr 19 17:43:24 regal-mds1 kernel: LustreError: 34779:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client (nid 10.8.27.23@o2ib6) returned error from blocking AST (req status -107 rc -107), evict it ns: mdt-regal-MDT0000_UUID lock: ffff880223094d00/0x6c9a0352c4202513 lrc: 4/0,0 mode: PR/PR res: [0x2000772e5:0x8a:0x0].0x0 bits 0x13 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.8.27.23@o2ib6 remote: 0xd278f5eabd47f9e6 expref: 14 pid: 35256 timeout: 7453544476 lvb_type: 0 Apr 19 17:43:24 regal-mds1 kernel: LustreError: 138-a: regal-MDT0000: A client on nid 10.8.27.23@o2ib6 was evicted due to a lock blocking callback time out: rc -107 My colleague had that following command blocked during that time: [root@sh-hn01 regal] # mkdir /regal/.deleted/hasantos/ But no logs on the client (2.12). Access to /regal was restored after a few minutes.

Patrick Farrell (Inactive) added a comment - 12/Mar/19 9:24 PM

LU-12064 has been opened to look at possible adjustments to the adaptive timeout code. Note this is not a fix for this issue, just something to hopefully reduce delays.

Patrick Farrell (Inactive) added a comment - 12/Mar/19 9:24 PM LU-12064 has been opened to look at possible adjustments to the adaptive timeout code. Note this is not a fix for this issue, just something to hopefully reduce delays.

Stephane Thiell added a comment - 12/Mar/19 9:15 PM

Thanks!

Added that to our /etc/modprobe.d/lustre.conf:

# LU-11989
options ptlrpc ldlm_enqueue_min=30
options ptlrpc at_max=20

And restarted MGS +MDTs.

$ clush -w@mds -b cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min 
---------------
fir-md1-s[1-2] (2)
---------------
30
$ clush -w@mds -b cat /sys/module/ptlrpc/parameters/at_max
---------------
fir-md1-s[1-2] (2)
---------------
20

Stephane Thiell added a comment - 12/Mar/19 9:15 PM Thanks! Added that to our /etc/modprobe.d/lustre.conf: # LU-11989 options ptlrpc ldlm_enqueue_min=30 options ptlrpc at_max=20 And restarted MGS +MDTs. $ clush -w@mds -b cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min --------------- fir-md1-s[1-2] (2) --------------- 30 $ clush -w@mds -b cat /sys/module/ptlrpc/parameters/at_max --------------- fir-md1-s[1-2] (2) --------------- 20

Patrick Farrell (Inactive) added a comment - 12/Mar/19 8:43 PM - edited

Stephane,

Looking in to the timeouts, you can tweak them, but it's a pretty big hammer. We don't want to set these too low or it may cause trouble.

Let's aim for 30 seconds.

On only the MDSes.
You need to set ldlm_enqueue_min to 30, then at_max to 20. (1.5*at_max is the MDS bl_ast timeout. Both are ptlrpc module parameters).

This should reduce your hang time from 150 seconds to 30, and 30 should still be plenty of time.

Patrick Farrell (Inactive) added a comment - 12/Mar/19 8:43 PM - edited Stephane, Looking in to the timeouts, you can tweak them, but it's a pretty big hammer. We don't want to set these too low or it may cause trouble. Let's aim for 30 seconds. On only the MDSes. You need to set ldlm_enqueue_min to 30, then at_max to 20. (1.5*at_max is the MDS bl_ast timeout. Both are ptlrpc module parameters). This should reduce your hang time from 150 seconds to 30, and 30 should still be plenty of time.

Patrick Farrell (Inactive) added a comment - 12/Mar/19 7:55 PM

Would you be able to tar up and attach dmesg from all your MDSes?

Patrick Farrell (Inactive) added a comment - 12/Mar/19 7:55 PM Would you be able to tar up and attach dmesg from all your MDSes?

Patrick Farrell (Inactive) added a comment - 12/Mar/19 7:52 PM

And thank you for the logs, will investigate.

Patrick Farrell (Inactive) added a comment - 12/Mar/19 7:52 PM And thank you for the logs, will investigate.

Patrick Farrell (Inactive) added a comment - 12/Mar/19 7:52 PM

Stephane,

Have you changed any settings related to adaptive timeouts (AT_MIN, AT_MAX) or obd_timeout? (Not suggesting you change these yet - Just asking.)

Patrick Farrell (Inactive) added a comment - 12/Mar/19 7:52 PM Stephane, Have you changed any settings related to adaptive timeouts (AT_MIN, AT_MAX) or obd_timeout? (Not suggesting you change these yet - Just asking.)

Stephane Thiell added a comment - 12/Mar/19 7:46 PM

Wasn't very long... the client was 10.8.20.15@o2ib6 in that case:

fir-md1-s1 (MDT0000, MDT0002):

Mar 12 12:38:48 fir-md1-s1 kernel: Lustre: 31416:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552419520/real 1552419520]  req@ffff8b8c3445ef00 x1627815889426544/t0(0) o106->fir-MDT0000@10.8.20.15@o2ib6:15/16 lens 296/280 e 0 to 1 dl 1552419527 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Mar 12 12:38:48 fir-md1-s1 kernel: Lustre: 31416:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 346 previous similar messages
Mar 12 12:39:30 fir-md1-s1 kernel: LustreError: 34992:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.20.15@o2ib6) failed to reply to blocking AST (req@ffff8b798eedd100 x1627815889391920 status 0 rc -110), evict it ns: mdt-fir-MDT0000_UUID lock: ffff8b677f2218c0/0x4cd4038d909bca08 lrc: 4/0,0 mode: PR/PR res: [0x20000560a:0x5ec5:0x0].0x0 bits 0x13/0x0 rrc: 56 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed2a33 expref: 78 pid: 31590 timeout: 264509 lvb_type: 0
Mar 12 12:39:30 fir-md1-s1 kernel: LustreError: 138-a: fir-MDT0000: A client on nid 10.8.20.15@o2ib6 was evicted due to a lock blocking callback time out: rc -110
Mar 12 12:39:30 fir-md1-s1 kernel: LustreError: 30507:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 154s: evicting client at 10.8.20.15@o2ib6  ns: mdt-fir-MDT0000_UUID lock: ffff8b677f2218c0/0x4cd4038d909bca08 lrc: 3/0,0 mode: PR/PR res: [0x20000560a:0x5ec5:0x0].0x0 bits 0x13/0x0 rrc: 56 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed2a33 expref: 79 pid: 31590 timeout: 0 lvb_type: 0
Mar 12 12:40:23 fir-md1-s1 kernel: Lustre: fir-MDT0002: haven't heard from client 9e9c0882-3f15-dd96-7d4f-45564e48c021 (at 10.8.20.15@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8b6b55711000, cur 1552419623 expire 1552419473 last 1552419396
Mar 12 12:40:23 fir-md1-s1 kernel: Lustre: Skipped 3 previous similar messages

fir-md1-s2 (MDT0001, MDT0003):

Mar 12 12:42:45 fir-md1-s2 kernel: Lustre: 31774:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552419753/real 1552419753]  req@ffff8eb94fb2f500 x1627816490690992/t0(0) o104->fir-MDT0003@10.8.20.15@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1552419764 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Mar 12 12:42:45 fir-md1-s2 kernel: Lustre: 31774:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Mar 12 12:43:29 fir-md1-s2 kernel: LustreError: 31774:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.20.15@o2ib6) returned error from blocking AST (req@ffff8eb94fb2f500 x1627816490690992 status -107 rc -107), evict it ns: mdt-fir-MDT0003_UUID lock: ffff8edbad0469c0/0x60006bf677e1ceee lrc: 4/0,0 mode: PR/PR res: [0x280000401:0x5:0x0].0x0 bits 0x13/0x0 rrc: 821 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed259b expref: 7 pid: 31773 timeout: 264501 lvb_type: 0
Mar 12 12:43:29 fir-md1-s2 kernel: LustreError: 138-a: fir-MDT0003: A client on nid 10.8.20.15@o2ib6 was evicted due to a lock blocking callback time out: rc -107
Mar 12 12:43:29 fir-md1-s2 kernel: LustreError: 30984:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 56s: evicting client at 10.8.20.15@o2ib6  ns: mdt-fir-MDT0003_UUID lock: ffff8edbad0469c0/0x60006bf677e1ceee lrc: 3/0,0 mode: PR/PR res: [0x280000401:0x5:0x0].0x0 bits 0x13/0x0 rrc: 821 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed259b expref: 8 pid: 31773 timeout: 0 lvb_type: 0
Mar 12 12:43:39 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client 9e9c0882-3f15-dd96-7d4f-45564e48c021 (at 10.8.20.15@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8ed84f2d8c00, cur 1552419819 expire 1552419669 last 1552419592

I think I got the MDT lock table dumps during that time! Filesystem just unblocked while I was finializing dumping it on fir-md1-s2. Attaching fir-md1-s1_dlmtrace_20190312.log.gz and fir-md1-s2_dlmtrace_20190312.log.gz. Thx.

Stephane Thiell added a comment - 12/Mar/19 7:46 PM Wasn't very long... the client was 10.8.20.15@o2ib6 in that case: fir-md1-s1 (MDT0000, MDT0002): Mar 12 12:38:48 fir-md1-s1 kernel: Lustre: 31416:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552419520/real 1552419520] req@ffff8b8c3445ef00 x1627815889426544/t0(0) o106->fir-MDT0000@10.8.20.15@o2ib6:15/16 lens 296/280 e 0 to 1 dl 1552419527 ref 1 fl Rpc:X/2/ffffffff rc 0/-1 Mar 12 12:38:48 fir-md1-s1 kernel: Lustre: 31416:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 346 previous similar messages Mar 12 12:39:30 fir-md1-s1 kernel: LustreError: 34992:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.20.15@o2ib6) failed to reply to blocking AST (req@ffff8b798eedd100 x1627815889391920 status 0 rc -110), evict it ns: mdt-fir-MDT0000_UUID lock: ffff8b677f2218c0/0x4cd4038d909bca08 lrc: 4/0,0 mode: PR/PR res: [0x20000560a:0x5ec5:0x0].0x0 bits 0x13/0x0 rrc: 56 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed2a33 expref: 78 pid: 31590 timeout: 264509 lvb_type: 0 Mar 12 12:39:30 fir-md1-s1 kernel: LustreError: 138-a: fir-MDT0000: A client on nid 10.8.20.15@o2ib6 was evicted due to a lock blocking callback time out: rc -110 Mar 12 12:39:30 fir-md1-s1 kernel: LustreError: 30507:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 154s: evicting client at 10.8.20.15@o2ib6 ns: mdt-fir-MDT0000_UUID lock: ffff8b677f2218c0/0x4cd4038d909bca08 lrc: 3/0,0 mode: PR/PR res: [0x20000560a:0x5ec5:0x0].0x0 bits 0x13/0x0 rrc: 56 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed2a33 expref: 79 pid: 31590 timeout: 0 lvb_type: 0 Mar 12 12:40:23 fir-md1-s1 kernel: Lustre: fir-MDT0002: haven't heard from client 9e9c0882-3f15-dd96-7d4f-45564e48c021 (at 10.8.20.15@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8b6b55711000, cur 1552419623 expire 1552419473 last 1552419396 Mar 12 12:40:23 fir-md1-s1 kernel: Lustre: Skipped 3 previous similar messages fir-md1-s2 (MDT0001, MDT0003): Mar 12 12:42:45 fir-md1-s2 kernel: Lustre: 31774:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552419753/real 1552419753] req@ffff8eb94fb2f500 x1627816490690992/t0(0) o104->fir-MDT0003@10.8.20.15@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1552419764 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 Mar 12 12:42:45 fir-md1-s2 kernel: Lustre: 31774:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 7 previous similar messages Mar 12 12:43:29 fir-md1-s2 kernel: LustreError: 31774:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.20.15@o2ib6) returned error from blocking AST (req@ffff8eb94fb2f500 x1627816490690992 status -107 rc -107), evict it ns: mdt-fir-MDT0003_UUID lock: ffff8edbad0469c0/0x60006bf677e1ceee lrc: 4/0,0 mode: PR/PR res: [0x280000401:0x5:0x0].0x0 bits 0x13/0x0 rrc: 821 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed259b expref: 7 pid: 31773 timeout: 264501 lvb_type: 0 Mar 12 12:43:29 fir-md1-s2 kernel: LustreError: 138-a: fir-MDT0003: A client on nid 10.8.20.15@o2ib6 was evicted due to a lock blocking callback time out: rc -107 Mar 12 12:43:29 fir-md1-s2 kernel: LustreError: 30984:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 56s: evicting client at 10.8.20.15@o2ib6 ns: mdt-fir-MDT0003_UUID lock: ffff8edbad0469c0/0x60006bf677e1ceee lrc: 3/0,0 mode: PR/PR res: [0x280000401:0x5:0x0].0x0 bits 0x13/0x0 rrc: 821 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed259b expref: 8 pid: 31773 timeout: 0 lvb_type: 0 Mar 12 12:43:39 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client 9e9c0882-3f15-dd96-7d4f-45564e48c021 (at 10.8.20.15@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8ed84f2d8c00, cur 1552419819 expire 1552419669 last 1552419592 I think I got the MDT lock table dumps during that time! Filesystem just unblocked while I was finializing dumping it on fir-md1-s2. Attaching fir-md1-s1_dlmtrace_20190312.log.gz and fir-md1-s2_dlmtrace_20190312.log.gz . Thx.

Stephane Thiell added a comment - 12/Mar/19 7:30 PM

Hi Patrick!

We have Lai's patch for ~~LU-12037~~ applied on both MDS now, so...

I'll try to dump the lock tables next time this happens. Thanks!

Stephane Thiell added a comment - 12/Mar/19 7:30 PM Hi Patrick! We have Lai's patch for LU-12037 applied on both MDS now, so... I'll try to dump the lock tables next time this happens. Thanks!

Patrick Farrell (Inactive) added a comment - 12/Mar/19 7:26 PM

Stephane,

Are you able to dump the lock tables? Remember that when you did that before, it was actually that DOM/rename related hang (~~LU-12037~~), not this issue. Having the full table would be very helpful.

I will take a look at timeouts, this should be adjustable.

Patrick Farrell (Inactive) added a comment - 12/Mar/19 7:26 PM Stephane, Are you able to dump the lock tables? Remember that when you did that before, it was actually that DOM/rename related hang ( LU-12037 ), not this issue. Having the full table would be very helpful. I will take a look at timeouts, this should be adjustable.

Stephane Thiell added a comment - 12/Mar/19 7:20 PM

Again, filesystem blocked while a node crashed

Mar 12 12:06:02 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417555/real 1552417555]  req@ffff8b6c796a4b00 x162781588295
Mar 12 12:06:16 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417569/real 1552417569]  req@ffff8b6c796a4b00 x162781588295
Mar 12 12:06:16 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Mar 12 12:06:37 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417590/real 1552417590]  req@ffff8b6c796a4b00 x162781588295
Mar 12 12:06:37 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Mar 12 12:07:12 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417625/real 1552417625]  req@ffff8b6c796a4b00 x162781588295
Mar 12 12:07:12 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Mar 12 12:08:01 fir-md1-s1 kernel: LustreError: 31329:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.10.29@o2ib6) failed to reply to blocking AST (req@ffff8b6c796a4b00 x1627815882950880 status
Mar 12 12:08:01 fir-md1-s1 kernel: LustreError: 138-a: fir-MDT0002: A client on nid 10.8.10.29@o2ib6 was evicted due to a lock blocking callback time out: rc -110
Mar 12 12:08:01 fir-md1-s1 kernel: LustreError: 30507:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 154s: evicting client at 10.8.10.29@o2ib6  ns: mdt-fir-MDT0002_UUID lock: ffff

During that time, we couldn't ls anywhere on /fir (for about 2 minutes, tested from several nodes and subdirectories). Is there a timeout we could lower to mitigate this?

Stephane Thiell added a comment - 12/Mar/19 7:20 PM Again, filesystem blocked while a node crashed Mar 12 12:06:02 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417555/real 1552417555] req@ffff8b6c796a4b00 x162781588295 Mar 12 12:06:16 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417569/real 1552417569] req@ffff8b6c796a4b00 x162781588295 Mar 12 12:06:16 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message Mar 12 12:06:37 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417590/real 1552417590] req@ffff8b6c796a4b00 x162781588295 Mar 12 12:06:37 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Mar 12 12:07:12 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417625/real 1552417625] req@ffff8b6c796a4b00 x162781588295 Mar 12 12:07:12 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Mar 12 12:08:01 fir-md1-s1 kernel: LustreError: 31329:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.10.29@o2ib6) failed to reply to blocking AST (req@ffff8b6c796a4b00 x1627815882950880 status Mar 12 12:08:01 fir-md1-s1 kernel: LustreError: 138-a: fir-MDT0002: A client on nid 10.8.10.29@o2ib6 was evicted due to a lock blocking callback time out: rc -110 Mar 12 12:08:01 fir-md1-s1 kernel: LustreError: 30507:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 154s: evicting client at 10.8.10.29@o2ib6 ns: mdt-fir-MDT0002_UUID lock: ffff During that time, we couldn't ls anywhere on /fir (for about 2 minutes, tested from several nodes and subdirectories). Is there a timeout we could lower to mitigate this?

People

Assignee:: Peter Jones

Reporter:: Stephane Thiell

Votes:: 1 Vote for this issue

Watchers:: 12 Start watching this issue

Dates

Created:: 21/Feb/19 9:54 PM

Updated:: 20/May/22 1:46 AM

Resolved:: 20/May/22 1:46 AM