Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6, Lustre 2.12.0 clients and servers, some clients with 2.12.0 + patch LU-11964
    • Severity: 3

    Description

      We are having more issues with a full 2.12 production setup on Sherlock and Fir: we sometimes notice a global filesystem hang, on all nodes, lasting at least 30 seconds and often more. The filesystem can run fine for two hours and then hang for a few minutes. This is impacting production, especially interactive jobs.

      These filesystem hangs could be related to compute nodes rebooting; they seem to coincide with messages like the following on the MDTs:

      [769459.092993] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784454/real 1550784454]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784461 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [769459.120452] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [769473.130314] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784468/real 1550784468]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784475 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      [769473.157759] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [769494.167799] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784489/real 1550784489]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784496 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      [769494.195248] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      

      I'm not 100% sure, but it sounds like the filesystem comes back online when these messages stop on the MDTs. There are no logs on the clients though, as far as I know...
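      (For reference, next time this happens we could try to capture client-side evidence with the standard lctl debug facilities; a minimal sketch, where the debug flags and output path are just an example:)

      # on an affected client, while the hang is in progress
      lctl set_param debug=+rpctrace   # add RPC tracing to the debug mask
      lctl set_param debug=+dlmtrace   # add LDLM lock tracing
      lctl clear                       # empty the in-kernel debug buffer
      # ... wait for / reproduce the hang ...
      lctl dk /tmp/client-debug.$(hostname).log   # dump the debug buffer to a file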

      Please note that we're also in the process of fixing the locking issue described in LU-11964 by deploying a patched 2.12.0.
      Is this a known issue in 2.12? Any available patch we could try, or other suggestions, would be welcome.
      Thanks,
      Stephane

      Attachments

        1. fir-md1-s1_20190715.log
          2.67 MB
        2. fir-md1-s1_dk20190225.log.gz
          21.15 MB
        3. fir-md1-s1_dlmtrace_20190312.log.gz
          704 kB
        4. fir-md1-s1-20190228-1.log.gz
          6.10 MB
        5. fir-md1-s1-20190228-2.log.gz
          747 kB
        6. fir-md1-s1-20190508.log
          1.05 MB
        7. fir-md1-s1-kern-syslog-20190228.log
          598 kB
        8. fir-md1-s2_dlmtrace_20190312.log.gz
          11.62 MB
        9. fir-md1-s2-20190508.log
          573 kB
        10. fir-mdt-grafana-fs-hang_mdt1+3_20190304.png
          268 kB

        Issue Links

          Activity

            [LU-11989] Global filesystem hangs in 2.12

            sthiell Stephane Thiell added a comment -

            Hi Olaf – No, we haven't seen such long hangs for quite a while now, which is a good thing. They were very frequent when using mkdir with DNEv2, but the issue seems to have been resolved. We're still using 2.12.3 + a few patches (server-side), but they have all landed in 2.12.4.

            ofaaland Olaf Faaland added a comment -

            Stephane, are you still seeing these long hangs? Asking because we aren't yet using DNE2 but would like to (we originally intended to with Lustre 2.8 but decided against it, and are now revisiting the idea).

            jstroik Jesse Stroik added a comment - edited

            I've been observing something similar when we have clients that mistakenly associated with the MDS using a NID behind a NAT. In this example, nodes came up with LNet bound to a private cluster interface instead of an InfiniBand interface.

            With a Lustre 2.10 client and a 2.12 server, we can reproduce this behavior easily. The filesystem hangs for all clients. A relevant snippet from our logs is included below - let me know if I can provide any additional information.

             

            Nov 26 22:04:36 apollo-mds kernel: [4167831.563633] Lustre: 62902:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1574805865/real 1574805865] req@ffff8bc8d0153a80 x1646935336777456/t0(0) o104->apollo-MDT0000@10.23.255.254@tcp:15/16 lens 296/224 e 0 to 1 dl 1574805876 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 
            Nov 26 22:04:36 apollo-mds kernel: [4167831.563643] Lustre: 62902:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 94 previous similar messages 
            Nov 26 22:07:12 apollo-mds kernel: [4167986.760003] Lustre: apollo-MDT0000: Connection restored to 2e47e1dd-58f1-ea9e-4af1-7b2ae2de0c79 (at 10.23.255.254@tcp) 
            Nov 26 22:07:12 apollo-mds kernel: [4167986.760011] Lustre: Skipped 22 previous similar messages 
            Nov 26 22:09:33 apollo-mds kernel: [4168128.561768] LustreError: 77047:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8bc6fcff4800 ns: mdt-apollo-MDT0000_UUID lock: ffff8bce53dcc800/0xe6373f3e00450a97 lrc: 3/0,0 mode: PR/PR res: [0x2000013a1:0x1:0x0].0x0 bits 0x13/0x0 rrc: 19 type: IBT flags: 0x50200400000020 nid: 10.23.255.218@tcp remote: 0x195afc50b177c9d2 expref: 6 pid: 77047 timeout: 0 lvb_type: 0 
            Nov 26 22:14:24 apollo-mds kernel: [4168419.322349] LustreError: 59085:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.23.255.218@tcp) failed to reply to blocking AST (req@ffff8bb79b703a80 x1646935336873728 status 0 rc -110), evict it ns: mdt-apollo-MDT0000_UUID lock: ffff8bc53a36e9c0/0xe6373f3e00514e3a lrc: 4/0,0 mode: PR/PR res: [0x20000c990:0x2302:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.23.255.218@tcp remote: 0x195afc50b177caff expref: 531 pid: 67366 timeout: 4168744 lvb_type: 0
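
            (For what it's worth, one way to keep nodes from registering a NAT'd NID in the first place is to pin LNet to the intended interface via the lnet module options; a sketch only, the interface name ib0 is an assumption:)

            # /etc/modprobe.d/lnet.conf on the client:
            # bring up only the o2ib network on the IB interface, so the node
            # never registers a tcp NID behind the NAT with the servers
            options lnet networks="o2ib(ib0)"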
            

            sthiell Stephane Thiell added a comment -

            This issue is still ongoing, and it still seems to happen when we use mkdir with DNEv2 (each time we hit this issue, there is a mkdir using DNEv2 running on sh-hn01 (10.9.0.1@o2ib4)).

            The full filesystem was blocked again today for more than 10 minutes while a mkdir was running there. There are some backtraces like this from when things started to recover:

            Jul 15 15:34:26 fir-md1-s1 kernel: LNet: Service thread pid 23077 was inactive for 200.25s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Jul 15 15:34:26 fir-md1-s1 kernel: LNet: Skipped 2 previous similar messages
            Jul 15 15:34:26 fir-md1-s1 kernel: Pid: 23077, comm: mdt02_042 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            Jul 15 15:34:26 fir-md1-s1 kernel: Call Trace:
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0d92e55>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0d93cbc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc14744bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1474b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1474f0c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc148c44c>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc148fd88>] mdt_reint_setattr+0x6c8/0x1340 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1491ee3>] mdt_reint_rec+0x83/0x210 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1470143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc147b4a7>] mdt_reint+0x67/0x140 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0e3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0dd4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0dd863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffff84ec1c31>] kthread+0xd1/0xe0
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffff85574c24>] ret_from_fork_nospec_begin+0xe/0x21
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1563230066.23077
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.8.9.9@o2ib6  ns: mdt-fir-MDT0000_UUID lock: ffff8f31c0f9e540/0x5d9ee640b61b6f03 lrc: 3/0,0 mode: PR/PR res: [0x2000297f6:0x882:0x0].0x0 bits 0x5b/0x0 rrc: 8 type: IBT flags: 0x60200400000020 nid: 10.8.9.9@o2ib6 remote: 0x243de7f26418e7fc expref: 16849 pid: 25680 timeout: 2345126 lvb_type: 0
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 10 previous similar messages
            Jul 15 15:34:36 fir-md1-s1 kernel: LNet: Service thread pid 23077 completed after 209.64s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
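
            (If useful next time, we can also grab the MDS-side DLM lock counters while the hang is in progress; a rough sketch using standard lctl parameters, with a namespace glob matching what appears in the logs above:)

            # on fir-md1-s1, during the hang
            lctl get_param ldlm.namespaces.mdt-fir-MDT000*.lock_count       # granted locks per MDT namespace
            lctl get_param ldlm.namespaces.mdt-fir-MDT000*.resource_count   # resources currently holding locks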
            

            I'm attaching MDS logs from fir-md1-s1 as fir-md1-s1_20190715.log (the server hosting MDT0000, MDT0002 and the MGT).

            WARNING: this is the full kernel log since the last MDS restart on June 18, 2019! Not too bad for 2.12! Our event started on July 15, 2019 at about 15:20 in the log. The first user ticket arrived at 15:24.

            The mkdir that was blocked during this event (it eventually succeeded) was for the following directory, which was automatically placed on MDT0:

            [root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users/kmoulin
            lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
            

            As a reminder:

            [root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users
            lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
            mdtidx		 FID[seq:oid:ver]
                 0		 [0x200000400:0x5:0x0]		
                 1		 [0x240000402:0x5:0x0]		
                 2		 [0x2c0000400:0x5:0x0]		
                 3		 [0x280000401:0x5:0x0]	
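
            (For reference, this kind of DNEv2 striped directory is created with lfs mkdir / lfs setdirstripe; a sketch only, not necessarily the exact commands used when /fir was set up:)

            # create a directory striped across 4 MDTs, starting at MDT index 0
            lfs mkdir -c 4 -i 0 /fir/users
            # show its own striping and any default striping applied to new subdirectories
            lfs getdirstripe /fir/users
            lfs getdirstripe -D /fir/users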
            

            The good thing is that the MDS recovered by itself, but it took some time: at least 10 minutes.

            Thanks,
            Stephane

             


            sthiell Stephane Thiell added a comment -

            We still had the issue this morning, while Kilian was creating a bunch of new accounts (mkdir) and nodes were also being rebooted. The mkdirs were done in /fir/users and /fir/groups, which are DNE2 enabled.

            Not everything was blocked, but a find in /fir/users and /fir/groups was hanging on all compute nodes. However, I was able to browse /fir/users/sthiell/ without issue (DNE1 only).
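
            (Next time, a quick way to see from a client which MDTs are affected could be something like the sketch below; standard lfs commands, nothing exotic:)

            lfs df /fir                    # a hung MDT usually stalls or is missing here
            lfs check servers              # ping all MDC/OSC connections (run as root)
            lfs getdirstripe /fir/users    # confirm which MDTs back the affected directory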

             

            Attaching full kernel logs from both MDSes as fir-md1-s1-20190508.log and fir-md1-s2-20190508.log. The issue happened around May 08 11:23 in the logs and is particularly visible in fir-md1-s2-20190508.log.

             

            After a few minutes, things went back to normal.


            sthiell Stephane Thiell added a comment -

            Just wanted to follow up on this one, as today we noticed the same issue against regal, our old 2.8-based scratch filesystem mounted under /regal, after rebooting a few compute nodes. We had never seen this behavior before upgrading our clients to 2.12, so we strongly suspect a regression there. It would be odd for us to have missed this kind of behavior for several years (and it's not like we never reboot nodes). And there is no DNE, DoM, or PFL involved there.

            Server (2.8) logs:

            Apr 19 17:38:56 regal-mds1 kernel: Lustre: regal-MDT0000: haven't heard from client e07871c1-593b-18ef-bb42-04c1e04c4db9 (at 10.8.8.24@o2ib6) in 219 seconds. I think it's dead, and I am evicting it. exp ffff881bdd5ef000, cur 1555713536 expire 1555713386 last 1555713317
            Apr 19 17:38:56 regal-mds1 kernel: Lustre: Skipped 10 previous similar messages
            Apr 19 17:42:40 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713749/real 1555713749]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713760 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
            Apr 19 17:42:51 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713760/real 1555713760]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713771 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
            Apr 19 17:43:02 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713771/real 1555713771]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713782 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
            Apr 19 17:43:13 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713782/real 1555713782]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713793 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
            Apr 19 17:43:24 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713793/real 1555713793]  req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713804 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
            Apr 19 17:43:24 regal-mds1 kernel: LustreError: 34779:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client (nid 10.8.27.23@o2ib6) returned error from blocking AST (req status -107 rc -107), evict it ns: mdt-regal-MDT0000_UUID lock: ffff880223094d00/0x6c9a0352c4202513 lrc: 4/0,0 mode: PR/PR res: [0x2000772e5:0x8a:0x0].0x0 bits 0x13 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.8.27.23@o2ib6 remote: 0xd278f5eabd47f9e6 expref: 14 pid: 35256 timeout: 7453544476 lvb_type: 0
            Apr 19 17:43:24 regal-mds1 kernel: LustreError: 138-a: regal-MDT0000: A client on nid 10.8.27.23@o2ib6 was evicted due to a lock blocking callback time out: rc -107
             

            My colleague had the following command blocked during that time:

            [root@sh-hn01 regal]# mkdir /regal/.deleted/hasantos/

            But no logs on the client (2.12).

            Access to /regal was restored after a few minutes.
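
            (One mitigation we may try on our side is to make sure clients drop their Lustre state cleanly before a reboot, so the MDS isn't left waiting on blocking ASTs from a node that is already gone; a sketch of the shutdown step, assuming the mounts can actually be unmounted at that point:)

            # on a client scheduled for reboot
            umount -a -t lustre    # disconnect from MDTs/OSTs and cancel held locks
            lustre_rmmod           # unload the Lustre/LNet modules
            reboot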


            pfarrell Patrick Farrell (Inactive) added a comment -

            LU-12064 has been opened to look at possible adjustments to the adaptive timeout code. Note this is not a fix for this issue, just something that should hopefully reduce delays.


            sthiell Stephane Thiell added a comment -

            Thanks!

            Added the following to our /etc/modprobe.d/lustre.conf:

            # LU-11989
            options ptlrpc ldlm_enqueue_min=30
            options ptlrpc at_max=20
            

            And restarted the MGS and MDTs.

            $ clush -w@mds -b cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min 
            ---------------
            fir-md1-s[1-2] (2)
            ---------------
            30
            $ clush -w@mds -b cat /sys/module/ptlrpc/parameters/at_max
            ---------------
            fir-md1-s[1-2] (2)
            ---------------
            20
            
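            (As an aside, at_max can apparently also be set cluster-wide and persistently from the MGS with lctl conf_param, if we ever want to avoid the module-option route; a sketch, using the same value we are testing here. ldlm_enqueue_min appears to be a module parameter only, so that one stays in modprobe.d:)

            # on the MGS
            lctl conf_param fir.sys.at_max=20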

            People

              pjones Peter Jones
              sthiell Stephane Thiell
              Votes: 1
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: