Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6, Lustre 2.12.0 clients and servers, some clients with 2.12.0 + patch LU-11964
    • Severity: 3

    Description

      We are having more issues with a full 2.12 production setup on Sherlock and Fir. We sometimes notice a global filesystem hang, on all nodes, lasting at least 30 seconds and often more. The filesystem can run fine for 2 hours and then hang for a few minutes. This is impacting production, especially interactive jobs.

      These filesystem hangs could be related to compute nodes rebooting, and they coincide with messages like the following on the MDTs:

      [769459.092993] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784454/real 1550784454]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784461 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [769459.120452] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [769473.130314] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784468/real 1550784468]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784475 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      [769473.157759] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [769494.167799] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784489/real 1550784489]  req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784496 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
      [769494.195248] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      

      I'm not 100% sure, but it looks like the filesystem comes back online once these messages stop on the MDTs. As far as I know, there are no relevant logs on the clients though...

      Please note that we're also in the process of fixing the locking issue described in LU-11964 by deploying a patched 2.12.0.
      Is this a known issue in 2.12? Any patch we could try, or other suggestions, would be welcome.
      Thanks,
      Stephane
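
      For reference, the dlmtrace/dk attachments below are Lustre kernel debug dumps; a minimal sketch of how such logs are typically captured on an MDS is shown here (the output path is illustrative, and the exact debug flags used for the attached logs are an assumption):

      # add LDLM tracing to the current kernel debug mask (assumption: defaults otherwise)
      lctl set_param debug=+dlmtrace
      # ...reproduce the hang...
      # dump the kernel debug buffer to a file for attachment
      lctl dk /tmp/fir-md1-s1_dk.log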

      Attachments

        1. fir-md1-s1_20190715.log
          2.67 MB
        2. fir-md1-s1_dk20190225.log.gz
          21.15 MB
        3. fir-md1-s1_dlmtrace_20190312.log.gz
          704 kB
        4. fir-md1-s1-20190228-1.log.gz
          6.10 MB
        5. fir-md1-s1-20190228-2.log.gz
          747 kB
        6. fir-md1-s1-20190508.log
          1.05 MB
        7. fir-md1-s1-kern-syslog-20190228.log
          598 kB
        8. fir-md1-s2_dlmtrace_20190312.log.gz
          11.62 MB
        9. fir-md1-s2-20190508.log
          573 kB
        10. fir-mdt-grafana-fs-hang_mdt1+3_20190304.png
          268 kB

        Issue Links

          Activity

            [LU-11989] Global filesystem hangs in 2.12
            adilger Andreas Dilger made changes -
            Resolution New: Cannot Reproduce [ 5 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            sthiell Stephane Thiell added a comment -

            Hi Olaf – No, we haven't seen such long hangs for quite a while now, which is a good thing. They were very frequent when using mkdir with DNEv2, but the issue seems to have been resolved. We're still using 2.12.3 + a few patches (server-side), but those have all landed in 2.12.4.
            ofaaland Olaf Faaland added a comment -

            Stephane, are you still seeing these long hangs?  Asking because we aren't yet using DNE2 but would like to (we originally intended to with Lustre 2.8, decided against it, and are now revisiting the idea).
            jstroik Jesse Stroik added a comment - - edited

            I've been observing something similar when we have clients that have mistakenly associated with the MDS using an NID behind a NAT. In this example, nodes came up with LNet bound to a private cluster interface instead of an InfiniBand interface.

            With a Lustre 2.10 client and a 2.12 server, we can reproduce this behavior easily. The filesystem hangs for all clients. A relevant snippet from our logs is included below - let me know if I can provide any additional information.

             

            Nov 26 22:04:36 apollo-mds kernel: [4167831.563633] Lustre: 62902:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1574805865/real 1574805865] req@ffff8bc8d0153a80 x1646935336777456/t0(0) o104->apollo-MDT0000@10.23.255.254@tcp:15/16 lens 296/224 e 0 to 1 dl 1574805876 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 
            Nov 26 22:04:36 apollo-mds kernel: [4167831.563643] Lustre: 62902:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 94 previous similar messages 
            Nov 26 22:07:12 apollo-mds kernel: [4167986.760003] Lustre: apollo-MDT0000: Connection restored to 2e47e1dd-58f1-ea9e-4af1-7b2ae2de0c79 (at 10.23.255.254@tcp) 
            Nov 26 22:07:12 apollo-mds kernel: [4167986.760011] Lustre: Skipped 22 previous similar messages 
            Nov 26 22:09:33 apollo-mds kernel: [4168128.561768] LustreError: 77047:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8bc6fcff4800 ns: mdt-apollo-MDT0000_UUID lock: ffff8bce53dcc800/0xe6373f3e00450a97 lrc: 3/0,0 mode: PR/PR res: [0x2000013a1:0x1:0x0].0x0 bits 0x13/0x0 rrc: 19 type: IBT flags: 0x50200400000020 nid: 10.23.255.218@tcp remote: 0x195afc50b177c9d2 expref: 6 pid: 77047 timeout: 0 lvb_type: 0 
            Nov 26 22:14:24 apollo-mds kernel: [4168419.322349] LustreError: 59085:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.23.255.218@tcp) failed to reply to blocking AST (req@ffff8bb79b703a80 x1646935336873728 status 0 rc -110), evict it ns: mdt-apollo-MDT0000_UUID lock: ffff8bc53a36e9c0/0xe6373f3e00514e3a lrc: 4/0,0 mode: PR/PR res: [0x20000c990:0x2302:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.23.255.218@tcp remote: 0x195afc50b177caff expref: 531 pid: 67366 timeout: 4168744 lvb_type: 0
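
            As a side note on the misconfiguration described above: a common way to keep clients from registering a NID on the wrong (NAT'd) interface is to pin LNet to the intended fabric in the module options. A minimal sketch, assuming the InfiniBand interface is ib0 (the interface name and file path are assumptions, not taken from this cluster):

            # /etc/modprobe.d/lustre.conf on the clients (sketch; ib0 is assumed)
            # restrict LNet to the IB interface so the client registers an o2ib NID
            # instead of an address on the private cluster network
            options lnet networks="o2ib(ib0)"

            After reloading the lnet module, "lctl list_nids" should show only the intended NID.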
            
            pjones Peter Jones made changes -
            Assignee Original: Patrick Farrell [ pfarrell ] New: Peter Jones [ pjones ]

            sthiell Stephane Thiell added a comment -

            This issue is still ongoing, and it still seems to happen when we use mkdir with DNEv2 (each time we hit it, there is a mkdir using DNEv2 running on sh-hn01 (10.9.0.1@o2ib4)).

            The full filesystem was blocked again today for more than 10 minutes while a mkdir was running there. There are backtraces like the following from when things started to recover:

            Jul 15 15:34:26 fir-md1-s1 kernel: LNet: Service thread pid 23077 was inactive for 200.25s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Jul 15 15:34:26 fir-md1-s1 kernel: LNet: Skipped 2 previous similar messages
            Jul 15 15:34:26 fir-md1-s1 kernel: Pid: 23077, comm: mdt02_042 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
            Jul 15 15:34:26 fir-md1-s1 kernel: Call Trace:
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0d92e55>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0d93cbc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc14744bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1474b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1474f0c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc148c44c>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc148fd88>] mdt_reint_setattr+0x6c8/0x1340 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1491ee3>] mdt_reint_rec+0x83/0x210 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc1470143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc147b4a7>] mdt_reint+0x67/0x140 [mdt]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0e3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0dd4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffc0dd863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffff84ec1c31>] kthread+0xd1/0xe0
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffff85574c24>] ret_from_fork_nospec_begin+0xe/0x21
            Jul 15 15:34:26 fir-md1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1563230066.23077
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.8.9.9@o2ib6  ns: mdt-fir-MDT0000_UUID lock: ffff8f31c0f9e540/0x5d9ee640b61b6f03 lrc: 3/0,0 mode: PR/PR res: [0x2000297f6:0x882:0x0].0x0 bits 0x5b/0x0 rrc: 8 type: IBT flags: 0x60200400000020 nid: 10.8.9.9@o2ib6 remote: 0x243de7f26418e7fc expref: 16849 pid: 25680 timeout: 2345126 lvb_type: 0
            Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 10 previous similar messages
            Jul 15 15:34:36 fir-md1-s1 kernel: LNet: Service thread pid 23077 completed after 209.64s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
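
            The "system was overloaded" note in the last message above usually points at MDT service thread and timeout tuning. A small sketch of how those settings can be inspected on the MDS (the parameter names assume a standard 2.12 MDS; nothing here is specific to Fir):

            # number of MDT service threads currently running vs. the configured maximum
            lctl get_param mds.MDS.mdt.threads_started mds.MDS.mdt.threads_max
            # basic RPC timeout and adaptive-timeout settings
            lctl get_param timeout at_min at_max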
            

            I'm attaching the kernel logs from MDS `fir-md1-s1` as fir-md1-s1_20190715.log (this is the server hosting MDT0000, MDT0002 and the MGT).

            WARNING: these are the full kernel logs since the last MDS restart on June 18, 2019 (not too bad for 2.12!). Our event started on July 15, 2019 at about 15:20 in the log; the first user ticket arrived at 15:24.

            The mkdir that was blocked during this event was for the following directory, automatically striped on MDT0, and it eventually succeeded:

            [root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users/kmoulin
            lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
            

            As a reminder:

            [root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users
            lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
            mdtidx		 FID[seq:oid:ver]
                 0		 [0x200000400:0x5:0x0]		
                 1		 [0x240000402:0x5:0x0]		
                 2		 [0x2c0000400:0x5:0x0]		
                 3		 [0x280000401:0x5:0x0]	
            

            The good thing is that the MDS recovered by itself, but it took some time: at least 10 minutes.
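
            For context, a striped directory like /fir/users shown above would typically be created with something along these lines (a sketch only, not the exact command used on Fir; the stripe count and starting MDT index are taken from the getdirstripe output):

            # create a DNE2 directory striped across 4 MDTs, starting at MDT0000
            lfs mkdir -c 4 -i 0 /fir/users
            # equivalently: lfs setdirstripe -c 4 -i 0 /fir/users

            New subdirectories created underneath it (such as /fir/users/kmoulin above) are each placed on one of those MDTs according to the fnv_1a_64 hash of the name, and are not themselves striped unless requested.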

            Thanks,
            Stephane

             

            sthiell Stephane Thiell made changes -
            Attachment New: fir-md1-s1_20190715.log [ 33210 ]

            sthiell Stephane Thiell added a comment -

            We had the issue again this morning, while Kilian was creating a bunch of new accounts (mkdir) and also rebooting nodes. The mkdirs were done in /fir/users and /fir/groups, which are DNE2-enabled.

            Not everything was blocked, but find in /fir/users and /fir/groups was hanging on all compute nodes. However, I was able to browse /fir/users/sthiell/ without issue (DNE1 only).

             

            Attaching the full kernel logs from both MDS nodes as fir-md1-s1-20190508.log and fir-md1-s2-20190508.log. The issue happened around May 08 11:23 in the logs and is particularly visible in fir-md1-s2-20190508.log.

             

            After a few minutes, things went back to normal.

            sthiell Stephane Thiell made changes -
            Attachment New: fir-md1-s2-20190508.log [ 32549 ]
            sthiell Stephane Thiell made changes -
            Attachment New: fir-md1-s1-20190508.log [ 32548 ]

            People

              Assignee: pjones Peter Jones
              Reporter: sthiell Stephane Thiell
              Votes: 1
              Watchers: 12
