Lustre / LU-12037
Possible DNE issue leading to hung filesystem

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6 - Servers: 2.12.0, Clients: 2.12.0 + patch LU-11964
    • Severity: 2

    Description

      A new issue with 2.12 first manifested as directories becoming inaccessible from a specific client (10.9.0.1@o2ib4) and ended in a full filesystem deadlock. Restarting all MDTs temporarily resolved the issue...

      This is the console of fir-md1-s1 (which serves MDT0000 and MDT0002) when the directories started to become inaccessible:

      [1378713.475739] LNet: Service thread pid 48757 was inactive for 212.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [1378713.488774] LNet: Skipped 3 previous similar messages
      [1378713.494012] LustreError: dumping log to /tmp/lustre-log.1551393702.48757
      [1378716.547830] LustreError: dumping log to /tmp/lustre-log.1551393705.22268
      [1378721.106914] Lustre: fir-MDT0000-osp-MDT0002: Connection to fir-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [1378721.108926] LustreError: 51417:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393409, 300s ago); not entering recovery in server code, just going back to sleep ns:
      [1378721.161814] LustreError: 51459:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393409, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378721.202150] Lustre: fir-MDT0000: Received new LWP connection from 0@lo, removing former export from same NID
      [1378721.943952] LustreError: 22249:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393410, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378721.984222] LustreError: 22249:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
      [1378724.392997] LustreError: 51445:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393413, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378728.847103] LustreError: 22142:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393417, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378729.910121] LustreError: 21796:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393418, 300s ago); not entering recovery in server code, just going back to sleep ns:
      [1378729.949613] LustreError: 21796:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 4 previous similar messages
      [1378736.197272] LustreError: 21546:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393424, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378743.454431] LustreError: 21974:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393432, 300s ago); not entering recovery in server code, just going back to sleep ns:
      [1378753.250665] LustreError: 22233:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393441, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378753.290920] LustreError: 22233:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
      [1378771.845093] LNet: Service thread pid 22241 was inactive for 236.44s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [1378771.858133] LNet: Skipped 1 previous similar message
      [1378771.863282] LustreError: dumping log to /tmp/lustre-log.1551393760.22241
      [1378774.917158] LustreError: dumping log to /tmp/lustre-log.1551393763.21826
      [1378789.253489] LustreError: dumping log to /tmp/lustre-log.1551393777.51431
      [1378796.422652] LustreError: dumping log to /tmp/lustre-log.1551393785.22159
      [1378800.517748] LustreError: dumping log to /tmp/lustre-log.1551393789.47900
      [1378801.468777] LustreError: 48757:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393490, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378801.509027] LustreError: 48757:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 3 previous similar messages
      [1378801.541773] LustreError: dumping log to /tmp/lustre-log.1551393790.21878
      [1378833.603516] LustreError: 22159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393522, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
      [1378833.643766] LustreError: 22159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
      [1378878.343542] LNet: Service thread pid 22157 was inactive for 313.10s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [1378878.356579] LNet: Skipped 5 previous similar messages
      [1378878.361831] LustreError: dumping log to /tmp/lustre-log.1551393867.22157
      [1378938.760938] LNet: Service thread pid 21779 was inactive for 362.88s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [1378938.778069] LNet: Skipped 3 previous similar messages
      [1378938.783305] Pid: 21779, comm: mdt01_015 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      [1378938.793247] Call Trace:
      [1378938.795883]  [<ffffffffc12340bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
      [1378938.802998]  [<ffffffffc123612f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
      [1378938.810278]  [<ffffffffc12389ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
      [1378938.817141]  [<ffffffffc16507a2>] osp_md_object_lock+0x162/0x2d0 [osp]
      [1378938.823899]  [<ffffffffc1a7be03>] lod_object_lock+0xf3/0x7b0 [lod]
      [1378938.830299]  [<ffffffffc1afcd3e>] mdd_object_lock+0x3e/0xe0 [mdd]
      [1378938.836613]  [<ffffffffc199b2e1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
      [1378938.844053]  [<ffffffffc199b87a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [1378938.850995]  [<ffffffffc19bb022>] mdt_reint_rename_or_migrate.isra.51+0x362/0x860 [mdt]
      [1378938.859211]  [<ffffffffc19bb553>] mdt_reint_rename+0x13/0x20 [mdt]
      [1378938.865606]  [<ffffffffc19bb5e3>] mdt_reint_rec+0x83/0x210 [mdt]
      [1378938.871828]  [<ffffffffc1998133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [1378938.878586]  [<ffffffffc19a3497>] mdt_reint+0x67/0x140 [mdt]
      [1378938.884456]  [<ffffffffc12d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [1378938.891594]  [<ffffffffc127592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1378938.899485]  [<ffffffffc127925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [1378938.905998]  [<ffffffff8e8c1c31>] kthread+0xd1/0xe0
      [1378938.911090]  [<ffffffff8ef74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1378938.917767]  [<ffffffffffffffff>] 0xffffffffffffffff
      [1378938.922966] LustreError: dumping log to /tmp/lustre-log.1551393927.21779
      
      

      The MDT restarts occurred at ~15:47.

      I was able to gather some logs:

      • a tar of the lustre-log files above (automatically generated) is available on the FTP site (fir-md1-s1-lustre-log-20190228.tar.gz)
      • I ran the commands that Patrick provided in LU-11989 twice; the results are available in fir-md1-s1-20190228-1.log.gz and fir-md1-s1-20190228-2.log.gz, attached to this ticket (a sketch of the typical dump procedure follows this list)
      • fir-md1-s1 full kernel syslog dump in fir-md1-s1-kern-syslog-20190228.log
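
      For reference: debug dumps like the attached dk*/dkdlmtrace* files are typically produced with lctl's debug facility. A minimal sketch, not necessarily the exact LU-11989 procedure:

      # Sketch only -- the exact commands came from Patrick in LU-11989.
      lctl set_param debug=+dlmtrace   # add DLM lock tracing to the kernel debug mask
      lctl dk /tmp/fir-md1-s1-dk.log   # dump and clear the Lustre kernel debug buffer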

      A bit more about our DNE configuration:
      We use DNEv2 only to balance group and user directories across different MDTs (each such directory lives on a single MDT), but that's all. Basically:

      $ lfs getdirstripe /fir
      lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
      $ lfs getdirstripe /fir/users
      lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
      mdtidx           FID[seq:oid:ver]
           0           [0x200000400:0x5:0x0]  
           1           [0x240000402:0x5:0x0]  
           2           [0x2c0000400:0x5:0x0]  
           3           [0x280000401:0x5:0x0]  
      $ lfs getdirstripe /fir/groups
      lmv_stripe_count: 4 lmv_stripe_offset: 1 lmv_hash_type: fnv_1a_64
      mdtidx           FID[seq:oid:ver]
           1           [0x240000400:0x8:0x0]  
           2           [0x2c0000401:0x6:0x0]  
           3           [0x280000403:0x2:0x0]  
           0           [0x200000401:0x6:0x0]  
      $ lfs getdirstripe /fir/users/sthiell
      lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
      $ lfs getdirstripe /fir/users/kilian
      lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none
      
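      For reference: striped directories like those shown above are created with lfs mkdir. The following is an assumed reconstruction from the getdirstripe output, not the exact commands used on fir:

      # Assumed reconstruction -- flags inferred from the output above.
      lfs mkdir -c 4 -i 0 -H fnv_1a_64 /fir/users   # 4 stripes, starting at MDT0000
      lfs mkdir -i 1 /fir/users/kilian              # unstriped, placed on MDT0001
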

      Thanks!

      Attachments

        1. dk.fir-md1-s1.gz
          13.62 MB
        2. dk-full.fir-md1-s1.gz
          17.88 MB
        3. fir-md1-s1_20190304-dk.log.gz
          7.55 MB
        4. fir-md1-s1_20190403-dk.log.gz
          3.00 MB
        5. fir-md1-s1_20190412.log
          4.70 MB
        6. fir-md1-s1_console_full_20190309.log.gz
          424 kB
        7. fir-md1-s1_dkdlmtrace_20190314.log.gz
          46 kB
        8. fir-md1-s1_dkdlmtrace_20190327.log.gz
          562 kB
        9. fir-md1-s1_kern_20190314.log.gz
          155 kB
        10. fir-md1-s1_sysrq-t_20190308.log.gz
          343 kB
        11. fir-md1-s1-20190228-1.log.gz
          6.10 MB
        12. fir-md1-s1-20190228-2.log.gz
          747 kB
        13. fir-md1-s1-console-20190403.gz
          375 kB
        14. fir-md1-s1-kern_20190327.log.gz
          172 kB
        15. fir-md1-s1-kern-syslog-20190228.log
          598 kB
        16. fir-md1-s2_20190304-dk.log.gz
          1.75 MB
        17. fir-md1-s2_20190403-dk.log.gz
          1.76 MB
        18. fir-md1-s2_console_full_20190309.log.gz
          516 kB
        19. fir-md1-s2_dkdlmtrace_20190314.log.gz
          2.51 MB
        20. fir-md1-s2_dkdlmtrace_20190327.log.gz
          9.69 MB
        21. fir-md1-s2_kern_20190314.log.gz
          179 kB
        22. fir-md1-s2_sysrq-t_20190308.log.gz
          358 kB
        23. fir-md1-s2-console-20190403.gz
          362 kB
        24. fir-md1-s2-dlmtrace_20190311.log.gz
          4.26 MB
        25. fir-md1-s2-kern_20190327.log.gz
          133 kB


          Activity


            sthiell Stephane Thiell added a comment -

            Because mdt.*.enable_remote_rename=0 was impacting performance (it took a very long time for some users to mv within the filesystem), we decided to set enable_remote_rename=1 again. We did that last week and have survived until now, so I guess it was a wrong track. It was the other fixes that really helped, like async_discard or LU-11285 "ldlm: reprocess whole waiting queue for IBITS". I think this one can be closed as soon as LU-11285 has landed...
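
            For reference: enable_remote_rename is the tunable added by the LU-12037 patches. Assuming the standard lctl interface, toggling it would look like this (a sketch, not the exact commands used at the site):

            # Run set_param on each MDS; the -P form is run on the MGS and persists.
            lctl set_param mdt.*.enable_remote_rename=1      # allow cross-MDT rename
            lctl set_param -P mdt.*.enable_remote_rename=1   # persistent variant
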
            pjones Peter Jones added a comment -

            Yes, I think that switching to LU-11359 makes sense.

            sthiell Stephane Thiell added a comment - edited

            OK, no problem. Should I post updates on the DoM locking issue (mdt_dom_discard_data) in LU-11359 then? We have only applied the patch to ~400 clients so far, so the issue isn't resolved yet. Today was very difficult, with a lot of issues again; tomorrow we'll patch the rest of the ~1000 clients more aggressively with https://review.whamcloud.com/#/c/34071/11/

            Also, we hit LU-12120 while restarting the MDTs multiple times today while trying to resolve the DoM locking issues:

            fir-io1-s1:[Sun Apr 21 17:06:43 2019][6239672.584384] LustreError: 1474:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io1-s1:[Sun Apr 21 18:49:41 2019][ 6062.486131] LustreError: 65943:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io1-s2:[Sun Apr 21 16:46:46 2019][6242707.929072] LustreError: 108555:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io2-s2:[Sun Apr 21 16:50:04 2019][8112352.271586] LustreError: 3051:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io3-s1:[Sun Apr 21 16:49:47 2019][8112770.406551] LustreError: 65916:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io3-s1:[Sun Apr 21 16:49:47 2019][8112770.417849] LustreError: 65963:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io3-s1:[Sun Apr 21 16:49:47 2019][8112770.443818] LustreError: 62284:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            fir-io3-s2:[Sun Apr 21 17:06:45 2019][6239507.696718] LustreError: 37106:0:(tgt_grant.c:565:tgt_grant_incoming()) LBUG
            

            After many restart attempts, the filesystem seems to have stabilized at this point, for now...

            pjones Peter Jones added a comment -

            OK, for release tracking purposes, let's mark this ticket resolved, because both patches have landed for 2.13 and will be in 2.12.1. There is no shortage of other tickets if further action is still needed.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34616/
            Subject: LU-12037 mdt: add option for cross-MDT rename
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 9cbefa99f234586c1419243a4d2c136e4f405b29


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34410/
            Subject: LU-12037 mdt: add option for cross-MDT rename
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c6f3d533542b9462b5b8df95183d80321d4d9c34


            sthiell Stephane Thiell added a comment -

            FYI, we had another "event", so we decided to update the servers; now we're slowly starting to redeploy the clients, which will take a while.

            Our first client with async_discard:

            [root@sh-101-60 ~]# lctl get_param mdc.fir-*.import | grep -q async_discard && echo ok
            ok
            
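            A fleet-wide version of the same check, as a sketch using ClusterShell's clush (the node set below is hypothetical; -b folds identical output together):

            clush -w sh-[101-118]-[01-72] -b 'lctl get_param mdc.fir-*.import | grep -q async_discard && echo ok'
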

            Thanks much for your efforts! Fingers crossed.


            sthiell Stephane Thiell added a comment -

            Will try the following on top of 2.12.0:

            servers (fir):

            $ git log --oneline
            1819063 LU-11359 mdt: fix mdt_dom_discard_data() timeouts
            3ed10f4 LU-11964 mdc: prevent glimpse lock count grow
            565011c LU-12037 mdt: add option for cross-MDT rename
            b6be1d9 LU-12065 lnd: increase CQ entries
            6b2c97b LU-12037 mdt: call mdt_dom_discard_data() after rename unlock
            

            clients (sherlock):

            8a47fe6 LU-11359 mdt: fix mdt_dom_discard_data() timeouts
            8bf1d95 LU-12065 lnd: increase CQ entries
            7b8a3b3 LU-11964 mdc: prevent glimpse lock count grow
            

            sthiell Stephane Thiell added a comment -

            Thanks!!

            This does compile against my 2.12.0 branch: https://gist.github.com/thiell/ce2445aa1e37b365e50208ebbb0b5263

            Let me know if that sounds ok to you. I will build new packages after lunch.

            Thanks for the remote lock explanation, this is super interesting, and I'm glad to hear that it doesn't necessarily mean a remote rename!!


            pfarrell Patrick Farrell (Inactive) added a comment - edited

            Stephane,

            You can replace ldlm_is_granted(lock) with (lock->l_req_mode == lock->l_granted_mode).

            About the remote lock:
            If, as in the case from earlier, it's inside mdt_rename_lock and it's on a secondary MDT (not MDT0), then it's not an issue.  [Edit: Sorry, too brief....] It's just trying to get the rename lock on MDT0.  That does indicate some sort of hang if you're getting back traces, but it doesn't actually indicate a remote rename.  There is only a single rename lock ("big file system lock" or BFL, as Andreas labels it) for the whole file system.  This is kind of standard (EXT4, at least, is the same) as it massively simplifies rename locking (which is still quite complicated).


            People

              Assignee: laisiyao Lai Siyao
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: