
[LU-11964] Heavy load and soft lockups on MDS with DOM

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6, Linux Kernel 3.10.0-957.1.3.el7_lustre.x86_64, Lustre 2.12 clients (Sherlock cluster)
    • Severity: 2

    Description

      On Fir, our new scratch filesystem for Sherlock running 2.12, the filesystem hung today and then operations resumed. Fir uses DNE, PFL and DOM. The MDS hosting mdt[0,2] seems to have been under heavy load. For example:

      [481509.742936] Call Trace:
      [481509.745520]  [<ffffffffc0f73303>] ldlm_process_inodebits_lock+0x93/0x3d0 [ptlrpc]
      [481509.753123]  [<ffffffffc0f61fc0>] ? ldlm_export_lock_object+0x10/0x10 [ptlrpc]
      [481509.760464]  [<ffffffffc0f73270>] ? ldlm_inodebits_compat_queue+0x440/0x440 [ptlrpc]
      [481509.768321]  [<ffffffffc0f4160e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
      [481509.775403]  [<ffffffffc0f426cb>] __ldlm_reprocess_all+0x10b/0x380 [ptlrpc]
      [481509.782482]  [<ffffffffc0f434a2>] ldlm_cancel_lock_for_export.isra.26+0x1c2/0x390 [ptlrpc]
      [481509.790863]  [<ffffffffc0f43851>] ldlm_export_cancel_blocked_locks+0x121/0x200 [ptlrpc]
      [481509.798987]  [<ffffffffc0f6c6e2>] ldlm_bl_thread_main+0x112/0x700 [ptlrpc]
      [481509.805950]  [<ffffffff9d6d67b0>] ? wake_up_state+0x20/0x20
      [481509.811650]  [<ffffffffc0f6c5d0>] ? ldlm_handle_bl_callback+0x530/0x530 [ptlrpc]
      [481509.819130]  [<ffffffff9d6c1c31>] kthread+0xd1/0xe0
      [481509.824095]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481509.830277]  [<ffffffff9dd74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [481509.836799]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481509.842979] Code: 74 0e 4c 89 f2 48 89 de 4c 89 ff e8 d3 a2 fc ff 49 8b 87 10 02 00 00 4c 8d a8 f0 fd ff ff 4d 39 fd 74 2b 49 83 bd a8 00 00 00 00 <74> 0e 4c 89 f2 48 89 de 4c 89 ef e8 a8 a2 fc ff 4d 8b ad 10 02 
      [481516.050304] INFO: rcu_sched self-detected stall on CPU
      [481516.051302] INFO: rcu_sched self-detected stall on CPU { 15 }  (t=60001 jiffies g=115591743 c=115591742 q=6728165)
      [481516.051315] Task dump for CPU 14:
      [481516.051316] ldlm_bl_01      R  running task        0 54053      2 0x00000088
      [481516.051318] Call Trace:
      [481516.051362]  [<ffffffffc0f43851>] ? ldlm_export_cancel_blocked_locks+0x121/0x200 [ptlrpc]
      [481516.051394]  [<ffffffffc0f6c6e2>] ? ldlm_bl_thread_main+0x112/0x700 [ptlrpc]
      [481516.051397]  [<ffffffff9d6d67b0>] ? wake_up_state+0x20/0x20
      [481516.051426]  [<ffffffffc0f6c5d0>] ? ldlm_handle_bl_callback+0x530/0x530 [ptlrpc]
      [481516.051427]  [<ffffffff9d6c1c31>] ? kthread+0xd1/0xe0
      [481516.051429]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481516.051431]  [<ffffffff9dd74c24>] ? ret_from_fork_nospec_begin+0xe/0x21
      [481516.051432]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481516.051432] Task dump for CPU 15:
      

      The filesystem is still running, but some threads are at 100% CPU on the MDS:

         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                      
       54053 root      20   0       0      0      0 R 100.0  0.0   2:02.22 ldlm_bl_01                                                                                                                                   
       55962 root      20   0       0      0      0 R 100.0  0.0   2:01.79 ldlm_bl_03                                                                                                                                   
       56043 root      20   0       0      0      0 R 100.0  0.0   3:20.41 ldlm_bl_04                                                                                                                                   
       63820 root      20   0       0      0      0 R 100.0  0.0  10:13.65 mdt03_033  
      
      [root@fir-md1-s1 ~]# lctl get_param ldlm.*.*mdt*.*
      ldlm.namespaces.mdt-fir-MDT0000_UUID.contended_locks=32
      ldlm.namespaces.mdt-fir-MDT0000_UUID.contention_seconds=2
      ldlm.namespaces.mdt-fir-MDT0000_UUID.ctime_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0000_UUID.dirty_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0000_UUID.early_lock_cancel=0
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=146645
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_timeouts=8
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_unused_count=0
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_max_age=3900000
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_size=4800
      ldlm.namespaces.mdt-fir-MDT0000_UUID.max_nolock_bytes=0
      ldlm.namespaces.mdt-fir-MDT0000_UUID.max_parallel_ast=1024
      ldlm.namespaces.mdt-fir-MDT0000_UUID.resource_count=62361
      ldlm.namespaces.mdt-fir-MDT0002_UUID.contended_locks=32
      ldlm.namespaces.mdt-fir-MDT0002_UUID.contention_seconds=2
      ldlm.namespaces.mdt-fir-MDT0002_UUID.ctime_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0002_UUID.dirty_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0002_UUID.early_lock_cancel=0
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=6472231
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_timeouts=166760
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_unused_count=0
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_max_age=3900000
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_size=4800
      ldlm.namespaces.mdt-fir-MDT0002_UUID.max_nolock_bytes=0
      ldlm.namespaces.mdt-fir-MDT0002_UUID.max_parallel_ast=1024
      ldlm.namespaces.mdt-fir-MDT0002_UUID.resource_count=359599
      

      The MDS resumed operations by itself; the last logs were:

      [482208.719700] LustreError: 63820:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549999197, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0002_UUID lock: ffff8995f523f740/0x1a4b7ac741771fee lrc: 3/0,1 mode: --/PW res: [0x2c00033c7:0x2b6e:0x0].0x0 bits 0x40/0x0 rrc: 39993 type: IBT flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 63820 timeout: 0 lvb_type: 0
      [482361.725697] LNet: Service thread pid 63820 completed after 1225.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [482666.099950] Lustre: fir-MDT0002: Client 0ccdc4e2-9749-c9a5-afb4-85874ce74d6c (at 10.0.10.3@o2ib7) reconnecting
      

      I'm attaching the full syslog output.

      Now everything looks good even though this MDS is still quite loaded in terms of CPU.

      Thanks,
      Stephane
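
      For background, DOM keeps the first component of a file on the MDT, so it is defined as part of a PFL layout whose first extent has layout type "mdt". A minimal sketch of such a layout (the directory, component sizes, and stripe counts below are only illustrative, not Fir's actual settings):

      # Illustrative only: first 1 MiB of each file on the MDT (DOM), the rest striped over OSTs
      lfs setstripe -E 1M -L mdt -E eof -c 4 -S 1M /fir/some/dir
      # Check the resulting layout
      lfs getstripe /fir/some/dir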

      Attachments

        1. fir-md1-s1.syslog-20190212.log.gz
          303 kB
        2. fir-md1-s1-20190213.log
          82 kB
        3. fir-md1-s1-dk.log
          1.27 MB
        4. fir-md1-s2-dk.log
          7.15 MB
        5. lustre-log.1550078704.21794
          39 kB
        6. sh-106-60.dk.log.gz
          4.11 MB
        7. sherlocks.txt
          17 kB


          Activity

            [LU-11964] Heavy load and soft lockups on MDS with DOM
            pjones Peter Jones added a comment -

            That is good news, Stephane. Is the LU-11964 patch the only one applied on top of the vanilla 2.12.0 - or is there anything else?


            sthiell Stephane Thiell added a comment -

            As of tonight, the situation is stable, and we haven't had any new server crashes (yay), so I guess the patch is definitely working. I can't thank you enough, Patrick and Mike! The lock counters are still relatively high, but there are ~6,300 jobs running at the moment, probably most of them using /fir, so I assume this is normal:

            $ clush -w@mds -L lctl get_param ldlm.*.*mdt*.lock_count 
            fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=439815
            fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=241967
            fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=1138179
            fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_count=623797
            
            [root@fir-md1-s2 ~]# lctl get_param ldlm.namespaces.mdt-fir-MDT0001_UUID.*
            ldlm.namespaces.mdt-fir-MDT0001_UUID.contended_locks=32
            ldlm.namespaces.mdt-fir-MDT0001_UUID.contention_seconds=2
            ldlm.namespaces.mdt-fir-MDT0001_UUID.ctime_age_limit=10
            ldlm.namespaces.mdt-fir-MDT0001_UUID.dirty_age_limit=10
            ldlm.namespaces.mdt-fir-MDT0001_UUID.early_lock_cancel=0
            ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=1126644
            ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_timeouts=44
            ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_unused_count=0
            ldlm.namespaces.mdt-fir-MDT0001_UUID.lru_max_age=3900000
            ldlm.namespaces.mdt-fir-MDT0001_UUID.lru_size=4800
            ldlm.namespaces.mdt-fir-MDT0001_UUID.max_nolock_bytes=0
            ldlm.namespaces.mdt-fir-MDT0001_UUID.max_parallel_ast=1024
            ldlm.namespaces.mdt-fir-MDT0001_UUID.resource_count=354248
            

            sthiell Stephane Thiell added a comment -

            At this point we have upgraded more than 1300 clients with the patch (more than 90% of the cluster), and I have just tried removing the 2-minute lru_size=clear workaround on those nodes.

            At first the lock counters increased a lot, but I think they have stabilized below 1M per MDT:

            $ clush -w@mds -L lctl get_param ldlm.*.*mdt*.lock_count 
            fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=253634
            fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=241738
            fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=664659
            fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_count=370532
            
            $ clush -w@mds -L lctl get_param ldlm.*.*mdt*.lock_count 
            fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=252612
            fir-md1-s1: ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=239413
            fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0001_UUID.lock_count=660803
            fir-md1-s2: ldlm.namespaces.mdt-fir-MDT0003_UUID.lock_count=371448
            

            I'll confirm/update tomorrow.
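
            To see the trend rather than isolated snapshots, the same counters can be sampled periodically; a small sketch using the same clush group as above (the interval and log path are arbitrary):

            # Sample MDT lock counts every 5 minutes with a timestamp (illustrative)
            while true; do
                date
                clush -w@mds -L lctl get_param ldlm.*.*mdt*.lock_count
                sleep 300
            done | tee -a /tmp/mdt_lock_count.log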


            sthiell Stephane Thiell added a comment -

            We have started the deployment of clients with 2.12.0 + the patch above on Sherlock. It will take some time to get all nodes up to date (at least 2 weeks), but we'll progressively remove the lru_size=clear workaround on the updated nodes. Thanks!


            sthiell Stephane Thiell added a comment -

            Ah great! Thanks Patrick, we'll work on it.


            pfarrell Patrick Farrell (Inactive) added a comment - edited

            Stephane,

            The patch is client-side only, and it has passed Maloo testing (including DOM). I think it's ready to try.


            sthiell Stephane Thiell added a comment -

            Quick update: we're still using the lru_size=clear workaround and the servers have been stable. However, it would be nice to try a patch at some point. It looks like the patch hasn't passed the DOM tests yet? In any case, let me know when it is ready. Of course, patching clients takes a lot more effort. Is the patch client + server?

            Thanks again,

            Stephane
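
            For reference, the client-side workaround referred to throughout this ticket is a periodic flush of the client LDLM lock LRU via lctl; a minimal sketch of the 2-minute variant as a cron.d entry (the delivery mechanism and file path are only an example):

            # /etc/cron.d/lustre-lru-clear (illustrative): drop unused client LDLM locks every 2 minutes
            */2 * * * * root /usr/sbin/lctl set_param -n ldlm.namespaces.*.lru_size=clear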


            gerrit Gerrit Updater added a comment -

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34261
            Subject: LU-11964 mdc: prevent glimpse lock count grow
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2f98ba5cf94352be6028bd9ddf93fe1f5623b347


            sthiell Stephane Thiell added a comment -

            That's great news, I'm so glad you were able to reproduce the issue locally. We can stay with the lru_size=clear workaround for the time being until the patch is ready.


            tappro Mikhail Pershin added a comment -

            Stephane, thanks a lot. I think I am able to reproduce that behavior with a quite simple test: it shows a growing lock count without the patch and no lock count increase with the patch. I will push the patch to Gerrit as soon as the local tests finish.
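
            A test of that general shape might look roughly like the following on a client (this is only a guess at the structure, not the actual test used): keep one process writing to a DOM file while another repeatedly stats it to trigger glimpses, and watch the MDT lock count on the server.

            # Purely illustrative reproducer sketch; file path and layout are examples
            lfs setstripe -E 1M -L mdt -E eof -c 1 /mnt/fir/domfile
            while true; do dd if=/dev/zero of=/mnt/fir/domfile bs=4k count=1 \
                oflag=append conv=notrunc status=none; done &
            while true; do stat /mnt/fir/domfile > /dev/null; done &
            # On the MDS, check whether lock_count keeps growing:
            #   watch -n 10 "lctl get_param ldlm.namespaces.*mdt*.lock_count"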


            sthiell Stephane Thiell added a comment -

            Ah, actually fd 16 is not that file but another one... Sorry.


            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 8
