Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11964

Heavy load and soft lockups on MDS with DOM

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.13.0, Lustre 2.12.1
    • Lustre 2.12.0
    • None
    • CentOS 7.6, Linux Kernel 3.10.0-957.1.3.el7_lustre.x86_64, Lustre 2.12 clients (Sherlock cluster)
    • 2
    • 9223372036854775807

    Description

      On Fir, our new scratch for Sherlock running 2.12, the filesystem hung today and then operations resumed. Fir is using DNE, PFL and DOM. The MDS hosting mdt[0,2] seems to have been under a heavy load. For example:

      [481509.742936] Call Trace:
      [481509.745520]  [<ffffffffc0f73303>] ldlm_process_inodebits_lock+0x93/0x3d0 [ptlrpc]
      [481509.753123]  [<ffffffffc0f61fc0>] ? ldlm_export_lock_object+0x10/0x10 [ptlrpc]
      [481509.760464]  [<ffffffffc0f73270>] ? ldlm_inodebits_compat_queue+0x440/0x440 [ptlrpc]
      [481509.768321]  [<ffffffffc0f4160e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
      [481509.775403]  [<ffffffffc0f426cb>] __ldlm_reprocess_all+0x10b/0x380 [ptlrpc]
      [481509.782482]  [<ffffffffc0f434a2>] ldlm_cancel_lock_for_export.isra.26+0x1c2/0x390 [ptlrpc]
      [481509.790863]  [<ffffffffc0f43851>] ldlm_export_cancel_blocked_locks+0x121/0x200 [ptlrpc]
      [481509.798987]  [<ffffffffc0f6c6e2>] ldlm_bl_thread_main+0x112/0x700 [ptlrpc]
      [481509.805950]  [<ffffffff9d6d67b0>] ? wake_up_state+0x20/0x20
      [481509.811650]  [<ffffffffc0f6c5d0>] ? ldlm_handle_bl_callback+0x530/0x530 [ptlrpc]
      [481509.819130]  [<ffffffff9d6c1c31>] kthread+0xd1/0xe0
      [481509.824095]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481509.830277]  [<ffffffff9dd74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [481509.836799]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481509.842979] Code: 74 0e 4c 89 f2 48 89 de 4c 89 ff e8 d3 a2 fc ff 49 8b 87 10 02 00 00 4c 8d a8 f0 fd ff ff 4d 39 fd 74 2b 49 83 bd a8 00 00 00 00 <74> 0e 4c 89 f2 48 89 de 4c 89 ef e8 a8 a2 fc ff 4d 8b ad 10 02 
      [481516.050304] INFO: rcu_sched self-detected stall on CPU
      [481516.051302] INFO: rcu_sched self-detected stall on CPU
      [481516.051303]  {
      [481516.051303]  15
      [481516.051304] } 
      [481516.051314]  (t=60001 jiffies g=115591743 c=115591742 q=6728165)
      [481516.051315] Task dump for CPU 14:
      [481516.051316] ldlm_bl_01      R
      [481516.051316]   running task    
      [481516.051318]     0 54053      2 0x00000088
      [481516.051318] Call Trace:
      [481516.051362]  [<ffffffffc0f43851>] ? ldlm_export_cancel_blocked_locks+0x121/0x200 [ptlrpc]
      [481516.051394]  [<ffffffffc0f6c6e2>] ? ldlm_bl_thread_main+0x112/0x700 [ptlrpc]
      [481516.051397]  [<ffffffff9d6d67b0>] ? wake_up_state+0x20/0x20
      [481516.051426]  [<ffffffffc0f6c5d0>] ? ldlm_handle_bl_callback+0x530/0x530 [ptlrpc]
      [481516.051427]  [<ffffffff9d6c1c31>] ? kthread+0xd1/0xe0
      [481516.051429]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481516.051431]  [<ffffffff9dd74c24>] ? ret_from_fork_nospec_begin+0xe/0x21
      [481516.051432]  [<ffffffff9d6c1b60>] ? insert_kthread_work+0x40/0x40
      [481516.051432] Task dump for CPU 15:
      

      filesystem is still running but some threads are at 100% on the MDS

         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                      
       54053 root      20   0       0      0      0 R 100.0  0.0   2:02.22 ldlm_bl_01                                                                                                                                   
       55962 root      20   0       0      0      0 R 100.0  0.0   2:01.79 ldlm_bl_03                                                                                                                                   
       56043 root      20   0       0      0      0 R 100.0  0.0   3:20.41 ldlm_bl_04                                                                                                                                   
       63820 root      20   0       0      0      0 R 100.0  0.0  10:13.65 mdt03_033  
      
      [root@fir-md1-s1 ~]# lctl get_param ldlm.*.*mdt*.*
      ldlm.namespaces.mdt-fir-MDT0000_UUID.contended_locks=32
      ldlm.namespaces.mdt-fir-MDT0000_UUID.contention_seconds=2
      ldlm.namespaces.mdt-fir-MDT0000_UUID.ctime_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0000_UUID.dirty_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0000_UUID.early_lock_cancel=0
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_count=146645
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_timeouts=8
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lock_unused_count=0
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_max_age=3900000
      ldlm.namespaces.mdt-fir-MDT0000_UUID.lru_size=4800
      ldlm.namespaces.mdt-fir-MDT0000_UUID.max_nolock_bytes=0
      ldlm.namespaces.mdt-fir-MDT0000_UUID.max_parallel_ast=1024
      ldlm.namespaces.mdt-fir-MDT0000_UUID.resource_count=62361
      ldlm.namespaces.mdt-fir-MDT0002_UUID.contended_locks=32
      ldlm.namespaces.mdt-fir-MDT0002_UUID.contention_seconds=2
      ldlm.namespaces.mdt-fir-MDT0002_UUID.ctime_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0002_UUID.dirty_age_limit=10
      ldlm.namespaces.mdt-fir-MDT0002_UUID.early_lock_cancel=0
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_count=6472231
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_timeouts=166760
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lock_unused_count=0
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_max_age=3900000
      ldlm.namespaces.mdt-fir-MDT0002_UUID.lru_size=4800
      ldlm.namespaces.mdt-fir-MDT0002_UUID.max_nolock_bytes=0
      ldlm.namespaces.mdt-fir-MDT0002_UUID.max_parallel_ast=1024
      ldlm.namespaces.mdt-fir-MDT0002_UUID.resource_count=359599
      

      MDS resumed operations by itself, the last logs were:

      [482208.719700] LustreError: 63820:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1549999197, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-fir-MDT0002_UUID lock: ffff8995f523f740/0x1a4b7ac741771fee lrc: 3/0,1 mode: --/PW res: [0x2c00033c7:0x2b6e:0x0].0x0 bits 0x40/0x0 rrc: 39993 type: IBT flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 63820 timeout: 0 lvb_type: 0
      [482361.725697] LNet: Service thread pid 63820 completed after 1225.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [482666.099950] Lustre: fir-MDT0002: Client 0ccdc4e2-9749-c9a5-afb4-85874ce74d6c (at 10.0.10.3@o2ib7) reconnecting
      

      I'm attaching the full syslog output.

      Now everything looks good even though this MDS is still quite loaded in terms of CPU.

      Thanks,
      Stephane

      Attachments

        1. fir-md1-s1.syslog-20190212.log.gz
          303 kB
        2. fir-md1-s1-20190213.log
          82 kB
        3. lustre-log.1550078704.21794
          39 kB
        4. fir-md1-s1-dk.log
          1.27 MB
        5. sh-106-60.dk.log.gz
          4.11 MB
        6. fir-md1-s2-dk.log
          7.15 MB
        7. sherlocks.txt
          17 kB

        Issue Links

          Activity

            People

              tappro Mikhail Pershin
              sthiell Stephane Thiell
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: