LU-13442: OSS high load and deadlock after 20 days of run time


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: Lustre 2.12.4
    • Environment: CentOS 7.6 (3.10.0-957.27.2.el7_lustre.pl2.x86_64)
    • Severity: 3

    Description

      On Fir, running vanilla Lustre 2.12.4, we hit what looks like a deadlock on April 7, after 20 days of trouble-free runtime.

      The first backtrace shows kswapd0 blocked in the jbd2 journal (wait_transaction_locked) while evicting inodes through shrink_slab/prune_icache_sb:

      [1749118.074334] Lustre: fir-OST0030: Connection restored to f4363950-d6c3-4 (at 10.50.9.37@o2ib2)
      [1749118.083036] Lustre: Skipped 5 previous similar messages
      [1751518.726158] INFO: task kswapd0:276 blocked for more than 120 seconds.
      [1751518.732778] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [1751518.740788] kswapd0         D ffff9fdbf9d74100     0   276      2 0x00000000
      [1751518.748096] Call Trace:
      [1751518.750732]  [<ffffffff9e2e367f>] ? enqueue_entity+0x2ef/0xbe0
      [1751518.756778]  [<ffffffff9e96af19>] schedule+0x29/0x70
      [1751518.761950]  [<ffffffffc0295085>] wait_transaction_locked+0x85/0xd0 [jbd2]
      [1751518.769035]  [<ffffffff9e2c3f50>] ? wake_up_atomic_t+0x30/0x30
      [1751518.775088]  [<ffffffffc0295368>] add_transaction_credits+0x268/0x2f0 [jbd2]
      [1751518.782391]  [<ffffffffc02955e1>] start_this_handle+0x1a1/0x430 [jbd2]
      [1751518.789096]  [<ffffffff9e41d7f2>] ? kmem_cache_alloc+0x1c2/0x1f0
      [1751518.795317]  [<ffffffffc0295a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
      [1751518.802119]  [<ffffffffc1293c0c>] ? ldiskfs_release_dquot+0x5c/0xa0 [ldiskfs]
      [1751518.809470]  [<ffffffffc129c589>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
      [1751518.817072]  [<ffffffffc1293c0c>] ldiskfs_release_dquot+0x5c/0xa0 [ldiskfs]
      [1751518.824232]  [<ffffffff9e4afc1d>] dqput+0xbd/0x1f0
      [1751518.829216]  [<ffffffff9e4b12d5>] __dquot_drop+0x85/0xc0
      [1751518.834723]  [<ffffffff9e4b1345>] dquot_drop+0x35/0x40
      [1751518.840067]  [<ffffffffc1298602>] ldiskfs_clear_inode+0x22/0x90 [ldiskfs]
      [1751518.847034]  [<ffffffffc127bf3f>] ldiskfs_evict_inode+0x9f/0x630 [ldiskfs]
      [1751518.854100]  [<ffffffff9e45ff34>] evict+0xb4/0x180
      [1751518.859095]  [<ffffffff9e46003e>] dispose_list+0x3e/0x50
      [1751518.864588]  [<ffffffff9e46104c>] prune_icache_sb+0x17c/0x330
      [1751518.870517]  [<ffffffff9e4453b3>] prune_super+0x173/0x190
      [1751518.876097]  [<ffffffff9e3cb155>] shrink_slab+0x175/0x340
      [1751518.881695]  [<ffffffff9e438731>] ? vmpressure+0x21/0x90
      [1751518.887205]  [<ffffffff9e3cef51>] balance_pgdat+0x4b1/0x5e0
      [1751518.892970]  [<ffffffff9e3cf1f3>] kswapd+0x173/0x440
      [1751518.898136]  [<ffffffff9e2c3f50>] ? wake_up_atomic_t+0x30/0x30
      [1751518.904172]  [<ffffffff9e3cf080>] ? balance_pgdat+0x5e0/0x5e0
      [1751518.910113]  [<ffffffff9e2c2e81>] kthread+0xd1/0xe0
      [1751518.915199]  [<ffffffff9e2c2db0>] ? insert_kthread_work+0x40/0x40
      [1751518.921484]  [<ffffffff9e977c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1751518.928125]  [<ffffffff9e2c2db0>] ? insert_kthread_work+0x40/0x40
      [1751518.934559] INFO: task jbd2/md4-8:48176 blocked for more than 120 seconds.
      
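      The hung-task warnings above come from the kernel's hung-task detector, and the PANIC string in the dump signature below ("SysRq : Trigger a crash") indicates the vmcore was captured by deliberately crashing the node via sysrq. A minimal sketch of those standard RHEL 7 interfaces (illustrative, not commands taken from this ticket):

      # Dump stacks of all blocked (D state) tasks to the kernel log
      echo w > /proc/sysrq-trigger
      # Crash the node on purpose so kdump writes a vmcore ("SysRq : Trigger a crash")
      echo c > /proc/sysrq-trigger
      # Hung-task detector timeout referenced in the log above
      sysctl kernel.hung_task_timeout_secs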

      I took a crash dump, which is available on the Whamcloud FTP as fir-io5-s1_20200407_vmcore with the following signature:

            KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
          DUMPFILE: vmcore  [PARTIAL DUMP]
              CPUS: 48
              DATE: Tue Apr  7 21:51:23 2020
            UPTIME: 20 days, 08:21:02
      LOAD AVERAGE: 918.58, 918.33, 900.42
             TASKS: 2426
          NODENAME: fir-io5-s1
           RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
           VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
           MACHINE: x86_64  (1996 Mhz)
            MEMORY: 255.6 GB
             PANIC: "SysRq : Trigger a crash"
               PID: 96554
           COMMAND: "bash"
              TASK: ffff9ff7679c4100  [THREAD_INFO: ffff9ff9e4e74000]
               CPU: 23
             STATE: TASK_RUNNING (SYSRQ)
      

      I'm also attaching the output of foreach bt for convenience as fir-io5-s1_20200407_vmcore-dmesg.txt.
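
      For reference, a sketch of the crash(8) session that produces a summary like the one above and the foreach bt output (assuming kernel-debuginfo is installed; the output file name below is illustrative, not the actual attachment name):

      crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux vmcore
      crash> sys                          # system/panic summary as shown above
      crash> foreach bt > foreach-bt.txt  # stack traces of all tasks
      crash> ps -m | grep UN              # tasks in uninterruptible sleep, with last-run times
      crash> bt 276                       # e.g. the kswapd0 stack from the hung-task message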

      Could you help me figure out what went wrong here?

      Restarting the OSS fixed the issue, although recovery took 15 minutes. Later, an MDS was identified as slow and logged a few backtraces; I'm preparing a separate ticket for that MDS issue. All servers run 2.12.4, and the clients on Sherlock run 2.13.
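
      For completeness, recovery progress after an OSS restart can be followed with the standard Lustre parameters below (a generic sketch; not the exact commands run on Fir):

      # Per-OST recovery state on the restarted OSS (status, connected/evicted clients, time remaining)
      lctl get_param obdfilter.*.recovery_status
      # Confirm the Lustre version running on a node
      lctl get_param version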

      Attachments

        Activity

          People

            Assignee: Dongyang Li (dongyang)
            Reporter: Stephane Thiell (sthiell)