LU-13442: OSS high load and deadlock after 20 days of run time


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: Lustre 2.12.4
    • Environment: CentOS 7.6 (3.10.0-957.27.2.el7_lustre.pl2.x86_64)
    • Severity: 3

    Description

      On Fir, running vanilla Lustre 2.12.4, we hit what looks like a deadlock on April 7, after 20 days of trouble-free runtime.

      The first backtrace shows kswapd0 blocked in the jbd2 journal (wait_transaction_locked) while evicting inodes through shrink_slab/prune_icache_sb:

      [1749118.074334] Lustre: fir-OST0030: Connection restored to f4363950-d6c3-4 (at 10.50.9.37@o2ib2)
      [1749118.083036] Lustre: Skipped 5 previous similar messages
      [1751518.726158] INFO: task kswapd0:276 blocked for more than 120 seconds.
      [1751518.732778] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [1751518.740788] kswapd0         D ffff9fdbf9d74100     0   276      2 0x00000000
      [1751518.748096] Call Trace:
      [1751518.750732]  [<ffffffff9e2e367f>] ? enqueue_entity+0x2ef/0xbe0
      [1751518.756778]  [<ffffffff9e96af19>] schedule+0x29/0x70
      [1751518.761950]  [<ffffffffc0295085>] wait_transaction_locked+0x85/0xd0 [jbd2]
      [1751518.769035]  [<ffffffff9e2c3f50>] ? wake_up_atomic_t+0x30/0x30
      [1751518.775088]  [<ffffffffc0295368>] add_transaction_credits+0x268/0x2f0 [jbd2]
      [1751518.782391]  [<ffffffffc02955e1>] start_this_handle+0x1a1/0x430 [jbd2]
      [1751518.789096]  [<ffffffff9e41d7f2>] ? kmem_cache_alloc+0x1c2/0x1f0
      [1751518.795317]  [<ffffffffc0295a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
      [1751518.802119]  [<ffffffffc1293c0c>] ? ldiskfs_release_dquot+0x5c/0xa0 [ldiskfs]
      [1751518.809470]  [<ffffffffc129c589>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
      [1751518.817072]  [<ffffffffc1293c0c>] ldiskfs_release_dquot+0x5c/0xa0 [ldiskfs]
      [1751518.824232]  [<ffffffff9e4afc1d>] dqput+0xbd/0x1f0
      [1751518.829216]  [<ffffffff9e4b12d5>] __dquot_drop+0x85/0xc0
      [1751518.834723]  [<ffffffff9e4b1345>] dquot_drop+0x35/0x40
      [1751518.840067]  [<ffffffffc1298602>] ldiskfs_clear_inode+0x22/0x90 [ldiskfs]
      [1751518.847034]  [<ffffffffc127bf3f>] ldiskfs_evict_inode+0x9f/0x630 [ldiskfs]
      [1751518.854100]  [<ffffffff9e45ff34>] evict+0xb4/0x180
      [1751518.859095]  [<ffffffff9e46003e>] dispose_list+0x3e/0x50
      [1751518.864588]  [<ffffffff9e46104c>] prune_icache_sb+0x17c/0x330
      [1751518.870517]  [<ffffffff9e4453b3>] prune_super+0x173/0x190
      [1751518.876097]  [<ffffffff9e3cb155>] shrink_slab+0x175/0x340
      [1751518.881695]  [<ffffffff9e438731>] ? vmpressure+0x21/0x90
      [1751518.887205]  [<ffffffff9e3cef51>] balance_pgdat+0x4b1/0x5e0
      [1751518.892970]  [<ffffffff9e3cf1f3>] kswapd+0x173/0x440
      [1751518.898136]  [<ffffffff9e2c3f50>] ? wake_up_atomic_t+0x30/0x30
      [1751518.904172]  [<ffffffff9e3cf080>] ? balance_pgdat+0x5e0/0x5e0
      [1751518.910113]  [<ffffffff9e2c2e81>] kthread+0xd1/0xe0
      [1751518.915199]  [<ffffffff9e2c2db0>] ? insert_kthread_work+0x40/0x40
      [1751518.921484]  [<ffffffff9e977c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1751518.928125]  [<ffffffff9e2c2db0>] ? insert_kthread_work+0x40/0x40
      [1751518.934559] INFO: task jbd2/md4-8:48176 blocked for more than 120 seconds.
      
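      The hung-task warnings above come from the kernel's hung-task detector, and the PANIC string in the dump signature below ("SysRq : Trigger a crash") indicates the vmcore was captured by deliberately crashing the node via sysrq. A minimal sketch of those standard RHEL 7 interfaces (illustrative, not commands taken from this ticket):

      # Dump stacks of all blocked (D state) tasks to the kernel log
      echo w > /proc/sysrq-trigger
      # Crash the node on purpose so kdump writes a vmcore ("SysRq : Trigger a crash")
      echo c > /proc/sysrq-trigger
      # Hung-task detector timeout referenced in the log above
      sysctl kernel.hung_task_timeout_secs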

      I took a crash dump, which is available on the Whamcloud FTP as fir-io5-s1_20200407_vmcore with the following signature:

            KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
          DUMPFILE: vmcore  [PARTIAL DUMP]
              CPUS: 48
              DATE: Tue Apr  7 21:51:23 2020
            UPTIME: 20 days, 08:21:02
      LOAD AVERAGE: 918.58, 918.33, 900.42
             TASKS: 2426
          NODENAME: fir-io5-s1
           RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
           VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
           MACHINE: x86_64  (1996 Mhz)
            MEMORY: 255.6 GB
             PANIC: "SysRq : Trigger a crash"
               PID: 96554
           COMMAND: "bash"
              TASK: ffff9ff7679c4100  [THREAD_INFO: ffff9ff9e4e74000]
               CPU: 23
             STATE: TASK_RUNNING (SYSRQ)
      

      I'm also attaching the output of foreach bt for convenience as fir-io5-s1_20200407_vmcore-dmesg.txt.
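
      For reference, a sketch of the crash(8) session that produces a summary like the one above and the foreach bt output (assuming kernel-debuginfo is installed; the output file name below is illustrative, not the actual attachment name):

      crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux vmcore
      crash> sys                          # system/panic summary as shown above
      crash> foreach bt > foreach-bt.txt  # stack traces of all tasks
      crash> ps -m | grep UN              # tasks in uninterruptible sleep, with last-run times
      crash> bt 276                       # e.g. the kswapd0 stack from the hung-task message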

      Could you help me figure out what went wrong here?

      Restarting the OSS fixed the issue, although recovery took 15 minutes. Later, an MDS was identified as slow and logged a few backtraces; I'm preparing a separate ticket for that MDS issue. All servers run 2.12.4, and the clients on Sherlock run 2.13.
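
      For completeness, recovery progress after an OSS restart can be followed with the standard Lustre parameters below (a generic sketch; not the exact commands run on Fir):

      # Per-OST recovery state on the restarted OSS (status, connected/evicted clients, time remaining)
      lctl get_param obdfilter.*.recovery_status
      # Confirm the Lustre version running on a node
      lctl get_param version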

      Attachments

        Activity

          People

            Assignee: Dongyang Li (dongyang)
            Reporter: Stephane Thiell (sthiell)