Details
Type: Bug
Resolution: Unresolved
Priority: Critical
Affects Version: Lustre 2.12.4
Fix Version: None
Labels: None
Environment: CentOS 7.6 (3.10.0-957.27.2.el7_lustre.pl2.x86_64)
Severity: 3
Rank: 9223372036854775807
Description
On Fir, running vanilla Lustre 2.12.4, we hit what looks like a deadlock on April 7, after 20 days of trouble-free runtime...
The first backtrace shows kswapd blocked in vmpressure/shrink_slab, waiting on a jbd2 transaction while evicting ldiskfs inodes:
[1749118.074334] Lustre: fir-OST0030: Connection restored to f4363950-d6c3-4 (at 10.50.9.37@o2ib2)
[1749118.083036] Lustre: Skipped 5 previous similar messages
[1751518.726158] INFO: task kswapd0:276 blocked for more than 120 seconds.
[1751518.732778] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1751518.740788] kswapd0 D ffff9fdbf9d74100 0 276 2 0x00000000
[1751518.748096] Call Trace:
[1751518.750732] [<ffffffff9e2e367f>] ? enqueue_entity+0x2ef/0xbe0
[1751518.756778] [<ffffffff9e96af19>] schedule+0x29/0x70
[1751518.761950] [<ffffffffc0295085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[1751518.769035] [<ffffffff9e2c3f50>] ? wake_up_atomic_t+0x30/0x30
[1751518.775088] [<ffffffffc0295368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[1751518.782391] [<ffffffffc02955e1>] start_this_handle+0x1a1/0x430 [jbd2]
[1751518.789096] [<ffffffff9e41d7f2>] ? kmem_cache_alloc+0x1c2/0x1f0
[1751518.795317] [<ffffffffc0295a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[1751518.802119] [<ffffffffc1293c0c>] ? ldiskfs_release_dquot+0x5c/0xa0 [ldiskfs]
[1751518.809470] [<ffffffffc129c589>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[1751518.817072] [<ffffffffc1293c0c>] ldiskfs_release_dquot+0x5c/0xa0 [ldiskfs]
[1751518.824232] [<ffffffff9e4afc1d>] dqput+0xbd/0x1f0
[1751518.829216] [<ffffffff9e4b12d5>] __dquot_drop+0x85/0xc0
[1751518.834723] [<ffffffff9e4b1345>] dquot_drop+0x35/0x40
[1751518.840067] [<ffffffffc1298602>] ldiskfs_clear_inode+0x22/0x90 [ldiskfs]
[1751518.847034] [<ffffffffc127bf3f>] ldiskfs_evict_inode+0x9f/0x630 [ldiskfs]
[1751518.854100] [<ffffffff9e45ff34>] evict+0xb4/0x180
[1751518.859095] [<ffffffff9e46003e>] dispose_list+0x3e/0x50
[1751518.864588] [<ffffffff9e46104c>] prune_icache_sb+0x17c/0x330
[1751518.870517] [<ffffffff9e4453b3>] prune_super+0x173/0x190
[1751518.876097] [<ffffffff9e3cb155>] shrink_slab+0x175/0x340
[1751518.881695] [<ffffffff9e438731>] ? vmpressure+0x21/0x90
[1751518.887205] [<ffffffff9e3cef51>] balance_pgdat+0x4b1/0x5e0
[1751518.892970] [<ffffffff9e3cf1f3>] kswapd+0x173/0x440
[1751518.898136] [<ffffffff9e2c3f50>] ? wake_up_atomic_t+0x30/0x30
[1751518.904172] [<ffffffff9e3cf080>] ? balance_pgdat+0x5e0/0x5e0
[1751518.910113] [<ffffffff9e2c2e81>] kthread+0xd1/0xe0
[1751518.915199] [<ffffffff9e2c2db0>] ? insert_kthread_work+0x40/0x40
[1751518.921484] [<ffffffff9e977c24>] ret_from_fork_nospec_begin+0xe/0x21
[1751518.928125] [<ffffffff9e2c2db0>] ? insert_kthread_work+0x40/0x40
[1751518.934559] INFO: task jbd2/md4-8:48176 blocked for more than 120 seconds.
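If it helps triage, a quick way to gauge how widespread the hang was is to count the hung-task reports and the recurring frames in the attached console log (fir-io5-s1_20200407_vmcore-dmesg.txt, see below); a minimal sketch, nothing more:

# count the hung-task warnings in the attached console log
grep -c 'blocked for more than 120 seconds' fir-io5-s1_20200407_vmcore-dmesg.txt
# count how often the frames of interest appear across all reported stacks
grep -oE 'wait_transaction_locked|ldiskfs_release_dquot|prune_icache_sb' \
    fir-io5-s1_20200407_vmcore-dmesg.txt | sort | uniq -c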
I took a crash dump, which is available on the Whamcloud FTP as fir-io5-s1_20200407_vmcore with the following signature:
KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 48
DATE: Tue Apr 7 21:51:23 2020
UPTIME: 20 days, 08:21:02
LOAD AVERAGE: 918.58, 918.33, 900.42
TASKS: 2426
NODENAME: fir-io5-s1
RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
MACHINE: x86_64 (1996 Mhz)
MEMORY: 255.6 GB
PANIC: "SysRq : Trigger a crash"
PID: 96554
COMMAND: "bash"
TASK: ffff9ff7679c4100 [THREAD_INFO: ffff9ff9e4e74000]
CPU: 23
STATE: TASK_RUNNING (SYSRQ)
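For anyone opening the dump, the header above comes straight from crash(8); a minimal sketch of the session, using the KERNEL and DUMPFILE paths listed in the signature:

crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux fir-io5-s1_20200407_vmcore
crash> sys        # prints the summary shown above
crash> log        # kernel ring buffer, including the hung-task reports
crash> bt 276     # stack of kswapd0 (PID 276 from the first report)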
I'm also attaching the output of foreach bt for convenience as fir-io5-s1_20200407_vmcore-dmesg.txt.
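In case it is useful, the attached backtraces can also be regenerated from the dump itself; a minimal sketch (the output file name here is just illustrative):

crash> foreach bt > foreach_bt.txt    # backtraces of all 2426 tasks
crash> foreach UN bt                  # only tasks in uninterruptible (D) state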
Could you help me figure out what went wrong here?
A restart of the OSS fixed the issue, although recovery took 15 minutes; later, an MDS was identified as slow and issued a few backtraces. I'm preparing another ticket for that MDS issue. All servers are running 2.12.4; clients on Sherlock are running 2.13.