[LU-13442] OSS high load and deadlock after 20 days of run time Created: 09/Apr/20 Updated: 20/Apr/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Dongyang Li |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6 (3.10.0-957.27.2.el7_lustre.pl2.x86_64) |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On Fir, with Lustre 2.12.4 (vanilla), we hit a sort of deadlock on April 7, after 20 days of great runtime... The first backtrace shows that kswapd is calling vmpressure/shrink_slab:

[1749118.074334] Lustre: fir-OST0030: Connection restored to f4363950-d6c3-4 (at 10.50.9.37@o2ib2)
[1749118.083036] Lustre: Skipped 5 previous similar messages
[1751518.726158] INFO: task kswapd0:276 blocked for more than 120 seconds.
[1751518.732778] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1751518.740788] kswapd0 D ffff9fdbf9d74100 0 276 2 0x00000000
[1751518.748096] Call Trace:
[1751518.750732] [<ffffffff9e2e367f>] ? enqueue_entity+0x2ef/0xbe0
[1751518.756778] [<ffffffff9e96af19>] schedule+0x29/0x70
[1751518.761950] [<ffffffffc0295085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[1751518.769035] [<ffffffff9e2c3f50>] ? wake_up_atomic_t+0x30/0x30
[1751518.775088] [<ffffffffc0295368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[1751518.782391] [<ffffffffc02955e1>] start_this_handle+0x1a1/0x430 [jbd2]
[1751518.789096] [<ffffffff9e41d7f2>] ? kmem_cache_alloc+0x1c2/0x1f0
[1751518.795317] [<ffffffffc0295a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[1751518.802119] [<ffffffffc1293c0c>] ? ldiskfs_release_dquot+0x5c/0xa0 [ldiskfs]
[1751518.809470] [<ffffffffc129c589>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[1751518.817072] [<ffffffffc1293c0c>] ldiskfs_release_dquot+0x5c/0xa0 [ldiskfs]
[1751518.824232] [<ffffffff9e4afc1d>] dqput+0xbd/0x1f0
[1751518.829216] [<ffffffff9e4b12d5>] __dquot_drop+0x85/0xc0
[1751518.834723] [<ffffffff9e4b1345>] dquot_drop+0x35/0x40
[1751518.840067] [<ffffffffc1298602>] ldiskfs_clear_inode+0x22/0x90 [ldiskfs]
[1751518.847034] [<ffffffffc127bf3f>] ldiskfs_evict_inode+0x9f/0x630 [ldiskfs]
[1751518.854100] [<ffffffff9e45ff34>] evict+0xb4/0x180
[1751518.859095] [<ffffffff9e46003e>] dispose_list+0x3e/0x50
[1751518.864588] [<ffffffff9e46104c>] prune_icache_sb+0x17c/0x330
[1751518.870517] [<ffffffff9e4453b3>] prune_super+0x173/0x190
[1751518.876097] [<ffffffff9e3cb155>] shrink_slab+0x175/0x340
[1751518.881695] [<ffffffff9e438731>] ? vmpressure+0x21/0x90
[1751518.887205] [<ffffffff9e3cef51>] balance_pgdat+0x4b1/0x5e0
[1751518.892970] [<ffffffff9e3cf1f3>] kswapd+0x173/0x440
[1751518.898136] [<ffffffff9e2c3f50>] ? wake_up_atomic_t+0x30/0x30
[1751518.904172] [<ffffffff9e3cf080>] ? balance_pgdat+0x5e0/0x5e0
[1751518.910113] [<ffffffff9e2c2e81>] kthread+0xd1/0xe0
[1751518.915199] [<ffffffff9e2c2db0>] ? insert_kthread_work+0x40/0x40
[1751518.921484] [<ffffffff9e977c24>] ret_from_fork_nospec_begin+0xe/0x21
[1751518.928125] [<ffffffff9e2c2db0>] ? insert_kthread_work+0x40/0x40
[1751518.934559] INFO: task jbd2/md4-8:48176 blocked for more than 120 seconds.

I took a crash dump, which is available on the Whamcloud FTP as fir-io5-s1_20200407_vmcore with the following signature:

KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 48
DATE: Tue Apr 7 21:51:23 2020
UPTIME: 20 days, 08:21:02
LOAD AVERAGE: 918.58, 918.33, 900.42
TASKS: 2426
NODENAME: fir-io5-s1
RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
MACHINE: x86_64 (1996 Mhz)
MEMORY: 255.6 GB
PANIC: "SysRq : Trigger a crash"
PID: 96554
COMMAND: "bash"
TASK: ffff9ff7679c4100 [THREAD_INFO: ffff9ff9e4e74000]
CPU: 23
STATE: TASK_RUNNING (SYSRQ)
I'm also attaching the output of foreach bt for convenience as fir-io5-s1_20200407_vmcore-dmesg.txt. Could you help me figure out what went wrong here? A restart of the OSS fixed the issue, even though the recovery took 15 minutes, but then later an MDS was identified as slow and issued a few backtraces. I'm preparing another ticket regarding that MDS issue. All servers are running 2.12.4; clients on Sherlock are running 2.13. |
| Comments |
| Comment by James Nunez (Inactive) [ 10/Apr/20 ] |
|
Dongyang - Would you please look into this issue? |
| Comment by Dongyang Li [ 16/Apr/20 ] |
PID: 48944 TASK: ffff9fcc2c516180 CPU: 17 COMMAND: "ll_ost_io01_008"
#0 [ffff9fdbdfe9f498] __schedule at ffffffff9e96aa72
#1 [ffff9fdbdfe9f528] schedule at ffffffff9e96af19
#2 [ffff9fdbdfe9f538] bitmap_startwrite at ffffffff9e79cc65
#3 [ffff9fdbdfe9f5b8] add_stripe_bio at ffffffffc1305471 [raid456]
#4 [ffff9fdbdfe9f618] raid5_make_request at ffffffffc130adb4 [raid456]
#5 [ffff9fdbdfe9f700] md_handle_request at ffffffff9e78fa30
#6 [ffff9fdbdfe9f770] md_make_request at ffffffff9e78fb99
#7 [ffff9fdbdfe9f798] generic_make_request at ffffffff9e546e37
#8 [ffff9fdbdfe9f7e8] submit_bio at ffffffff9e5470e0
#9 [ffff9fdbdfe9f840] osd_submit_bio at ffffffffc13a7b4c [osd_ldiskfs]
#10 [ffff9fdbdfe9f850] osd_do_bio at ffffffffc13aa9a0 [osd_ldiskfs]
#11 [ffff9fdbdfe9f9d0] osd_write_commit at ffffffffc13ab89c [osd_ldiskfs]
#12 [ffff9fdbdfe9fa50] ofd_commitrw_write at ffffffffc14f4c8e [ofd]
#13 [ffff9fdbdfe9fad8] ofd_commitrw at ffffffffc14f8c6c [ofd]
#14 [ffff9fdbdfe9fb58] tgt_brw_write at ffffffffc0f68ddd [ptlrpc]
#15 [ffff9fdbdfe9fcd0] tgt_request_handle at ffffffffc0f6d64a [ptlrpc]
#16 [ffff9fdbdfe9fd58] ptlrpc_server_handle_request at ffffffffc0f1043b [ptlrpc]
#17 [ffff9fdbdfe9fdf8] ptlrpc_main at ffffffffc0f13da4 [ptlrpc]
#18 [ffff9fdbdfe9fec8] kthread at ffffffff9e2c2e81
#19 [ffff9fdbdfe9ff50] ret_from_fork_nospec_begin at ffffffff9e977c24
This thread is already holding a jbd2 transaction handle, but it cannot complete because it is stuck writing the raid5/md bitmap; it looks like it's waiting on bitmap->overflow_wait and never being woken up. So every other thread trying to get a new jbd2 transaction handle has to wait for it, including the kswapd thread you mentioned. |
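To picture the dependency described above, here is a minimal user-space sketch (purely illustrative; journal_mutex, overflow_cond and the thread names are hypothetical stand-ins, not the kernel's actual locking): one thread takes the "journal" and then waits for an "overflow" event that is never signalled, so any other thread that needs the journal, like the kswapd stand-in here, blocks behind it indefinitely.

/* Toy model of the stall: an I/O thread holds the "journal" and waits
 * forever on an "overflow" condition that nobody signals, so a "kswapd"
 * thread that also needs the journal blocks behind it.
 * All names are hypothetical; this is not kernel code. Build with gcc -pthread.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t journal_mutex = PTHREAD_MUTEX_INITIALIZER; /* stands in for the running jbd2 transaction */
static pthread_mutex_t overflow_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  overflow_cond = PTHREAD_COND_INITIALIZER;  /* stands in for bitmap->overflow_wait */
static int overflow_cleared = 0;                                  /* never set, so the waiter sleeps forever */

static void *ost_io_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&journal_mutex);            /* "holds a jbd2 handle" */
    printf("ll_ost_io: holding the journal, waiting for the bitmap overflow to clear...\n");
    pthread_mutex_lock(&overflow_lock);
    while (!overflow_cleared)                      /* the wakeup never arrives */
        pthread_cond_wait(&overflow_cond, &overflow_lock);
    pthread_mutex_unlock(&overflow_lock);
    pthread_mutex_unlock(&journal_mutex);
    return NULL;
}

static void *kswapd_thread(void *arg)
{
    (void)arg;
    sleep(1);                                      /* let the I/O thread grab the journal first */
    printf("kswapd: evicting an inode, needs a journal handle...\n");
    pthread_mutex_lock(&journal_mutex);            /* blocks behind ll_ost_io forever */
    printf("kswapd: never printed\n");
    pthread_mutex_unlock(&journal_mutex);
    return NULL;
}

int main(void)
{
    pthread_t io, kswapd;

    pthread_create(&io, NULL, ost_io_thread, NULL);
    pthread_create(&kswapd, NULL, kswapd_thread, NULL);
    sleep(3);
    printf("both threads are now stuck, mirroring the hung-task reports above\n");
    return 0;                                      /* exit without joining; the threads would never finish */
}

The program exits after a few seconds with both worker threads still blocked, which is the same shape as the hung-task reports in the description.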
| Comment by Stephane Thiell [ 16/Apr/20 ] |
|
Hi Dongyang, many thanks for your help! So if I understand correctly, this could be a Linux mdraid problem. I found this linux-raid thread that looks similar:
| Comment by Dongyang Li [ 17/Apr/20 ] |
|
Correct, it's more like an md problem. The linux-raid thread is about raid1, which is similar but not exactly the same; the follow-up also mentions there could be faulty devices, but I didn't see anything related in your logs. |
| Comment by Stephane Thiell [ 20/Apr/20 ] |
|
I agree, it's not exactly the same problem nor the same conditions, but there is one similarity that caught my attention: the endless waiting on overflow_wait. Because we're using CentOS, I don't think there is an efficient way to report this upstream though. Let's just hope this won't happen very often (so far, we've seen only one occurrence). |
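In the meantime, one way to spot a recurrence early is to watch for a pile-up of tasks in uninterruptible sleep, which was the visible symptom here (hung-task messages and a load average around 918). Below is a minimal sketch I put together for illustration, not something from the crash dump; it scans /proc/<pid>/stat and the threshold is an arbitrary placeholder.

/* Minimal sketch: count tasks currently in uninterruptible sleep ('D' state)
 * by reading /proc/<pid>/stat. Counts thread-group leaders only; per-thread
 * state would also need /proc/<pid>/task/<tid>/stat.
 */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    int d_count = 0;

    if (!proc) {
        perror("opendir /proc");
        return 1;
    }
    while ((de = readdir(proc)) != NULL) {
        char path[300], line[512];
        FILE *f;

        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                           /* skip non-PID entries */
        snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;                           /* task may have exited */
        if (fgets(line, sizeof(line), f)) {
            /* the state field follows the last ')' (comm may contain spaces) */
            char *p = strrchr(line, ')');
            char state = 0;
            if (p && sscanf(p + 1, " %c", &state) == 1 && state == 'D')
                d_count++;
        }
        fclose(f);
    }
    closedir(proc);

    printf("tasks in uninterruptible sleep: %d\n", d_count);
    if (d_count > 50)                           /* arbitrary placeholder threshold */
        printf("WARNING: many blocked tasks; check dmesg for hung-task reports\n");
    return 0;
}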