[LU-13442] OSS high load and deadlock after 20 days of run time Created: 09/Apr/20  Updated: 20/Apr/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Dongyang Li
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.6 (3.10.0-957.27.2.el7_lustre.pl2.x86_64)


Attachments: Text File fir-io5-s1_20200407_foreach_bt.txt     Text File fir-io5-s1_20200407_vmcore-dmesg.txt    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On Fir, running vanilla Lustre 2.12.4, we hit what looks like a deadlock on April 7, after 20 days of otherwise great runtime.

The first backtrace shows kswapd reclaiming memory via shrink_slab and ending up blocked in jbd2 (see the note after the trace):

[1749118.074334] Lustre: fir-OST0030: Connection restored to f4363950-d6c3-4 (at 10.50.9.37@o2ib2)
[1749118.083036] Lustre: Skipped 5 previous similar messages
[1751518.726158] INFO: task kswapd0:276 blocked for more than 120 seconds.
[1751518.732778] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1751518.740788] kswapd0         D ffff9fdbf9d74100     0   276      2 0x00000000
[1751518.748096] Call Trace:
[1751518.750732]  [<ffffffff9e2e367f>] ? enqueue_entity+0x2ef/0xbe0
[1751518.756778]  [<ffffffff9e96af19>] schedule+0x29/0x70
[1751518.761950]  [<ffffffffc0295085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[1751518.769035]  [<ffffffff9e2c3f50>] ? wake_up_atomic_t+0x30/0x30
[1751518.775088]  [<ffffffffc0295368>] add_transaction_credits+0x268/0x2f0 [jbd2]
[1751518.782391]  [<ffffffffc02955e1>] start_this_handle+0x1a1/0x430 [jbd2]
[1751518.789096]  [<ffffffff9e41d7f2>] ? kmem_cache_alloc+0x1c2/0x1f0
[1751518.795317]  [<ffffffffc0295a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
[1751518.802119]  [<ffffffffc1293c0c>] ? ldiskfs_release_dquot+0x5c/0xa0 [ldiskfs]
[1751518.809470]  [<ffffffffc129c589>] __ldiskfs_journal_start_sb+0x69/0xe0 [ldiskfs]
[1751518.817072]  [<ffffffffc1293c0c>] ldiskfs_release_dquot+0x5c/0xa0 [ldiskfs]
[1751518.824232]  [<ffffffff9e4afc1d>] dqput+0xbd/0x1f0
[1751518.829216]  [<ffffffff9e4b12d5>] __dquot_drop+0x85/0xc0
[1751518.834723]  [<ffffffff9e4b1345>] dquot_drop+0x35/0x40
[1751518.840067]  [<ffffffffc1298602>] ldiskfs_clear_inode+0x22/0x90 [ldiskfs]
[1751518.847034]  [<ffffffffc127bf3f>] ldiskfs_evict_inode+0x9f/0x630 [ldiskfs]
[1751518.854100]  [<ffffffff9e45ff34>] evict+0xb4/0x180
[1751518.859095]  [<ffffffff9e46003e>] dispose_list+0x3e/0x50
[1751518.864588]  [<ffffffff9e46104c>] prune_icache_sb+0x17c/0x330
[1751518.870517]  [<ffffffff9e4453b3>] prune_super+0x173/0x190
[1751518.876097]  [<ffffffff9e3cb155>] shrink_slab+0x175/0x340
[1751518.881695]  [<ffffffff9e438731>] ? vmpressure+0x21/0x90
[1751518.887205]  [<ffffffff9e3cef51>] balance_pgdat+0x4b1/0x5e0
[1751518.892970]  [<ffffffff9e3cf1f3>] kswapd+0x173/0x440
[1751518.898136]  [<ffffffff9e2c3f50>] ? wake_up_atomic_t+0x30/0x30
[1751518.904172]  [<ffffffff9e3cf080>] ? balance_pgdat+0x5e0/0x5e0
[1751518.910113]  [<ffffffff9e2c2e81>] kthread+0xd1/0xe0
[1751518.915199]  [<ffffffff9e2c2db0>] ? insert_kthread_work+0x40/0x40
[1751518.921484]  [<ffffffff9e977c24>] ret_from_fork_nospec_begin+0xe/0x21
[1751518.928125]  [<ffffffff9e2c2db0>] ? insert_kthread_work+0x40/0x40
[1751518.934559] INFO: task jbd2/md4-8:48176 blocked for more than 120 seconds.
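
To spell out the dependency: inode reclaim (prune_icache_sb -> evict -> dquot_drop) drops the last reference on a quota structure, ldiskfs_release_dquot then needs a jbd2 handle to update the quota file, and start_this_handle blocks until the running transaction can accept new handles. The wait itself is roughly the following (a simplified paraphrase of jbd2's wait_transaction_locked from upstream kernels of this vintage, not the exact RHEL 3.10.0-957 source):

static void wait_transaction_locked(journal_t *journal)
	__releases(journal->j_state_lock)
{
	DEFINE_WAIT(wait);
	tid_t tid = journal->j_running_transaction->t_tid;

	/* queue ourselves before dropping j_state_lock so we cannot
	 * miss the wakeup issued when the transaction commits */
	prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
			TASK_UNINTERRUPTIBLE);
	read_unlock(&journal->j_state_lock);
	jbd2_log_start_commit(journal, tid);
	schedule();
	finish_wait(&journal->j_wait_transaction_locked, &wait);
}

In other words, kswapd cannot make progress until the current transaction commits, and the commit cannot complete until every thread still holding an open handle on that transaction stops it.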

I took a crash dump, which is available on the Whamcloud FTP as fir-io5-s1_20200407_vmcore with the following signature:

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 48
        DATE: Tue Apr  7 21:51:23 2020
      UPTIME: 20 days, 08:21:02
LOAD AVERAGE: 918.58, 918.33, 900.42
       TASKS: 2426
    NODENAME: fir-io5-s1
     RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
     VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
     MACHINE: x86_64  (1996 Mhz)
      MEMORY: 255.6 GB
       PANIC: "SysRq : Trigger a crash"
         PID: 96554
     COMMAND: "bash"
        TASK: ffff9ff7679c4100  [THREAD_INFO: ffff9ff9e4e74000]
         CPU: 23
       STATE: TASK_RUNNING (SYSRQ)

I'm also attaching the output of foreach bt for convenience as fir-io5-s1_20200407_foreach_bt.txt, along with the kernel log from the dump as fir-io5-s1_20200407_vmcore-dmesg.txt.
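
For reference, a crash(8) session roughly like the one below reproduces these artifacts (sys prints the summary above, log dumps the kernel ring buffer from the dump, foreach bt prints every task's backtrace; the vmlinux path is the one shown in the signature):

      # crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux vmcore
      crash> sys
      crash> log > fir-io5-s1_20200407_vmcore-dmesg.txt
      crash> foreach bt > fir-io5-s1_20200407_foreach_bt.txt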

Could you help me figure out what went wrong here?

A restart of the OSS fixed the issue, although the recovery took 15 minutes; later, an MDS was identified as slow and emitted a few backtraces. I'm preparing another ticket for that MDS issue. All servers run 2.12.4; the clients on Sherlock run 2.13.



 Comments   
Comment by James Nunez (Inactive) [ 10/Apr/20 ]

Dongyang - Would you please look into this issue?

Comment by Dongyang Li [ 16/Apr/20 ]

PID: 48944  TASK: ffff9fcc2c516180  CPU: 17  COMMAND: "ll_ost_io01_008"
 #0 [ffff9fdbdfe9f498] __schedule at ffffffff9e96aa72
 #1 [ffff9fdbdfe9f528] schedule at ffffffff9e96af19
 #2 [ffff9fdbdfe9f538] bitmap_startwrite at ffffffff9e79cc65
 #3 [ffff9fdbdfe9f5b8] add_stripe_bio at ffffffffc1305471 [raid456]
 #4 [ffff9fdbdfe9f618] raid5_make_request at ffffffffc130adb4 [raid456]
 #5 [ffff9fdbdfe9f700] md_handle_request at ffffffff9e78fa30
 #6 [ffff9fdbdfe9f770] md_make_request at ffffffff9e78fb99
 #7 [ffff9fdbdfe9f798] generic_make_request at ffffffff9e546e37
 #8 [ffff9fdbdfe9f7e8] submit_bio at ffffffff9e5470e0
 #9 [ffff9fdbdfe9f840] osd_submit_bio at ffffffffc13a7b4c [osd_ldiskfs]
#10 [ffff9fdbdfe9f850] osd_do_bio at ffffffffc13aa9a0 [osd_ldiskfs]
#11 [ffff9fdbdfe9f9d0] osd_write_commit at ffffffffc13ab89c [osd_ldiskfs]
#12 [ffff9fdbdfe9fa50] ofd_commitrw_write at ffffffffc14f4c8e [ofd]
#13 [ffff9fdbdfe9fad8] ofd_commitrw at ffffffffc14f8c6c [ofd]
#14 [ffff9fdbdfe9fb58] tgt_brw_write at ffffffffc0f68ddd [ptlrpc]
#15 [ffff9fdbdfe9fcd0] tgt_request_handle at ffffffffc0f6d64a [ptlrpc]
#16 [ffff9fdbdfe9fd58] ptlrpc_server_handle_request at ffffffffc0f1043b [ptlrpc]
#17 [ffff9fdbdfe9fdf8] ptlrpc_main at ffffffffc0f13da4 [ptlrpc]
#18 [ffff9fdbdfe9fec8] kthread at ffffffff9e2c2e81
#19 [ffff9fdbdfe9ff50] ret_from_fork_nospec_begin at ffffffff9e977c24

This thread is already holding a jbd2 transaction handle, but the transaction cannot complete because the thread is stuck writing the raid5/md bitmap; it looks like it is waiting on bitmap->overflow_wait and never being woken up.

So every other thread trying to get a new jbd2 transaction handle has to wait for it, including the kswapd thread you mentioned.
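
For reference, the wait/wake pair in question looks roughly like this in drivers/md/bitmap.c of this kernel generation (paraphrased and trimmed, not the exact RHEL source): bitmap_startwrite() sleeps when a chunk's in-flight write counter saturates at COUNTER_MAX, and the only matching wake_up() is in bitmap_endwrite():

	/* bitmap_startwrite(): accounting for a write entering the array */
	if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
		DEFINE_WAIT(__wait);
		/* this chunk already has the maximum number of writes
		 * in flight: sleep until one of them completes */
		prepare_to_wait(&bitmap->overflow_wait, &__wait,
				TASK_UNINTERRUPTIBLE);
		spin_unlock_irq(&bitmap->counts.lock);
		schedule();
		finish_wait(&bitmap->overflow_wait, &__wait);
		continue;
	}
	(*bmc)++;

	/* bitmap_endwrite(): a write completed, drop the counter and
	 * wake anyone blocked on the saturated chunk */
	if (COUNTER(*bmc) == COUNTER_MAX)
		wake_up(&bitmap->overflow_wait);
	(*bmc)--;

If that wakeup never arrives, e.g. because the writes that would decrement the counter are themselves stuck, ll_ost_io01_008 sleeps in D state forever with its jbd2 handle open, which matches the backtraces here.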

Comment by Stephane Thiell [ 16/Apr/20 ]

Hi Dongyang,

Many thanks for your help! So if I understand correctly, this could be a Linux mdraid problem.

I found this linux-raid thread that looks similar:

https://www.spinics.net/lists/raid/msg54720.html

Comment by Dongyang Li [ 17/Apr/20 ]

Correct, it looks more like an md problem.

The linux-raid thread is about raid1, so it is similar but not exactly the same; the follow-up also mentioned there could be faulty devices, but I didn't see anything related in your logs.

Comment by Stephane Thiell [ 20/Apr/20 ]

I agree, it's not exactly the same problem nor the same conditions, but there is one similarity that caught my attention: the endless waiting on overflow_wait.

Because we're using CentOS, I don't think there is an efficient way to report this upstream though. Let's just hope this won't happen very often (so far, we've seen only one occurrence).
