[LU-15340] client stuck unable to complete eviction with "still on delayed list" messages printed - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.15.0
Affects Version/s: Lustre 2.15.0
Labels:
None

Severity:
3

Description

Sometimes a client undergoing eviction gets stuck and cannot complete it. The symptoms are the ll_imp_inval thread printing "still on delayed list" for one or more stuck RPCs, and an import shown in the EVICTED state in the device list.

The problem seems to be a deadlock: a request on the delayed list has been signalled to complete, but the ptlrpcd thread that is supposed to take care of it is itself stuck waiting for the request to finish, like this:

PID: 1931   TASK: ffff8800d60a9110  CPU: 3   COMMAND: "ptlrpcd_01_01"
 #0 [ffff8800c8cfb3e8] __schedule at ffffffff817e3e22
 #1 [ffff8800c8cfb450] schedule at ffffffff817e4339
 #2 [ffff8800c8cfb460] osc_extent_wait at ffffffffa086a0cd [osc]
 #3 [ffff8800c8cfb590] osc_cache_wait_range at ffffffffa086c5ad [osc]
 #4 [ffff8800c8cfb688] osc_cache_writeback_range at ffffffffa086d56e [osc]
 #5 [ffff8800c8cfb7d0] osc_io_fsync_start at ffffffffa085b735 [osc]
 #6 [ffff8800c8cfb810] cl_io_start at ffffffffa0325a8d [obdclass]
 #7 [ffff8800c8cfb840] lov_io_call at ffffffffa08ca9f5 [lov]
 #8 [ffff8800c8cfb878] lov_io_start at ffffffffa08cabc6 [lov]
 #9 [ffff8800c8cfb898] cl_io_start at ffffffffa0325a8d [obdclass]
#10 [ffff8800c8cfb8c8] cl_io_loop at ffffffffa032803f [obdclass]
#11 [ffff8800c8cfb900] cl_sync_file_range at ffffffffa0e0b7eb [lustre]
#12 [ffff8800c8cfb958] ll_delete_inode at ffffffffa0e2686c [lustre]
#13 [ffff8800c8cfb970] evict at ffffffff81263a8f
#14 [ffff8800c8cfb998] iput at ffffffff81263ec5
#15 [ffff8800c8cfb9c8] __dentry_kill at ffffffff8125efc8
#16 [ffff8800c8cfb9f0] dput at ffffffff8125f78a
#17 [ffff8800c8cfba20] ll_dirty_page_discard_warn at ffffffffa0e2c205 [lustre]
#18 [ffff8800c8cfba90] vvp_page_completion_write at ffffffffa0e5a214 [lustre]
#19 [ffff8800c8cfbac0] cl_page_completion at ffffffffa03205e8 [obdclass]
#20 [ffff8800c8cfbb18] osc_ap_completion at ffffffffa08609b9 [osc]
#21 [ffff8800c8cfbb60] osc_extent_finish at ffffffffa0867792 [osc]
#22 [ffff8800c8cfbc60] brw_interpret at ffffffffa0849ee9 [osc]
#23 [ffff8800c8cfbcd8] ptlrpc_check_set at ffffffffa05fe4da [ptlrpc]
#24 [ffff8800c8cfbd90] ptlrpcd at ffffffffa062f014 [ptlrpc]
#25 [ffff8800c8cfbea8] kthread at ffffffff810ba114
#26 [ffff8800c8cfbf50] ret_from_fork_nospec_begin at ffffffff817f1e5d

Of course, blocking in a ptlrpcd thread is a big no-no, precisely because it leads to this sort of deadlock.

It sounds like we need to move the ll_dirty_page_discard_warn() call from vvp_vmpage_error() into a separate thread, so that we do not block brw_interpret, where it is normally called from. Or perhaps it is enough to move just the dput() in there?
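
The hand-off idea can be sketched in userspace C with pthreads. This is only a model under stated assumptions, not Lustre code and not the fix that was landed: work_queue, wq_defer(), blocking_release(), struct request, and run_model() are all hypothetical names, and a kernel change would presumably use a kernel deferral mechanism rather than pthreads. The completion path queues the blocking release for a worker and returns, so the ptlrpcd-like thread stays free to finish the delayed request.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model, not Lustre code: one unit of deferred work. */
struct work {
    void (*fn)(void *);
    void *arg;
    struct work *next;
};

struct work_queue {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    struct work *head;      /* LIFO for brevity; order is irrelevant here */
    bool stopping;
};

/* Stand-in for the RPC left on the delayed list. */
struct request {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool done;              /* set by the ptlrpcd-like thread */
    bool released;          /* set once the blocking release returned */
};

/* Worker thread: the only context in this model that may sleep on a request. */
static void *worker_main(void *data)
{
    struct work_queue *wq = data;
    pthread_mutex_lock(&wq->lock);
    for (;;) {
        while (wq->head == NULL && !wq->stopping)
            pthread_cond_wait(&wq->cond, &wq->lock);
        if (wq->head == NULL)
            break;          /* stopping and fully drained */
        struct work *w = wq->head;
        wq->head = w->next;
        pthread_mutex_unlock(&wq->lock);
        w->fn(w->arg);      /* may sleep; the thread that queued it does not */
        free(w);
        pthread_mutex_lock(&wq->lock);
    }
    pthread_mutex_unlock(&wq->lock);
    return NULL;
}

/* Queue fn(arg) for the worker and return without waiting for it. */
static int wq_defer(struct work_queue *wq, void (*fn)(void *), void *arg)
{
    struct work *w = malloc(sizeof(*w));
    if (w == NULL)
        return -1;
    w->fn = fn;
    w->arg = arg;
    pthread_mutex_lock(&wq->lock);
    w->next = wq->head;
    wq->head = w;
    pthread_cond_signal(&wq->cond);
    pthread_mutex_unlock(&wq->lock);
    return 0;
}

/* Models the dput() -> ... -> osc_extent_wait() chain: sleeps until done. */
static void blocking_release(void *arg)
{
    struct request *req = arg;
    pthread_mutex_lock(&req->lock);
    while (!req->done)
        pthread_cond_wait(&req->cond, &req->lock);
    req->released = true;
    pthread_mutex_unlock(&req->lock);
}

/* Plays the ptlrpcd role; returns 1 if the deferred release completed. */
static int run_model(void)
{
    struct work_queue wq = { .lock = PTHREAD_MUTEX_INITIALIZER, .cond = PTHREAD_COND_INITIALIZER };
    struct request req = { .lock = PTHREAD_MUTEX_INITIALIZER, .cond = PTHREAD_COND_INITIALIZER };
    pthread_t worker;

    if (pthread_create(&worker, NULL, worker_main, &wq) != 0)
        return 0;
    /* Completion callback: hand the blocking part off instead of running it. */
    int queued = wq_defer(&wq, blocking_release, &req) == 0;

    /* Not blocked, so this thread can still finish the delayed request. */
    pthread_mutex_lock(&req.lock);
    req.done = true;
    pthread_cond_signal(&req.cond);
    pthread_mutex_unlock(&req.lock);

    /* Let the worker drain the queue, then stop it. */
    pthread_mutex_lock(&wq.lock);
    wq.stopping = true;
    pthread_cond_signal(&wq.cond);
    pthread_mutex_unlock(&wq.lock);
    pthread_join(worker, NULL);
    return queued && req.released;
}
```

In this model run_model() returns 1 when the deferred release ran to completion. If the completion path called blocking_release() inline instead of deferring it, the same thread would go to sleep before it could set done, which is the shape of the hang in the trace above.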

Issue Links

is related to

LU-15127 import invalidation vs writeback deadlock

Resolved

People

Assignee: Oleg Drokin

Reporter: Oleg Drokin

Votes: 0

Watchers: 4

Dates

Created: 08/Dec/21 4:14 AM

Updated: 24/Jun/22 6:04 PM

Resolved: 31/Jan/22 4:30 AM