[LU-15340] client stuck unable to complete eviction with "still on delayed list" messages printed Created: 08/Dec/21  Updated: 24/Jun/22  Resolved: 31/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-15127 import invalidation vs writeback dead... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Sometimes a client undergoing eviction gets stuck and is unable to complete it. The symptoms include the ll_imp_inval thread printing "still on delayed list" for some stuck RPC(s) and an import showing in the EVICTED state in the device list.

The problem seems to be a deadlock between a request on the delayed list that was signalled to complete and the ptlrpcd thread that is supposed to take care of it, which is itself stuck waiting for the request to finish, like this:

PID: 1931   TASK: ffff8800d60a9110  CPU: 3   COMMAND: "ptlrpcd_01_01"
 #0 [ffff8800c8cfb3e8] __schedule at ffffffff817e3e22
 #1 [ffff8800c8cfb450] schedule at ffffffff817e4339
 #2 [ffff8800c8cfb460] osc_extent_wait at ffffffffa086a0cd [osc]
 #3 [ffff8800c8cfb590] osc_cache_wait_range at ffffffffa086c5ad [osc]
 #4 [ffff8800c8cfb688] osc_cache_writeback_range at ffffffffa086d56e [osc]
 #5 [ffff8800c8cfb7d0] osc_io_fsync_start at ffffffffa085b735 [osc]
 #6 [ffff8800c8cfb810] cl_io_start at ffffffffa0325a8d [obdclass]
 #7 [ffff8800c8cfb840] lov_io_call at ffffffffa08ca9f5 [lov]
 #8 [ffff8800c8cfb878] lov_io_start at ffffffffa08cabc6 [lov]
 #9 [ffff8800c8cfb898] cl_io_start at ffffffffa0325a8d [obdclass]
#10 [ffff8800c8cfb8c8] cl_io_loop at ffffffffa032803f [obdclass]
#11 [ffff8800c8cfb900] cl_sync_file_range at ffffffffa0e0b7eb [lustre]
#12 [ffff8800c8cfb958] ll_delete_inode at ffffffffa0e2686c [lustre]
#13 [ffff8800c8cfb970] evict at ffffffff81263a8f
#14 [ffff8800c8cfb998] iput at ffffffff81263ec5
#15 [ffff8800c8cfb9c8] __dentry_kill at ffffffff8125efc8
#16 [ffff8800c8cfb9f0] dput at ffffffff8125f78a
#17 [ffff8800c8cfba20] ll_dirty_page_discard_warn at ffffffffa0e2c205 [lustre]
#18 [ffff8800c8cfba90] vvp_page_completion_write at ffffffffa0e5a214 [lustre]
#19 [ffff8800c8cfbac0] cl_page_completion at ffffffffa03205e8 [obdclass]
#20 [ffff8800c8cfbb18] osc_ap_completion at ffffffffa08609b9 [osc]
#21 [ffff8800c8cfbb60] osc_extent_finish at ffffffffa0867792 [osc]
#22 [ffff8800c8cfbc60] brw_interpret at ffffffffa0849ee9 [osc]
#23 [ffff8800c8cfbcd8] ptlrpc_check_set at ffffffffa05fe4da [ptlrpc]
#24 [ffff8800c8cfbd90] ptlrpcd at ffffffffa062f014 [ptlrpc]
#25 [ffff8800c8cfbea8] kthread at ffffffff810ba114
#26 [ffff8800c8cfbf50] ret_from_fork_nospec_begin at ffffffff817f1e5d 

Of course, blocking in a ptlrpcd thread is a big no-no, precisely because it leads to this sort of deadlock.

It sounds like we need to kick the ll_dirty_page_discard_warn() call from vvp_vmpage_error() into a separate thread to ensure we are not blocking brw_interpret, where this is normally called from. Or perhaps just the dput in there?
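
To illustrate the "just the dput" idea, here is a minimal userspace sketch of deferring a possibly blocking final put out of a completion path. It is not the landed patch and does not use any Lustre or kernel API: struct ref_obj, struct put_queue, ref_put_blocking(), ref_put_deferred() and put_queue_drain() are all hypothetical names invented for this example, and the queue is single-threaded for simplicity (a real version would need locking and a worker context to drain it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted object such as a dentry. */
struct ref_obj {
	int refcount;
	int released;	/* set to 1 once the final put has run */
};

/* One queued "put this object later" request. */
struct deferred_put {
	struct ref_obj *obj;
	struct deferred_put *next;
};

/* FIFO of deferred puts; assumed to be drained from a context that may block. */
struct put_queue {
	struct deferred_put *head;
	struct deferred_put *tail;
	size_t pending;
};

static void put_queue_init(struct put_queue *q)
{
	q->head = NULL;
	q->tail = NULL;
	q->pending = 0;
}

/*
 * Models the put that may block: in the trace above, the final dput()
 * led to evict() and a synchronous writeback wait.
 */
static void ref_put_blocking(struct ref_obj *obj)
{
	assert(obj->refcount > 0);
	if (--obj->refcount == 0)
		obj->released = 1;
}

/*
 * Called from the completion path, which must not block: only records
 * the request. Returns 0 on success, -1 if allocation fails.
 */
static int ref_put_deferred(struct put_queue *q, struct ref_obj *obj)
{
	struct deferred_put *w = malloc(sizeof(*w));

	if (w == NULL)
		return -1;
	w->obj = obj;
	w->next = NULL;
	if (q->tail != NULL)
		q->tail->next = w;
	else
		q->head = w;
	q->tail = w;
	q->pending++;
	return 0;
}

/*
 * Called from a separate worker context where blocking is acceptable.
 * Returns the number of puts that were performed.
 */
static size_t put_queue_drain(struct put_queue *q)
{
	struct deferred_put *w = q->head;
	size_t done = 0;

	put_queue_init(q);
	while (w != NULL) {
		struct deferred_put *next = w->next;

		ref_put_blocking(w->obj);
		free(w);
		w = next;
		done++;
	}
	return done;
}
```

The point of the split is that the completion path only calls ref_put_deferred(), so the object stays referenced and nothing that can wait on I/O runs there; the blocking work happens later in put_queue_drain().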



 Comments   
Comment by Gerrit Updater [ 08/Dec/21 ]

"Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45784
Subject: LU-15340 llite: Delay dput in ll_dirty_page_discard_warn
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6bf95c9aa5506b7a13fe7148743cf64d28c54ea2

Comment by Gerrit Updater [ 25/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46296
Subject: LU-15340 llite: Reuse existing inode for io warning print
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1bec36589f16038d66051af5be25e8deb94a7098

Comment by Gerrit Updater [ 31/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45784/
Subject: LU-15340 llite: Delay dput in ll_dirty_page_discard_warn
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a1d75780ba19cfca53cbacf0d38e8d7df540b209

Comment by Peter Jones [ 31/Jan/22 ]

Landed for 2.15

Comment by Gerrit Updater [ 14/Mar/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46818
Subject: LU-15340 llite: Delay dput in ll_dirty_page_discard_warn
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: dc5d2593e1d85ee641ee6de72ad55437cdff75c2
