Details
- Bug
- Resolution: Fixed
- Critical
- Lustre 2.15.0
- None
- 3
- 9223372036854775807
Description
Sometimes a client undergoing eviction gets stuck and is unable to complete it. The symptoms include the ll_imp_inval thread printing "still on delayed list" for some stuck RPC(s) and an import showing in the EVICTED state in the device list.
The problem seems to be a deadlock between a request on the delayed list that was signalled to complete and the ptlrpcd thread that is supposed to take care of it, which is itself stuck waiting for the request to finish, like this:
PID: 1931 TASK: ffff8800d60a9110 CPU: 3 COMMAND: "ptlrpcd_01_01"
 #0 [ffff8800c8cfb3e8] __schedule at ffffffff817e3e22
 #1 [ffff8800c8cfb450] schedule at ffffffff817e4339
 #2 [ffff8800c8cfb460] osc_extent_wait at ffffffffa086a0cd [osc]
 #3 [ffff8800c8cfb590] osc_cache_wait_range at ffffffffa086c5ad [osc]
 #4 [ffff8800c8cfb688] osc_cache_writeback_range at ffffffffa086d56e [osc]
 #5 [ffff8800c8cfb7d0] osc_io_fsync_start at ffffffffa085b735 [osc]
 #6 [ffff8800c8cfb810] cl_io_start at ffffffffa0325a8d [obdclass]
 #7 [ffff8800c8cfb840] lov_io_call at ffffffffa08ca9f5 [lov]
 #8 [ffff8800c8cfb878] lov_io_start at ffffffffa08cabc6 [lov]
 #9 [ffff8800c8cfb898] cl_io_start at ffffffffa0325a8d [obdclass]
#10 [ffff8800c8cfb8c8] cl_io_loop at ffffffffa032803f [obdclass]
#11 [ffff8800c8cfb900] cl_sync_file_range at ffffffffa0e0b7eb [lustre]
#12 [ffff8800c8cfb958] ll_delete_inode at ffffffffa0e2686c [lustre]
#13 [ffff8800c8cfb970] evict at ffffffff81263a8f
#14 [ffff8800c8cfb998] iput at ffffffff81263ec5
#15 [ffff8800c8cfb9c8] __dentry_kill at ffffffff8125efc8
#16 [ffff8800c8cfb9f0] dput at ffffffff8125f78a
#17 [ffff8800c8cfba20] ll_dirty_page_discard_warn at ffffffffa0e2c205 [lustre]
#18 [ffff8800c8cfba90] vvp_page_completion_write at ffffffffa0e5a214 [lustre]
#19 [ffff8800c8cfbac0] cl_page_completion at ffffffffa03205e8 [obdclass]
#20 [ffff8800c8cfbb18] osc_ap_completion at ffffffffa08609b9 [osc]
#21 [ffff8800c8cfbb60] osc_extent_finish at ffffffffa0867792 [osc]
#22 [ffff8800c8cfbc60] brw_interpret at ffffffffa0849ee9 [osc]
#23 [ffff8800c8cfbcd8] ptlrpc_check_set at ffffffffa05fe4da [ptlrpc]
#24 [ffff8800c8cfbd90] ptlrpcd at ffffffffa062f014 [ptlrpc]
#25 [ffff8800c8cfbea8] kthread at ffffffff810ba114
#26 [ffff8800c8cfbf50] ret_from_fork_nospec_begin at ffffffff817f1e5d
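Of course, blocking in a ptlrpcd thread is a big no-no, exactly because of this sort of deadlock. In the trace, the dput() in ll_dirty_page_discard_warn() drops the last reference, so ll_delete_inode() starts a synchronous writeback and the ptlrpcd thread ends up sleeping in osc_extent_wait().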
It sounds like we need to move the ll_dirty_page_discard_warn() call from vvp_vmpage_error() into a separate thread to ensure we are not blocking brw_interpret, where this is normally called from. Or perhaps just the dput() in there?
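The idea of handing the potentially blocking release off to another context can be sketched roughly as follows. This is a minimal single-threaded userspace model, not Lustre code and not the fix that was applied: `fake_inode`, `put_queue`, `completion_defer_put()`, `blocking_put()` and `worker_drain()` are all hypothetical names invented for illustration. The completion path only links an item onto a queue and returns; the release that may block runs later from a worker context. In the kernel the queue role would presumably be played by a workqueue or a dedicated thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode whose final release may block. */
struct fake_inode {
    int refcount;
    bool released;
};

/* One deferred "drop this reference" request. */
struct deferred_put {
    struct fake_inode *inode;
    struct deferred_put *next;
};

/* FIFO of pending releases (assumption: stands in for a kernel workqueue). */
struct put_queue {
    struct deferred_put *head;
    struct deferred_put *tail;
    size_t pending;
};

void put_queue_init(struct put_queue *q)
{
    q->head = NULL;
    q->tail = NULL;
    q->pending = 0;
}

/* Called from the completion path: never blocks, only links the item. */
void completion_defer_put(struct put_queue *q, struct deferred_put *item,
                          struct fake_inode *inode)
{
    item->inode = inode;
    item->next = NULL;
    if (q->tail)
        q->tail->next = item;
    else
        q->head = item;
    q->tail = item;
    q->pending++;
}

/* Models the release that could block (the dput() path in the trace). */
void blocking_put(struct fake_inode *inode)
{
    if (--inode->refcount == 0)
        inode->released = true;
}

/* Run from a separate worker context, where blocking is acceptable.
 * Returns the number of items processed. */
size_t worker_drain(struct put_queue *q)
{
    size_t done = 0;

    while (q->head) {
        struct deferred_put *item = q->head;

        q->head = item->next;
        if (!q->head)
            q->tail = NULL;
        q->pending--;
        blocking_put(item->inode);
        done++;
    }
    return done;
}
```

With this split, the completion handler returns right after `completion_defer_put()`, so the thread that must finish the outstanding requests never waits on them. A real implementation would also need locking around the queue and a way to wake the worker, both of which this sketch omits.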
Issue Links
- is related to LU-15127 import invalidation vs writeback deadlock (Resolved)