[LU-15340] client stuck unable to complete eviction with "still on delayed list" messages printed Created: 08/Dec/21 Updated: 24/Jun/22 Resolved: 31/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Sometimes a client undergoing eviction gets stuck and never completes it. The symptoms include the ll_imp_inval thread printing "still on delayed list" for some stuck RPC(s), and an import showing in the EVICTED state in the device list.

The problem appears to be a deadlock between a request on the delayed list that was signalled to complete and the ptlrpcd thread that is supposed to take care of it, which is instead stuck waiting for the request to finish, like this:

PID: 1931  TASK: ffff8800d60a9110  CPU: 3  COMMAND: "ptlrpcd_01_01"
 #0 [ffff8800c8cfb3e8] __schedule at ffffffff817e3e22
 #1 [ffff8800c8cfb450] schedule at ffffffff817e4339
 #2 [ffff8800c8cfb460] osc_extent_wait at ffffffffa086a0cd [osc]
 #3 [ffff8800c8cfb590] osc_cache_wait_range at ffffffffa086c5ad [osc]
 #4 [ffff8800c8cfb688] osc_cache_writeback_range at ffffffffa086d56e [osc]
 #5 [ffff8800c8cfb7d0] osc_io_fsync_start at ffffffffa085b735 [osc]
 #6 [ffff8800c8cfb810] cl_io_start at ffffffffa0325a8d [obdclass]
 #7 [ffff8800c8cfb840] lov_io_call at ffffffffa08ca9f5 [lov]
 #8 [ffff8800c8cfb878] lov_io_start at ffffffffa08cabc6 [lov]
 #9 [ffff8800c8cfb898] cl_io_start at ffffffffa0325a8d [obdclass]
#10 [ffff8800c8cfb8c8] cl_io_loop at ffffffffa032803f [obdclass]
#11 [ffff8800c8cfb900] cl_sync_file_range at ffffffffa0e0b7eb [lustre]
#12 [ffff8800c8cfb958] ll_delete_inode at ffffffffa0e2686c [lustre]
#13 [ffff8800c8cfb970] evict at ffffffff81263a8f
#14 [ffff8800c8cfb998] iput at ffffffff81263ec5
#15 [ffff8800c8cfb9c8] __dentry_kill at ffffffff8125efc8
#16 [ffff8800c8cfb9f0] dput at ffffffff8125f78a
#17 [ffff8800c8cfba20] ll_dirty_page_discard_warn at ffffffffa0e2c205 [lustre]
#18 [ffff8800c8cfba90] vvp_page_completion_write at ffffffffa0e5a214 [lustre]
#19 [ffff8800c8cfbac0] cl_page_completion at ffffffffa03205e8 [obdclass]
#20 [ffff8800c8cfbb18] osc_ap_completion at ffffffffa08609b9 [osc]
#21 [ffff8800c8cfbb60] osc_extent_finish at ffffffffa0867792 [osc]
#22 [ffff8800c8cfbc60] brw_interpret at ffffffffa0849ee9 [osc]
#23 [ffff8800c8cfbcd8] ptlrpc_check_set at ffffffffa05fe4da [ptlrpc]
#24 [ffff8800c8cfbd90] ptlrpcd at ffffffffa062f014 [ptlrpc]
#25 [ffff8800c8cfbea8] kthread at ffffffff810ba114
#26 [ffff8800c8cfbf50] ret_from_fork_nospec_begin at ffffffff817f1e5d

Of course, blocking in a ptlrpcd thread is a big no-no precisely because of this sort of deadlock. It sounds like we need to kick the ll_dirty_page_discard_warn() call from vvp_vmpage_error() into a separate thread, so that we do not block brw_interpret(), from where this is normally called. Or perhaps just the dput() in there? |
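As an illustration of the deferral idea suggested above (not the patch that actually landed, and not reviewed against the real Lustre call signatures), a minimal sketch using the standard Linux workqueue API could hand the warning and the dput() off to process context so the RPC completion path never sleeps. Everything except the kernel workqueue primitives and dput() is hypothetical; the ll_dirty_page_discard_warn() call is shown only as a commented placeholder since its exact arguments are not reproduced here.

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/dcache.h>

/* Hypothetical work item carrying the state needed for the warning. */
struct discard_warn_work {
	struct work_struct	 dww_work;
	struct dentry		*dww_dentry;	/* reference owned by the work item */
	int			 dww_ioret;
};

static void discard_warn_workfn(struct work_struct *work)
{
	struct discard_warn_work *dww =
		container_of(work, struct discard_warn_work, dww_work);

	/* Emit the warning and drop the dentry reference here, in worker
	 * context, so any resulting inode eviction/writeback wait cannot
	 * stall the ptlrpcd thread that completed the write RPC.
	 *
	 * ll_dirty_page_discard_warn(...);   placeholder, real arguments omitted
	 */
	dput(dww->dww_dentry);
	kfree(dww);
}

/* Hypothetical helper: called from the RPC completion path instead of the
 * direct ll_dirty_page_discard_warn() call; ownership of @dentry's
 * reference is transferred to the work item. */
static void queue_discard_warn(struct dentry *dentry, int ioret)
{
	struct discard_warn_work *dww;

	dww = kmalloc(sizeof(*dww), GFP_ATOMIC);
	if (!dww) {
		/* Fall back: drop the reference and skip the warning rather
		 * than sleep in ptlrpcd context. */
		dput(dentry);
		return;
	}
	dww->dww_dentry = dentry;
	dww->dww_ioret = ioret;
	INIT_WORK(&dww->dww_work, discard_warn_workfn);
	schedule_work(&dww->dww_work);
}

The design point is simply that anything able to trigger the evict -> cl_sync_file_range -> osc_extent_wait chain must run outside the ptlrpcd thread, whether that is the whole warning or, as the description also suggests, just the final dput().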
| Comments |
| Comment by Gerrit Updater [ 08/Dec/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45784 |
| Comment by Gerrit Updater [ 25/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46296 |
| Comment by Gerrit Updater [ 31/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45784/ |
| Comment by Peter Jones [ 31/Jan/22 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 14/Mar/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46818 |