[LU-15127] import invalidation vs writeback deadlock Created: 19/Oct/21 Updated: 14/Aug/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Alex Zhuravlev | Assignee: | Patrick Farrell |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
racer hits this deadlock a few times a day:

schedule,osc_extent_wait,osc_cache_wait_range,osc_cache_writeback_range,osc_io_fsync_start,cl_io_start,lov_io_call,cl_io_start,cl_io_loop,cl_sync_file_range,ll_delete_inode,evict,__dentry_kill,dentry_kill,dput,ll_dirty_page_discard_warn,vvp_page_completion_write,cl_page_completion,osc_ap_completion,osc_extent_finish,brw_interpret,ptlrpc_check_set,ptlrpcd
PIDs(1): "ptlrpcd_00_00":4889

schedule,osc_extent_wait,osc_cache_wait_range,osc_cache_writeback_range,osc_ldlm_blocking_ast,ldlm_cancel_callback,ldlm_cli_cancel_local,ldlm_cli_cancel,osc_ldlm_blocking_ast,ldlm_handle_bl_callback,ldlm_bl_thread_main
PIDs(1): "ldlm_bl_02":7759

schedule,ptlrpc_invalidate_import,ptlrpc_invalidate_import_thread
PIDs(1): "ll_imp_inval":293752

schedule,ptlrpc_invalidate_import,ptlrpc_set_import_active,osc_iocontrol,lov_iocontrol,ll_umount_begin,ksys_umount,__x64_sys_umount
PIDs(1): "umount":449648 |
| Comments |
| Comment by Patrick Farrell [ 12/Nov/21 ] |
|
Ah, I think this is the classic dirty page discard warn issue: it takes a reference to print a debug message and can end up deadlocked because of it. |
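
The reference problem described above can be sketched with a small userspace toy model. This is only an illustration, not Lustre code and not the contents of any patch on this ticket: `toy_inode`, `toy_put` and the result enum are hypothetical names. The model assumes that dropping the last reference triggers eviction, that eviction must wait for in-flight extents, and that the caller may be the very thread that would complete those extents (as ptlrpcd is in the first trace).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode with pending writeback. */
struct toy_inode {
	int refcount;          /* references currently held */
	int extents_in_flight; /* writeback extents not yet completed */
};

enum toy_put_result {
	TOY_PUT_OK,             /* other references remain */
	TOY_PUT_EVICTED,        /* last reference dropped, eviction ran */
	TOY_PUT_WOULD_DEADLOCK, /* eviction would wait on the caller itself */
};

/*
 * Drop one reference. Dropping the last one models evict(), which has to
 * wait for all in-flight extents. If the caller is the thread responsible
 * for completing those extents, that wait could never finish, so the model
 * reports the problem and leaves the reference in place.
 */
static enum toy_put_result toy_put(struct toy_inode *ino,
				   bool caller_completes_extents)
{
	assert(ino->refcount > 0);

	if (ino->refcount > 1) {
		ino->refcount--;
		return TOY_PUT_OK;
	}

	if (ino->extents_in_flight > 0 && caller_completes_extents)
		return TOY_PUT_WOULD_DEADLOCK;

	/* Assumption: any remaining extents are completed by another thread. */
	ino->refcount = 0;
	return TOY_PUT_EVICTED;
}
```

In this model the same `toy_put` call is safe once `extents_in_flight` has reached zero, which is one way to read the ordering problem: the debug reference is dropped while the extent that the same thread is finishing still counts as in flight.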
| Comment by Patrick Farrell [ 12/Nov/21 ] |
|
I think I can figure out a fix (unless you’re already working on it). |
| Comment by Alex Zhuravlev [ 12/Nov/21 ] |
|
Please go ahead, I'm busy with other stuff. |
| Comment by Gerrit Updater [ 12/Nov/21 ] |
|
"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45550 |
| Comment by Alex Zhuravlev [ 24/Nov/21 ] |
|
The invalidation path still deadlocks, but much less frequently. I guess this is a slightly different issue?

schedule,ldlm_completion_ast,ldlm_lock_match_with_skip,osc_enqueue_base,osc_lock_enqueue,cl_lock_enqueue,lov_lock_enqueue,cl_lock_enqueue,cl_lock_request,cl_io_lock,cl_io_loop,cl_setattr_ost,ll_setattr_raw,do_truncate,path_openat,do_filp_open,do_sys_open
PIDs(1): "cp":367053

schedule,osc_object_invalidate,osc_ldlm_resource_invalidate,cfs_hash_for_each_relax,cfs_hash_for_each_nolock,osc_import_event,ptlrpc_invalidate_import,ptlrpc_invalidate_import_thread
PIDs(1): "ll_imp_inval":384198

schedule,osc_object_invalidate,osc_ldlm_resource_invalidate,cfs_hash_for_each_relax,cfs_hash_for_each_nolock,osc_import_event,ptlrpc_invalidate_import,ptlrpc_set_import_active,osc_iocontrol,lov_iocontrol,ll_umount_begin,ksys_umount,__x64_sys_umount
PIDs(1): "umount":384405 |
| Comment by Patrick Farrell [ 24/Nov/21 ] |
|
Hmm... So osc_object_invalidate is probably waiting for nr_ios to be zero, and that's incremented in cl_io_iter_init (osc_io_iter_init) before the lock request is made. So I guess it's waiting for that completion AST, which is waiting for the lock to be granted or cancelled. So somehow that lock isn't getting granted or cancelled, I guess? I'm not quite sure how pending lock requests are cancelled when an import is invalidated. |
| Comment by Patrick Farrell [ 24/Nov/21 ] |
|
Are you able to dump the LDLM namespaces for that hang? |
| Comment by Patrick Farrell [ 24/Nov/21 ] |
|
So looking at osc_import_event for INVALIDATE, we call: ldlm_namespace_cleanup again.

ldlm_namespace_cleanup calls ldlm_resource_clean, then ldlm_resource_complain. ldlm_resource_complain shows that we can sometimes have locks left after ldlm_resource_clean.

So the thread doing 'cp' is trying to match an existing lock. So the lock survives the call to ldlm_resource_clean, then the osc oo_nr_ios is > 0, so we cannot invalidate the OSC object, so we do not try to clean up the lock again.

I don't know why a lock would survive ldlm_resource_clean, but that seems like the issue. Maybe we need to call ldlm_resource_clean from osc_object_invalidate if oo_nr_ios is > 0? |
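
The dependency described above (invalidation waits for the IO count to drop, the IO waits for its lock to be granted or cancelled, and nothing retries the lock cleanup) can be sketched as a single-threaded toy simulation. It is only a model under stated assumptions and not the change in any patch here: `toy_obj`, `toy_resource_clean`, `toy_io_step` and `toy_invalidate` are hypothetical names, and real Lustre code sleeps on a wait queue rather than polling in rounds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an OSC object with one IO in progress. */
struct toy_obj {
	int nr_ios;        /* IOs that have started and not yet finished */
	bool lock_pending; /* the IO's lock is neither granted nor cancelled */
};

/* Models a lock cleanup pass that cancels the pending lock. */
static void toy_resource_clean(struct toy_obj *o)
{
	o->lock_pending = false;
}

/* The IO can only finish once its lock wait has ended. */
static void toy_io_step(struct toy_obj *o)
{
	if (o->nr_ios > 0 && !o->lock_pending)
		o->nr_ios--;
}

/*
 * Wait for nr_ios to reach zero, for at most max_rounds rounds.
 * With reclean set, the lock cleanup is retried while IOs remain,
 * which is the idea floated in the comment above.
 * Returns the number of rounds used, or -1 if the wait never ends.
 */
static int toy_invalidate(struct toy_obj *o, bool reclean, int max_rounds)
{
	assert(max_rounds >= 0);

	for (int round = 0; round < max_rounds; round++) {
		if (o->nr_ios == 0)
			return round;
		if (reclean)
			toy_resource_clean(o);
		toy_io_step(o);
	}
	return o->nr_ios == 0 ? max_rounds : -1;
}
```

Starting from one IO with a pending lock, the model never finishes without the retry and finishes after one round with it. It says nothing about why a lock survives the first cleanup, which remains the open question.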
| Comment by Gerrit Updater [ 24/Nov/21 ] |
|
"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45658 |
| Comment by Patrick Farrell [ 24/Nov/21 ] |
|
There are a few fairly heroic guesses in that patch, but I think it's probably right... Alex, if you can try it in your test rig... |