Lustre / LU-15340

client stuck unable to complete eviction with "still on delayed list" messages printed



    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version: Lustre 2.15.0
    • Fix Version: Lustre 2.15.0
    • Labels: None
    • Severity: 3


      Sometimes a client being evicted gets stuck and is unable to complete the eviction. The symptoms include the ll_imp_inval thread printing "still on delayed list" for one or more stuck RPCs, and an import showing in the EVICTED state in the device list.

      The problem appears to be a deadlock between a request on the delayed list that was signalled to complete and the ptlrpc thread that is supposed to handle it, which is itself stuck waiting for the request to finish, like this:

      PID: 1931   TASK: ffff8800d60a9110  CPU: 3   COMMAND: "ptlrpcd_01_01"
       #0 [ffff8800c8cfb3e8] __schedule at ffffffff817e3e22
       #1 [ffff8800c8cfb450] schedule at ffffffff817e4339
       #2 [ffff8800c8cfb460] osc_extent_wait at ffffffffa086a0cd [osc]
       #3 [ffff8800c8cfb590] osc_cache_wait_range at ffffffffa086c5ad [osc]
       #4 [ffff8800c8cfb688] osc_cache_writeback_range at ffffffffa086d56e [osc]
       #5 [ffff8800c8cfb7d0] osc_io_fsync_start at ffffffffa085b735 [osc]
       #6 [ffff8800c8cfb810] cl_io_start at ffffffffa0325a8d [obdclass]
       #7 [ffff8800c8cfb840] lov_io_call at ffffffffa08ca9f5 [lov]
       #8 [ffff8800c8cfb878] lov_io_start at ffffffffa08cabc6 [lov]
       #9 [ffff8800c8cfb898] cl_io_start at ffffffffa0325a8d [obdclass]
      #10 [ffff8800c8cfb8c8] cl_io_loop at ffffffffa032803f [obdclass]
      #11 [ffff8800c8cfb900] cl_sync_file_range at ffffffffa0e0b7eb [lustre]
      #12 [ffff8800c8cfb958] ll_delete_inode at ffffffffa0e2686c [lustre]
      #13 [ffff8800c8cfb970] evict at ffffffff81263a8f
      #14 [ffff8800c8cfb998] iput at ffffffff81263ec5
      #15 [ffff8800c8cfb9c8] __dentry_kill at ffffffff8125efc8
      #16 [ffff8800c8cfb9f0] dput at ffffffff8125f78a
      #17 [ffff8800c8cfba20] ll_dirty_page_discard_warn at ffffffffa0e2c205 [lustre]
      #18 [ffff8800c8cfba90] vvp_page_completion_write at ffffffffa0e5a214 [lustre]
      #19 [ffff8800c8cfbac0] cl_page_completion at ffffffffa03205e8 [obdclass]
      #20 [ffff8800c8cfbb18] osc_ap_completion at ffffffffa08609b9 [osc]
      #21 [ffff8800c8cfbb60] osc_extent_finish at ffffffffa0867792 [osc]
      #22 [ffff8800c8cfbc60] brw_interpret at ffffffffa0849ee9 [osc]
      #23 [ffff8800c8cfbcd8] ptlrpc_check_set at ffffffffa05fe4da [ptlrpc]
      #24 [ffff8800c8cfbd90] ptlrpcd at ffffffffa062f014 [ptlrpc]
      #25 [ffff8800c8cfbea8] kthread at ffffffff810ba114
      #26 [ffff8800c8cfbf50] ret_from_fork_nospec_begin at ffffffff817f1e5d 

      Of course, blocking in a ptlrpcd thread is a big no-no precisely because of this sort of deadlock.

      It sounds like we need to kick the ll_dirty_page_discard_warn() call out of vvp_vmpage_error() into a separate thread, to ensure we are not blocking brw_interpret(), from which this is normally called. Or perhaps just move the dput() in there?





              Assignee: Oleg Drokin (green)
              Reporter: Oleg Drokin (green)
              Votes: 0
              Watchers: 4