  1. Lustre
  2. LU-20299

import hangups/evictions on new kernels

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium
    • Lustre 2.18.0

      We received reports that clients running recent Ubuntu kernels with 64K pages can get stuck, unable to reconnect once the server becomes available again.

      We see competing tasks. Userland code waits for a page read to complete:

       #2 [ffff80014c1af1d0] schedule at ffffa170373e001c
       #3 [ffff80014c1af230] cl_sync_io_wait at ffffa16fee2f399c [obdclass]
       #4 [ffff80014c1af3a0] ll_io_read_page at ffffa16feeb4d40c [lustre]
       #5 [ffff80014c1af4d0] ll_readpage at ffffa16feeb4fd1c [lustre]
       #6 [ffff80014c1af530] ll_read_folio at ffffa16feeb50820 [lustre]
       #7 [ffff80014c1af550] filemap_read_folio at ffffa17036121410
       #8 [ffff80014c1af580] filemap_update_page at ffffa1703612183c
       #9 [ffff80014c1af640] filemap_get_pages at ffffa17036122550
      #10 [ffff80014c1af770] filemap_read at ffffa17036122818
      #11 [ffff80014c1af7d0] generic_file_read_iter at ffffa17036123e5c
      #12 [ffff80014c1af880] vvp_io_read_start at ffffa16feeb8e210 [lustre]
      #13 [ffff80014c1af8e0] cl_io_start at ffffa16fee2f00bc [obdclass]
      #14 [ffff80014c1af930] cl_io_loop at ffffa16fee2f5880 [obdclass]
      #15 [ffff80014c1afa60] ll_file_io_generic at ffffa16feeb1295c [lustre]
      #16 [ffff80014c1afb20] ll_file_read_iter at ffffa16feeb148bc [lustre]
      #17 [ffff80014c1afbe0] vfs_read at ffffa1703626e6e4
      

      Meanwhile, ldlm_bl threads are trying to discard pages from the same inode while cancelling a lock:

       #2 [ffff80012870f900] schedule at ffffa170373e001c
       #3 [ffff80012870f930] schedule_preempt_disabled at ffffa170373e0ae4
       #4 [ffff80012870f9c0] rwsem_down_write_slowpath at ffffa170373e5188
       #5 [ffff80012870fa20] down_write at ffffa170373e5740
       #6 [ffff80012870fa50] vvp_io_init at ffffa16feeb90bc4 [lustre]
       #7 [ffff80012870fac0] cl_io_init0 at ffffa16fee2ef5d8 [obdclass]
       #8 [ffff80012870fb00] cl_io_init at ffffa16fee2ef864 [obdclass]
       #9 [ffff80012870fb40] osc_lock_discard_pages at ffffa16fee8c4a00 [osc]
      #10 [ffff80012870fc00] osc_ldlm_blocking_ast at ffffa16fee89a1a4 [osc]
      #11 [ffff80012870fc90] ldlm_cancel_callback at ffffa16fee43b098 [ptlrpc]
      #12 [ffff80012870fcc0] ldlm_cli_cancel_local at ffffa16fee44fd9c [ptlrpc]
      #13 [ffff80012870fd50] ldlm_cli_cancel_list_local at ffffa16fee4548e0 [ptlrpc]
      #14 [ffff80012870fe10] ldlm_bl_thread_main at ffffa16fee45fb24 [ptlrpc] 

      This seems related to LU-16651, which added support for the VFS address-space invalidate lock.

      Further analysis suggests that this patch holds the invalidate lock for too long on the discard path: it is held across the entire cl_io and potentially spans multiple RPC boundaries, which is prone to adverse side effects.
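
      To illustrate why the lock scope matters, here is a toy Python model. It is not Lustre code. It assumes a single exclusive lock standing in for the invalidate lock and a hypothetical list of I/O stages (`STEPS`); the real cl_io stages and the kernel rwsem semantics differ. It records the stages at which a second task would fail to take the lock, first with the lock spanning the whole I/O and then with the lock held only around the page discard.

```python
import threading

# Hypothetical stage names for illustration only; the real cl_io stages differ.
STEPS = ["init", "rpc_wait_1", "discard_pages", "rpc_wait_2", "fini"]


def blocked_steps(hold_during):
    """Return the stages at which a contender fails to take the lock.

    hold_during: set of stage names during which the discard path holds
    the lock.
    """
    # Stand-in for the invalidate lock, modelled in exclusive mode only.
    lock = threading.Lock()
    blocked = []
    for step in STEPS:
        held = step in hold_during
        if held:
            lock.acquire()
        # A contender (for example a reader) makes a non-blocking attempt.
        if lock.acquire(blocking=False):
            lock.release()
        else:
            blocked.append(step)
        if held:
            lock.release()
    return blocked


wide = blocked_steps(set(STEPS))           # lock spans the whole I/O
narrow = blocked_steps({"discard_pages"})  # lock only around the discard
print("wide scope blocks contender at:", wide)
print("narrow scope blocks contender at:", narrow)
```

      In the wide case the contender is locked out at every stage, including the RPC waits; in the narrow case it is locked out only during the discard itself. Whether narrowing the scope is safe in Lustre is a separate question that this sketch does not answer.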

            wc-triage WC Triage
            green Oleg Drokin