  Lustre / LU-19292

Deadlock due to ZFS inline memory allocation


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.17.0, Lustre 2.16.1, Lustre 2.15.7
    • Severity: 3

    Description

      If there is a Lustre client mounted on the Lustre metadata server, a deadlock can occur when the system is under memory pressure.

      Sample stacktrace:

      __switch_to+0xbc/0xfc
      __schedule+0x28c/0x718
      schedule+0x4c/0xcc
      schedule_timeout+0x84/0x170
      ptlrpc_set_wait+0x464/0x7c8 [ptlrpc]
      ptlrpc_queue_wait+0xa4/0x364 [ptlrpc]
      mdc_close+0x224/0xe64 [mdc]
      lmv_close+0x1a8/0x480 [lmv]
      ll_close_inode_openhandle+0x404/0xcc8 [lustre]
      ll_md_real_close+0xa4/0x280 [lustre]
      ll_clear_inode+0x1a0/0x7e0 [lustre]
      ll_delete_inode+0x70/0x260 [lustre]
      evict+0xdc/0x240
      iput_final+0x8c/0x1c0
      iput+0x10c/0x128
      dentry_unlink_inode+0xc8/0x150
      __dentry_kill+0xec/0x21c
      shrink_dentry_list+0xa8/0x138
      prune_dcache_sb+0x64/0x94
      super_cache_scan+0x128/0x1a4
      do_shrink_slab+0x194/0x394
      shrink_slab+0xbc/0x13c
      shrink_node_memcgs+0x1d4/0x230
      shrink_node+0x150/0x5e0
      shrink_zones+0x98/0x220
      do_try_to_free_pages+0xac/0x2e0
      try_to_free_pages+0x120/0x25c
      __alloc_pages_slowpath.constprop.0+0x40c/0x85c
      __alloc_pages_nodemask+0x2b4/0x308
      alloc_pages_current+0x8c/0x13c
      allocate_slab+0x3b8/0x4cc
      new_slab_objects+0x9c/0x160
      ___slab_alloc+0x1b0/0x300
      __slab_alloc+0x50/0x80
      kmem_cache_alloc+0x30c/0x32c
      spl_kmem_cache_alloc+0x84/0x1ac [spl]
      zfs_btree_add_idx+0x1b4/0x248 [zfs]
      range_tree_add_impl+0x868/0xd94 [zfs]
      range_tree_add+0x18/0x20 [zfs]
      dnode_free_range+0x194/0x6c0 [zfs]
      dmu_object_free+0x6c/0xc0 [zfs]
      osd_destroy+0x40c/0xb70 [osd_zfs]
      lod_sub_destroy+0x204/0x480 [lod]
      lod_destroy+0x2d8/0x800 [lod]
      mdd_close+0x250/0x1000 [mdd]
      mo_close+0x18/0x60 [mdt]
      mdt_hsm_release+0x534/0x16a0 [mdt]
      mdt_mfd_close+0x1b4/0xe68 [mdt]
      mdt_close_internal+0x104/0x3c8 [mdt]
      mdt_close+0x270/0x518 [mdt]
      tgt_handle_request0+0x2b4/0x658 [ptlrpc]
      tgt_request_handle+0x268/0xaac [ptlrpc]
      ptlrpc_server_handle_request.isra.0+0x460/0xf20 [ptlrpc]
      ptlrpc_main+0xd24/0x15bc [ptlrpc]
      kthread+0x118/0x120
      

      Analysis

      The deadlock begins in mdt_close() → mdt_hsm_release() → osd_destroy() → dmu_object_free(), where ZFS object destruction triggers a memory allocation (zfs_btree_add_idx() → kmem_cache_alloc()). Under memory pressure that allocation falls into direct reclaim (try_to_free_pages() → shrink_dentry_list() → prune_dcache_sb()), which evicts Lustre client-side inodes (evict() → ll_clear_inode() → ll_md_real_close()) and sends a synchronous close RPC back to the same server (mdc_close() → ptlrpc_queue_wait() → ptlrpc_set_wait()). The MDT service thread is thus blocked inside its own memory allocation, waiting for an RPC that the same, already stressed server has to process.

       

      The resulting resource exhaustion prevents ZFS from assigning new transaction groups in osd_trans_start() → dmu_tx_assign() → dmu_tx_wait(), causing other operations to hang.
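      The re-entry described above is only possible because the allocating task still permits filesystem reclaim. The sketch below is a hedged paraphrase of the check at the top of the kernel's sb shrinker (super_cache_scan()), not the verbatim kernel source; it shows why a GFP_NOFS allocation, or a task under memalloc_nofs_save()/spl_fstrans_mark(), would not recurse into Lustre:

      /* Simplified paraphrase of the kernel sb shrinker entry check. */
      static unsigned long sb_shrinker_scan_sketch(struct shrinker *shrink,
                                                   struct shrink_control *sc)
      {
              /*
               * GFP_KERNEL allocations carry __GFP_FS, so direct reclaim may
               * run the dcache/icache shrinker and evict Lustre inodes
               * (evict() -> ll_clear_inode() -> mdc_close() RPC).  A GFP_NOFS
               * allocation, or one made while the task holds
               * memalloc_nofs_save() / spl_fstrans_mark(), clears __GFP_FS and
               * the shrinker backs off here.
               */
              if (!(sc->gfp_mask & __GFP_FS))
                      return SHRINK_STOP;

              /* ... prune_dcache_sb() / prune_icache_sb() ... */
              return 0;
      }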

      Related Issues

      1. Similar issue but on a different path: https://jira.whamcloud.com/browse/LU-18246
        Previously attempted fix (not merged; it also would not address this issue, since it only covers one specific path): https://review.whamcloud.com/c/fs/lustre-release/+/56442
      2. Not a deadlock, but when the same pattern is triggered from kthreadd it results in a kernel panic: https://jira.whamcloud.com/browse/LU-18826

       

      Proposed Fix

      While a simple fix could cover the paths that call into the ZFS API by wrapping them with spl_fstrans_mark(), the same issue can occur in other places as well. The root cause is that Lustre allocates memory and then blocks (on an RPC) while it is on the inline memory allocation critical path, i.e. inside direct reclaim, and this pattern can cause further crashes or deadlocks (LU-18826, for example).
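      As a rough illustration of that narrow fix, the sketch below brackets a call into the ZFS API with the SPL helpers spl_fstrans_mark()/spl_fstrans_unmark(); the wrapper name and its callback argument are hypothetical, only the SPL helpers are real, and this is not the actual osd-zfs code:

      /*
       * Hedged sketch: while the cookie is held, allocations made under the
       * ZFS call behave as GFP_NOFS, so direct reclaim cannot re-enter the
       * filesystem shrinkers.
       */
      static int osd_zfs_call_nofs(int (*zfs_op)(void *arg), void *arg)
      {
              fstrans_cookie_t cookie;
              int rc;

              cookie = spl_fstrans_mark();   /* enter the NOFS reclaim context */
              rc = zfs_op(arg);              /* e.g. the dmu_object_free() path */
              spl_fstrans_unmark(cookie);    /* restore the previous context */

              return rc;
      }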

      A more general fix idea: when Lustre's evict_inode operation (ll_delete_inode) is executed from the system's inline memory allocation path, have it release its resources and return without waiting for the RPC. The required close RPC to the MDS can instead be scheduled and sent asynchronously via ptlrpcd.
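      A minimal sketch of that idea follows, assuming the close request has already been packed; the helper name is hypothetical, and the exact ptlrpcd_add_req()/ptlrpc_queue_wait() signatures vary between Lustre branches:

      /*
       * Hedged sketch of the asynchronous-close idea; mdc_close_maybe_async()
       * is a hypothetical helper, not existing Lustre code.
       */
      static int mdc_close_maybe_async(struct ptlrpc_request *req)
      {
              /*
               * PF_MEMALLOC is set while the task is inside direct reclaim,
               * i.e. the eviction/close is being driven by a memory
               * allocation on this same node.
               */
              if (current->flags & PF_MEMALLOC) {
                      /*
                       * Hand the request to a ptlrpcd thread and return
                       * immediately; the reply is handled by the request's
                       * interpret callback instead of a blocked caller.
                       */
                      ptlrpcd_add_req(req);
                      return 0;
              }

              /* Normal path: synchronous close, wait for the MDS reply. */
              return ptlrpc_queue_wait(req);
      }

      In a real change the caller would also need to transfer ownership of the request and handle the reply (or its failure) in the interpret callback rather than inline.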


          People

            Assignee: Lijing Chen (lijinc)
            Reporter: Lijing Chen (lijinc)
            Votes: 0
            Watchers: 2
