Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Version/s: Lustre 2.17.0, Lustre 2.16.1, Lustre 2.15.7
Description
If a Lustre client is mounted on the Lustre metadata server, a deadlock can occur when the system comes under memory pressure.
Sample stacktrace:
__switch_to+0xbc/0xfc
__schedule+0x28c/0x718
schedule+0x4c/0xcc
schedule_timeout+0x84/0x170
ptlrpc_set_wait+0x464/0x7c8 [ptlrpc]
ptlrpc_queue_wait+0xa4/0x364 [ptlrpc]
mdc_close+0x224/0xe64 [mdc]
lmv_close+0x1a8/0x480 [lmv]
ll_close_inode_openhandle+0x404/0xcc8 [lustre]
ll_md_real_close+0xa4/0x280 [lustre]
ll_clear_inode+0x1a0/0x7e0 [lustre]
ll_delete_inode+0x70/0x260 [lustre]
evict+0xdc/0x240
iput_final+0x8c/0x1c0
iput+0x10c/0x128
dentry_unlink_inode+0xc8/0x150
__dentry_kill+0xec/0x21c
shrink_dentry_list+0xa8/0x138
prune_dcache_sb+0x64/0x94
super_cache_scan+0x128/0x1a4
do_shrink_slab+0x194/0x394
shrink_slab+0xbc/0x13c
shrink_node_memcgs+0x1d4/0x230
shrink_node+0x150/0x5e0
shrink_zones+0x98/0x220
do_try_to_free_pages+0xac/0x2e0
try_to_free_pages+0x120/0x25c
__alloc_pages_slowpath.constprop.0+0x40c/0x85c
__alloc_pages_nodemask+0x2b4/0x308
alloc_pages_current+0x8c/0x13c
allocate_slab+0x3b8/0x4cc
new_slab_objects+0x9c/0x160
___slab_alloc+0x1b0/0x300
__slab_alloc+0x50/0x80
kmem_cache_alloc+0x30c/0x32c
spl_kmem_cache_alloc+0x84/0x1ac [spl]
zfs_btree_add_idx+0x1b4/0x248 [zfs]
range_tree_add_impl+0x868/0xd94 [zfs]
range_tree_add+0x18/0x20 [zfs]
dnode_free_range+0x194/0x6c0 [zfs]
dmu_object_free+0x6c/0xc0 [zfs]
osd_destroy+0x40c/0xb70 [osd_zfs]
lod_sub_destroy+0x204/0x480 [lod]
lod_destroy+0x2d8/0x800 [lod]
mdd_close+0x250/0x1000 [mdd]
mo_close+0x18/0x60 [mdt]
mdt_hsm_release+0x534/0x16a0 [mdt]
mdt_mfd_close+0x1b4/0xe68 [mdt]
mdt_close_internal+0x104/0x3c8 [mdt]
mdt_close+0x270/0x518 [mdt]
tgt_handle_request0+0x2b4/0x658 [ptlrpc]
tgt_request_handle+0x268/0xaac [ptlrpc]
ptlrpc_server_handle_request.isra.0+0x460/0xf20 [ptlrpc]
ptlrpc_main+0xd24/0x15bc [ptlrpc]
kthread+0x118/0x120
Analysis
The deadlock begins in mdt_close() → mdt_hsm_release() → osd_destroy() → dmu_object_free(), where ZFS object destruction triggers a memory allocation (zfs_btree_add_idx() → kmem_cache_alloc()). Under memory pressure that allocation enters direct reclaim (try_to_free_pages() → shrink_dentry_list() → prune_dcache_sb()), which evicts Lustre client-side inodes (evict() → ll_clear_inode() → ll_md_real_close()) and therefore sends a synchronous close RPC back to the same server (mdc_close() → ptlrpc_queue_wait() → ptlrpc_set_wait()). The MDT service thread is now blocked, still inside the original close handling, waiting for an RPC that the same overloaded server has to process.
The resulting resource exhaustion also prevents ZFS from assigning new transaction groups in osd_trans_start() → dmu_tx_assign() → dmu_tx_wait(), so other operations hang as well.
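For illustration only, a check along the following lines (a sketch, not existing Lustre code; the helper name is hypothetical) could tell ll_clear_inode()/ll_delete_inode() that it is running inside kernel memory reclaim, which is exactly the context in which blocking on a synchronous close RPC is unsafe. PF_MEMALLOC and current_is_kswapd() are standard kernel facilities.

#include <linux/sched.h>
#include <linux/swap.h>

/* Hypothetical helper: return true when the current task is performing
 * memory reclaim.  Direct reclaim sets PF_MEMALLOC on the allocating
 * task; kswapd is detected via current_is_kswapd().  In either case the
 * Lustre eviction path should not block on a synchronous RPC. */
static inline bool ll_in_memory_reclaim(void)
{
	return (current->flags & PF_MEMALLOC) || current_is_kswapd();
}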
Related Issues
- Similar issue on a different path: https://jira.whamcloud.com/browse/LU-18246
- Previously attempted fix (not merged; it also would not fix this issue, since it only covers one specific path): https://review.whamcloud.com/c/fs/lustre-release/+/56442
- Not a deadlock, but if the same pattern is triggered from kthreadd it results in a kernel panic: https://jira.whamcloud.com/browse/LU-18826
Proposed Fix
A straightforward fix would be to wrap the paths that call into the ZFS API with spl_fstrans_mark(), but the same problem can appear in other places as well. The root cause is that Lustre allocates memory and blocks on RPCs while on the inline memory-allocation/reclaim critical path, and that behavior can cause further crashes or deadlocks (LU-18826, for example).
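A minimal sketch of that narrow fix, assuming the osd-zfs call site can simply be wrapped before entering the DMU (the wrapper function and its placement are illustrative, not the actual patch): spl_fstrans_mark() maps to memalloc_nofs_save() in current OpenZFS, so allocations made by ZFS below this point lose __GFP_FS and super_cache_scan() declines to prune the dcache from this thread.

#include <sys/kmem.h>	/* SPL header providing spl_fstrans_mark()/spl_fstrans_unmark() */
#include <sys/dmu.h>	/* dmu_object_free(), objset_t, dmu_tx_t */

/* Illustrative wrapper: destroy a ZFS object with filesystem re-entry
 * disabled for any allocation made inside the DMU call. */
static int osd_destroy_object_nofs(objset_t *os, uint64_t oid, dmu_tx_t *tx)
{
	fstrans_cookie_t cookie;
	int rc;

	cookie = spl_fstrans_mark();	/* effectively memalloc_nofs_save() */
	rc = dmu_object_free(os, oid, tx);
	spl_fstrans_unmark(cookie);	/* restore the previous allocation flags */

	return rc;
}

The limitation, as noted above, is that this only protects the specific call site that gets wrapped.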
A more general fix idea: when Lustre's evict_inode operation (ll_delete_inode()) runs from the inline memory-reclaim path, have it return and free its local resources immediately instead of waiting for the RPC; the required close RPC to the MDS can be scheduled and sent asynchronously via ptlrpcd.
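A rough sketch of that direction, not an actual patch: the close request would be prepared as today, but when the caller is in reclaim context it would be handed to ptlrpcd instead of blocking in ptlrpc_queue_wait(). ptlrpcd_add_req() and rq_interpret_reply are existing ptlrpc facilities; the function names below and their placement in the mdc close path are hypothetical, and the sketch assumes local resources are released before the request is queued (fire-and-forget semantics).

/* Reply handler executed in ptlrpcd context.  Local client resources were
 * already released by the caller, so only unexpected failures are logged. */
static int mdc_close_async_interpret(const struct lu_env *env,
				     struct ptlrpc_request *req,
				     void *args, int rc)
{
	if (rc != 0 && rc != -ESTALE)
		CERROR("%s: asynchronous close failed: rc = %d\n",
		       req->rq_import->imp_obd->obd_name, rc);
	return rc;
}

/* Queue an already-packed MDS_CLOSE request without waiting for the reply;
 * a ptlrpcd thread sends it and later runs the interpret callback. */
static void mdc_close_send_async(struct ptlrpc_request *req)
{
	req->rq_interpret_reply = mdc_close_async_interpret;
	ptlrpcd_add_req(req);
}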