Details

Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.7.0
Labels: None
Severity: 3
Description
The application is flagged as hung because its tasks have not exited 35 minutes after the job manager marked the application as completed.
Server:

Feb 17 04:13:43 snx11242n004 kernel: LustreError: 0:0:(ldlm_lockd.c:343:waiting_locks_callback()) ### lock callback timer expired after 260s: evicting client at 748@gni ns: filter-snx11242-OST0000_UUID lock: ffff880bf596fb00/0xf1d1c2b2077a5cbd lrc: 3/0,0 mode: PW/PW res: [0xf44037d:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x60000000020020 nid: 748@gni remote: 0xfeb3b0aabd5c0e98 cl: 4a576988-5c50-316d-4ee1-51202af4957d expref: 155 pid: 30836 timeout: 12436906645 lvb_type: 0
Feb 17 04:13:43 snx11242n004 kernel: LustreError: 11235:0:(ldlm_extent.c:733:ldlm_resource_prolong()) Failed to get resource for resid 256141005/0
Feb 17 04:13:43 snx11242n004 kernel: Lustre: snx11242-OST0000: Connection restored to 5ba334ef-089c-ccf0-265f-b8328dafb033 (at 748@gni)
Feb 17 04:13:43 snx11242n004 kernel: Lustre: Skipped 785 previous similar messages
Feb 17 04:13:44 snx11242n004 kernel: LustreError: 11330:0:(ldlm_extent.c:733:ldlm_resource_prolong()) Failed to get resource for resid 256136005/0
Feb 17 04:13:44 snx11242n004 kernel: LustreError: 11330:0:(ldlm_extent.c:733:ldlm_resource_prolong()) Skipped 7 previous similar messages
Feb 17 04:13:48 snx11242n004 kernel: LustreError: 11298:0:(ldlm_extent.c:733:ldlm_resource_prolong()) Failed to get resource for resid 256122474/0
Feb 17 04:25:00 snx11242n004 kernel: Lustre: snx11242-OST0000: haven't heard from client 4a576988-5c50-316d-4ee1-51202af4957d (at 748@gni) in 677 seconds. I think it's dead, and I am evicting it. exp ffff88029ca05c00, cur 1518863100 expire 1518862650 last 1518862423

Client crash dump:

obd_device         obd_name         ip_address    client_obd         obd_import         imp_state  ish_time    index  conn_cnt
=========================================================================================================================
0xffff880f8e498188 snx11242-OST0004 LNET_NID_ANY  0xffff880f8e498668 0xffff880f814c3000 EVICTED    1518864015  4      2
0xffff880f8e8801c8 snx11242-OST0000 LNET_NID_ANY  0xffff880f8e8806a8 0xffff880f823fb800 EVICTED    1518862423  4      2

Sent RPCS: ptlrpc_request_set.set_requests->rq_set_chain
thread          pid   ptlrpc_request     xid               nid               opc  phase:flags  R:W  sent/deadline          ptlrpc_body
==================================================================================================================================================
ptlrpcd_01_24:  5946  0xffff881f98f96640 1592609764500656  snx11242-OST0004  4    RPC:RE       0:1  1518862418/1518862513  0xffff881f9f4bd638
ptlrpcd_01_24:  5946  0xffff881f9cbd0380 1592609764494945  snx11242-OST0000  4    NtrPrt:Ee    0:1  1518862417/1518862512  0xffff881f9f283838
==================================================================================================================================================

crash_x86_64> bt 5946
PID: 5946  TASK: ffff880f8a888100  CPU: 35  COMMAND: "ptlrpcd_01_24"
 #0 [ffff880f8e0734e8] schedule at ffffffff8151c7d7
 #1 [ffff880f8e073540] osc_extent_wait at ffffffffa0a1b6a0 [osc]
 #2 [ffff880f8e073660] osc_cache_wait_range at ffffffffa0a21f4c [osc]
 #3 [ffff880f8e073750] osc_io_fsync_end at ffffffffa0a0e7c6 [osc]
 #4 [ffff880f8e073788] cl_io_end at ffffffffa05c4b65 [obdclass]
 #5 [ffff880f8e0737b0] lov_io_end_wrapper at ffffffffa0837314 [lov]
 #6 [ffff880f8e0737d0] lov_io_fsync_end at ffffffffa083766e [lov]
 #7 [ffff880f8e073800] cl_io_end at ffffffffa05c4b65 [obdclass]
 #8 [ffff880f8e073828] cl_io_loop at ffffffffa05c817c [obdclass]
 #9 [ffff880f8e073858] cl_sync_file_range at ffffffffa08ecb75 [lustre]
#10 [ffff880f8e0738b0] ll_delete_inode at ffffffffa09006af [lustre]
#11 [ffff880f8e0738d8] evict at ffffffff811d0807
#12 [ffff880f8e073900] iput at ffffffff811d1056
#13 [ffff880f8e073930] __dentry_kill at ffffffff811cc605
#14 [ffff880f8e073958] dput at ffffffff811cc833
#15 [ffff880f8e073980] ll_dirty_page_discard_warn at ffffffffa090f033 [lustre]
#16 [ffff880f8e0739f8] vvp_page_completion_write at ffffffffa0940f49 [lustre]
#17 [ffff880f8e073a30] cl_page_completion at ffffffffa05c0a33 [obdclass]
#18 [ffff880f8e073a78] osc_ap_completion at ffffffffa0a136ec [osc]
#19 [ffff880f8e073ac0] osc_extent_finish at ffffffffa0a18db1 [osc]
#20 [ffff880f8e073bc0] brw_interpret at ffffffffa09ff29b [osc]
#21 [ffff880f8e073cb8] ptlrpc_check_set at ffffffffa0700233 [ptlrpc]
#22 [ffff880f8e073d80] ptlrpcd_check at ffffffffa072cf1a [ptlrpc]
#23 [ffff880f8e073dc8] ptlrpcd at ffffffffa072d3a5 [ptlrpc]
#24 [ffff880f8e073ec8] kthread at ffffffff81078646
#25 [ffff880f8e073f50] ret_from_fork at ffffffff81520e8f
What's going on:
The client is evicted by OST0000 while it has an outstanding bulk write RPC in progress. When the ptlrpcd daemon handles the bulk write reply, it sees the EIO (-5) error and issues the warning that dirty pages are being discarded (ll_dirty_page_discard_warn()). Lustre attempts to include the file name in the warning, but at the point the warning is issued the ptlrpcd thread only has access to the page and its inode. It must therefore do a path lookup, taking references on the dentry and inode in the process. While it holds these references, all other references are dropped, so when the ptlrpcd thread releases its own it becomes responsible for deleting the inode and flushing the inode's dirty pages. But those dirty pages are exactly the ones being written by the failed bulk RPC, so the ptlrpcd daemon deadlocks itself (see the sketch below).
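The shape of the trap is roughly the following. This is a simplified reconstruction from the backtrace, not verbatim Lustre source: the real code prints the file's FID rather than the inode number, and the function name suffix and message format here are illustrative.

#include <linux/dcache.h>
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/printk.h>

/* Simplified sketch of the warning path; not verbatim Lustre source. */
static void dirty_page_discard_warn_sketch(struct page *page, int ioret)
{
	struct inode *inode = page->mapping->host;
	struct dentry *dentry = NULL;
	char *buf, *path = NULL;

	/* The warning can fire in atomic context, hence GFP_ATOMIC. */
	buf = (char *)__get_free_page(GFP_ATOMIC);
	if (buf) {
		/* d_find_alias() takes a reference on the dentry
		 * (and thereby pins the inode) just to format a name. */
		dentry = d_find_alias(inode);
		if (dentry)
			path = dentry_path_raw(dentry, buf, PAGE_SIZE);
	}

	/* The real message prints the Lustre FID; the inode number is
	 * used here only to keep the sketch self-contained. */
	pr_warn("dirty page discard: ino %lu/%s may get corrupted (rc %d)\n",
		inode->i_ino,
		(path && !IS_ERR(path)) ? path : "", ioret);

	/* The trap: if every other reference was dropped in the
	 * meantime, this dput() is the last one, and it cascades
	 * through __dentry_kill() -> iput() -> evict() ->
	 * ll_delete_inode() -> cl_sync_file_range(), which parks in
	 * osc_extent_wait() on the very extents this ptlrpcd thread
	 * is completing (frames #0-#15 in the backtrace above). */
	if (dentry)
		dput(dentry);
	if (buf)
		free_page((unsigned long)buf);
}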
The cleanest fix is to remove the file name from the warning message. The message still contains the FID, so an administrator can identify the file. Furthermore, in most real-life occurrences of this warning (even the ones that deadlock), the file name is never successfully retrieved anyway.
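Under that approach the warning path reduces to something like the sketch below (again hypothetical in its details; as above, the real message would print the FID rather than the inode number):

/* With the path lookup gone there is no d_find_alias()/dput() pair,
 * so the ptlrpcd thread can never end up holding, and then dropping,
 * the last reference to the inode. */
static void dirty_page_discard_warn_fixed(struct page *page, int ioret)
{
	struct inode *inode = page->mapping->host;

	pr_warn("dirty page discard: ino %lu may get corrupted (rc %d)\n",
		inode->i_ino, ioret);
}

An administrator can still map the FID from such a message back to a path offline with lfs fid2path, without the client taking any dentry or inode references at warning time.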
See LU-12915 for a different approach to reporting the names of files that have had dirty pages discarded.