[LU-12522] Deadlock: ptlrpcd daemon blocked in osc_extent_wait Created: 08/Jul/19  Updated: 12/Nov/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Ann Koehler (Inactive) Assignee: Ann Koehler (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-15127 import invalidation vs writeback dead... Open
Related
is related to LU-12915 Add a debugfs variable listing fids w... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The application is declared hung because its tasks have not exited 35 minutes after the job manager marked the application as completed.

Server:
Feb 17 04:13:43 snx11242n004 kernel: LustreError: 0:0:(ldlm_lockd.c:343:waiting_locks_callback()) ### lock callback timer expired after 260s: evicting client at 748@gni  ns: filter-snx11242-OST0000_UUID lock: ffff880bf596fb00/0xf1d1c2b2077a5cbd lrc: 3/0,0 mode: PW/PW res: [0xf44037d:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->4194303) flags: 0x60000000020020 nid: 748@gni remote: 0xfeb3b0aabd5c0e98 cl: 4a576988-5c50-316d-4ee1-51202af4957d expref: 155 pid: 30836 timeout: 12436906645 lvb_type: 0
Feb 17 04:13:43 snx11242n004 kernel: LustreError: 11235:0:(ldlm_extent.c:733:ldlm_resource_prolong()) Failed to get resource for resid 256141005/0
Feb 17 04:13:43 snx11242n004 kernel: Lustre: snx11242-OST0000: Connection restored to 5ba334ef-089c-ccf0-265f-b8328dafb033 (at 748@gni)
Feb 17 04:13:43 snx11242n004 kernel: Lustre: Skipped 785 previous similar messages
Feb 17 04:13:44 snx11242n004 kernel: LustreError: 11330:0:(ldlm_extent.c:733:ldlm_resource_prolong()) Failed to get resource for resid 256136005/0
Feb 17 04:13:44 snx11242n004 kernel: LustreError: 11330:0:(ldlm_extent.c:733:ldlm_resource_prolong()) Skipped 7 previous similar messages
Feb 17 04:13:48 snx11242n004 kernel: LustreError: 11298:0:(ldlm_extent.c:733:ldlm_resource_prolong()) Failed to get resource for resid 256122474/0
Feb 17 04:25:00 snx11242n004 kernel: Lustre: snx11242-OST0000: haven't heard from client 4a576988-5c50-316d-4ee1-51202af4957d (at 748@gni) in 677 seconds. I think it's dead, and I am evicting it. exp ffff88029ca05c00, cur 1518863100 expire 1518862650 last 1518862423 

obd_device          obd_name          ip_address     client_obd          obd_import          imp_state  ish_time    index  conn_cnt
====================================================================================================================================
0xffff880f8e498188  snx11242-OST0004  LNET_NID_ANY   0xffff880f8e498668  0xffff880f814c3000  EVICTED    1518864015  4      2
0xffff880f8e8801c8  snx11242-OST0000  LNET_NID_ANY   0xffff880f8e8806a8  0xffff880f823fb800  EVICTED    1518862423  4      2

Sent RPCS: ptlrpc_request_set.set_requests->rq_set_chain

thread         pid    ptlrpc_request      xid                nid                 opc  phase:flags    R:W  sent/deadline          ptlrpc_body
==================================================================================================================================================
ptlrpcd_01_24: 5946   0xffff881f98f96640  1592609764500656   snx11242-OST0004    4    RPC:RE         0:1  1518862418/1518862513  0xffff881f9f4bd638
ptlrpcd_01_24: 5946   0xffff881f9cbd0380  1592609764494945   snx11242-OST0000    4    NtrPrt:Ee      0:1  1518862417/1518862512  0xffff881f9f283838
==================================================================================================================================================

crash_x86_64> bt 5946
PID: 5946   TASK: ffff880f8a888100  CPU: 35  COMMAND: "ptlrpcd_01_24"
 #0 [ffff880f8e0734e8] schedule at ffffffff8151c7d7
 #1 [ffff880f8e073540] osc_extent_wait at ffffffffa0a1b6a0 [osc]
 #2 [ffff880f8e073660] osc_cache_wait_range at ffffffffa0a21f4c [osc]
 #3 [ffff880f8e073750] osc_io_fsync_end at ffffffffa0a0e7c6 [osc]
 #4 [ffff880f8e073788] cl_io_end at ffffffffa05c4b65 [obdclass]
 #5 [ffff880f8e0737b0] lov_io_end_wrapper at ffffffffa0837314 [lov]
 #6 [ffff880f8e0737d0] lov_io_fsync_end at ffffffffa083766e [lov]
 #7 [ffff880f8e073800] cl_io_end at ffffffffa05c4b65 [obdclass]
 #8 [ffff880f8e073828] cl_io_loop at ffffffffa05c817c [obdclass]
 #9 [ffff880f8e073858] cl_sync_file_range at ffffffffa08ecb75 [lustre]
#10 [ffff880f8e0738b0] ll_delete_inode at ffffffffa09006af [lustre]
#11 [ffff880f8e0738d8] evict at ffffffff811d0807
#12 [ffff880f8e073900] iput at ffffffff811d1056
#13 [ffff880f8e073930] __dentry_kill at ffffffff811cc605
#14 [ffff880f8e073958] dput at ffffffff811cc833
#15 [ffff880f8e073980] ll_dirty_page_discard_warn at ffffffffa090f033 [lustre]
#16 [ffff880f8e0739f8] vvp_page_completion_write at ffffffffa0940f49 [lustre]
#17 [ffff880f8e073a30] cl_page_completion at ffffffffa05c0a33 [obdclass]
#18 [ffff880f8e073a78] osc_ap_completion at ffffffffa0a136ec [osc]
#19 [ffff880f8e073ac0] osc_extent_finish at ffffffffa0a18db1 [osc]
#20 [ffff880f8e073bc0] brw_interpret at ffffffffa09ff29b [osc]
#21 [ffff880f8e073cb8] ptlrpc_check_set at ffffffffa0700233 [ptlrpc]
#22 [ffff880f8e073d80] ptlrpcd_check at ffffffffa072cf1a [ptlrpc]
#23 [ffff880f8e073dc8] ptlrpcd at ffffffffa072d3a5 [ptlrpc]
#24 [ffff880f8e073ec8] kthread at ffffffff81078646
#25 [ffff880f8e073f50] ret_from_fork at ffffffff81520e8f

What's going on:

The client is evicted by OST0000 while it has an outstanding bulk write RPC in progress. When the ptlrpcd daemon handles the bulk write reply, it sees the EIO (-5) error and issues the warning that dirty pages are being discarded (ll_dirty_page_discard_warn()). Lustre attempts to include the file name in the warning, but at the point the warning is issued the ptlrpcd thread only has access to the page and its inode, so it must do a path lookup, taking references on the dentry and inode in the process. All other references have already been dropped, so when the ptlrpcd thread releases its references it becomes responsible for deleting the inode and flushing the inode's dirty pages. But those dirty pages are the very ones being written by the failed bulk RPC, so the ptlrpcd daemon deadlocks on itself.
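
To make the cycle concrete, here is a minimal userspace C model of the reference-count trap. The names (inode_model, discard_warn_model, etc.) are hypothetical stand-ins for the kernel functions named in the backtrace; this is an illustration, not Lustre code:

#include <stdio.h>
#include <stdbool.h>

/* refcount stands in for the inode reference count; extent_in_flight
 * marks dirty pages that belong to the bulk write RPC currently being
 * completed by this very thread. */
struct inode_model {
        int  refcount;
        bool extent_in_flight;
};

/* Models ll_delete_inode() -> cl_sync_file_range() -> osc_extent_wait():
 * evicting the inode must flush dirty pages, i.e. wait for the extent. */
static void evict_model(struct inode_model *i)
{
        if (i->extent_in_flight)
                printf("DEADLOCK: waiting on the extent this thread is completing\n");
        else
                printf("inode evicted cleanly\n");
}

/* Models iput(): whoever drops the last reference performs the eviction. */
static void iput_model(struct inode_model *i)
{
        if (--i->refcount == 0)
                evict_model(i);
}

/* Models ll_dirty_page_discard_warn(): the path lookup takes dentry and
 * inode references that must be put again before returning. */
static void discard_warn_model(struct inode_model *i)
{
        i->refcount++;
        printf("warning: dirty pages discarded for <path>\n");
        iput_model(i);          /* this put may be the final one */
}

int main(void)
{
        /* All other references already dropped; one extent in flight. */
        struct inode_model inode = { .refcount = 0, .extent_in_flight = true };
        discard_warn_model(&inode);     /* runs in brw_interpret() context */
        return 0;
}

Run as written, this prints the DEADLOCK line: the warning's transient reference turns out to be the last one, and the eviction it triggers must wait on the extent the same thread is in the middle of completing. In the kernel the thread simply blocks forever in osc_extent_wait(), as in the backtrace above.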

The cleanest fix is to remove the file name from the warning message. The message contains the FID, so an administrator can still identify the file. Furthermore, in most real life cases of this warning (even those that deadlock), the file name is never successfully retrieved.
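
A minimal sketch of what the fixed message could look like, identifying the file by FID only (illustrative code, not the actual patch in change 35443):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Stand-in for struct lu_fid (sequence:oid:version). */
struct fid_model {
        uint64_t seq;
        uint32_t oid;
        uint32_t ver;
};

/* Print the discard warning from the FID alone: no path lookup, so no
 * dentry/inode references are ever taken in the ptlrpcd completion
 * context, and the deadlock cycle cannot form. */
static void discard_warn_by_fid(const struct fid_model *fid)
{
        printf("dirty page discard: lost writes to [0x%" PRIx64 ":0x%x:0x%x]\n",
               fid->seq, fid->oid, fid->ver);
}

int main(void)
{
        struct fid_model fid = { 0x200000401ULL, 0x5, 0x0 };    /* example FID */
        discard_warn_by_fid(&fid);
        return 0;
}

An administrator can then map the FID back to a path offline with "lfs fid2path", as discussed in the comments below.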



 Comments   
Comment by Gerrit Updater [ 08/Jul/19 ]

Ann Koehler (amk@cray.com) uploaded a new patch: https://review.whamcloud.com/35443
Subject: LU-12522 llite: Remove fname from dirty page discard warning
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bcb448be2b42e94db4930f9614b532d5bb564fa2

Comment by Andreas Dilger [ 08/Jul/19 ]

We were asked by a customer to add the filename to these messages in LU-921, to make it easier for the admin to determine which files are affected when a client is evicted. Is there some other way that we could detect the circular dependency (e.g. mark the thread or inode as already being processed by ll_dirty_page_discard_warn() and avoid the freeing, or get an extra dentry or inode reference in osc_build_rpc() so that the last reference is always put in brw_interpret() after ll_dirty_page_discard_warn() is finished)?
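
For the second suggestion, here is a sketch of the intended ordering, again as a userspace model with hypothetical names rather than actual osc code: the RPC itself pins the inode, so the warning's transient reference can never be the last one, and the final put happens only after the extent is marked finished.

#include <stdio.h>
#include <stdbool.h>

struct inode_model {
        int  refcount;
        bool extent_in_flight;
};

static void iput_model(struct inode_model *i)
{
        if (--i->refcount > 0)
                return;
        if (i->extent_in_flight)
                printf("DEADLOCK: final put with the extent still in flight\n");
        else
                printf("inode evicted cleanly\n");
}

static void discard_warn_model(struct inode_model *i)
{
        i->refcount++;          /* path lookup takes a reference ... */
        printf("warning: dirty pages discarded\n");
        iput_model(i);          /* ... but the RPC's pin keeps count > 0 */
}

int main(void)
{
        /* osc_build_rpc() would take this extra reference up front. */
        struct inode_model inode = { .refcount = 1, .extent_in_flight = true };

        /* brw_interpret(): warn first, finish the extent, then unpin. */
        discard_warn_model(&inode);     /* safe: not the last reference */
        inode.extent_in_flight = false; /* extent completed */
        iput_model(&inode);             /* final put has nothing to wait on */
        return 0;
}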

Furthermore, in most real life cases of this warning (even those that deadlock), the file name is never successfully retrieved.

could you please explain that a bit further? Is that because nobody cares to read the console error messages, or because the node is rebooted shortly after being evicted, or something else?

Comment by Ann Koehler (Inactive) [ 09/Jul/19 ]

Furthermore, in most real life cases of this warning (even those that deadlock), the file name is never successfully retrieved.

could you please explain that a bit further? Is that because nobody cares to read the console error messages, or because the node is rebooted shortly after being evicted, or something else?

In the hundreds of these warning messages that I've seen in console logs over the last year or so, I can't remember a single one with a nonempty filename. Not a scientific study by any means, but I do see a lot of log files. When this bug was first identified, about 1.5 years ago, I thought the deadlock was rare enough and the filename useful enough that we should just live with the bug. But since then we've seen half a dozen or so occurrences of the deadlock, and I've noted that the filename is often or always null.

And yes, there's likely a way to preserve the file name in the warning. Another approach would be to acquire the file name further up the stack and stash it away for use in this message. That just seemed like overkill for information that doesn't get set in most cases.

Comment by Patrick Farrell (Inactive) [ 09/Jul/19 ]

amk, this further discussion is bringing back some temporarily buried memories here.  In the case of the deadlock, there is no file name because the deadlock is with the file being deleted.  Do you know what the log looks like in non-deadlock cases?  I think the message probably works in those cases...?

Comment by Ann Koehler (Inactive) [ 09/Jul/19 ]

Do you know what the log looks like in non-deadlock cases? I think the message probably works in those cases...?

Not that I've seen. The warnings occur following evictions. I've been watching those, and I haven't seen file names even without the deadlock. Again, not scientific, but it informs my opinion.

Given the pushback on removing the file name, I'm inclined to withdraw my patch and leave the deadlock as an open issue in the backlog.

Comment by Andreas Dilger [ 13/Jul/19 ]

Another approach would be to acquire the file name further up the stack and stash it a way for use in this message. Just seemed like overkill for info that doesn't seem to get set in most cases.

It definitely doesn't make sense to add work to the common case for information that is only useful in a very uncommon case.

Comment by Andreas Dilger [ 13/Jul/19 ]

I looked through a bunch of older logs that included ll_dirty_page_discard_warn(), and while there are several that have valid filenames, those are all with Lustre 2.5.x. All of the more recent messages show "[FID]:/" as the pathname, so I agree it isn't very useful as it stands.

Being able to run "lfs fid2path" on the FID afterward is OK for people who know Lustre, but it doesn't help many users, and it will fail for files that no longer exist. In the latter case (the file was already deleted), it isn't clear there is much value in investigating an issue with a file that no longer exists.
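
For completeness, the same lookup can be done programmatically through liblustreapi. A small sketch, assuming a client mounted at /mnt/lustre and an illustrative FID; llapi_fid2path() returns nonzero if the file has been deleted:

#include <stdio.h>
#include <limits.h>
#include <lustre/lustreapi.h>

int main(void)
{
        char path[PATH_MAX];
        long long recno = -1;
        int linkno = 0;

        /* Equivalent of "lfs fid2path /mnt/lustre 0x200000401:0x5:0x0". */
        int rc = llapi_fid2path("/mnt/lustre", "0x200000401:0x5:0x0",
                                path, sizeof(path), &recno, &linkno);
        if (rc)
                fprintf(stderr, "fid2path failed: %d (file deleted?)\n", rc);
        else
                printf("%s\n", path);
        return rc ? 1 : 0;
}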

Comment by Ann Koehler (Inactive) [ 29/Oct/19 ]

See LU-12915 for a different approach to reporting the names of files that have had dirty pages discarded.
