[LU-10] Client stuck in ptlrpc_invalidate_import() after eviction Created: 26/Oct/10  Updated: 28/Jun/11  Resolved: 13/Jun/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre 1.8.3.0-5chaos


Attachments: File b_21760.diff    
Severity: 4
Rank (Obsolete): 10250

 Description   

We had a client get stuck in ptlrpc_invalidate_import() after it was evicted. Info will be limited since it was on the secure network.

On the console, the client is printing this every ten minutes:

ptlrpc_invalidate_import()) ls3-OST01c4_UUID: rc = -110 waiting for callback (1 != 0)
ptlrpc_invalidate_import()) Skipped 5 previous similar messages
ptlrpc_invalidate_import()) @@@ still on sending list req@<hex> x<xid>/t0 o4->ls3-OST01c4_UUID@<ip>@tcp:6/4 len 448/608 e 5 to 1 dl <time> ref 2 fl Unregistering:ES/0/0 rc -4/0
ptlrpc_invalidate_import()) Skipped 5 previous similar messages
ptlrpc_invalidate_import()) ls3-OST01c4_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
ptlrpc_invalidate_import()) Skipped 5 previous similar messages

It is the ll_imp_inval thread that appears to be looping indefinitely (it had been printing that for well over a month before I was alerted to the problem).
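The shape of that loop, as a minimal stand-alone C sketch rather than the actual 1.8 source (the real function uses Lustre's l_wait_event() machinery and walks the import's request lists):

/* Minimal stand-alone sketch (not the actual 1.8 source) of the shape of
 * ptlrpc_invalidate_import(): wait up to a timeout for in-flight RPCs to
 * drain, complain, and retry forever.  -110 is -ETIMEDOUT, which matches
 * the "rc = -110" in the console messages above. */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in: the stuck OST_WRITE keeps this at 1 forever. */
static int inflight_rpcs(void) { return 1; }

static void invalidate_import(const char *obd_name)
{
        while (inflight_rpcs() != 0) {
                sleep(600);  /* the ten-minute cadence seen on the console */
                fprintf(stderr, "%s: rc = -110 waiting for callback (%d != 0)\n",
                        obd_name, inflight_rpcs());
                /* the real function also dumps requests still on the
                 * sending/delayed lists ("@@@ still on sending list") */
        }
}

int main(void)
{
        invalidate_import("ls3-OST01c4_UUID");  /* never returns: the hang */
        return 0;
}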

The thread "ldlm_bl_11" was stuck in sync_page(), with the following backtrace:

schedule
io_schedule
sync_page
__wait_on_bit_lock
__lock_page
ll_page_removal_cb
cache_remove_lock
lock_handle_addref
class_handle2object
ldlm_cli_cancel_local
ldlm_cli_cancel
osc_extent_blocking_cb
ldlm_handle_bl_callback
ldlm_bl_thread_main

Whether that is symptom or cause for the hung import invalidate, I do not know.



 Comments   
Comment by Sam Bigger (Inactive) [ 26/Oct/10 ]

FYI, the problem is being actively looked at now.

Comment by Robert Read (Inactive) [ 27/Oct/10 ]

Lai, please look into this.

Chris, where can we get the source tree for the version being used in production?

Comment by Lai Siyao [ 27/Oct/10 ]

Chris, could you get backtraces of all processes on that machine? I want to know which process may have locked the page that ldlm_bl_11 is trying to remove.

Comment by Christopher Morrone [ 27/Oct/10 ]

The source is available here:

http://github.com/morrone/lustre

I typed in the only backtrace that was interesting. No other process has a backtrace that explains why the lock is held; everything else was pretty much in a normal idle state.

Comment by Robert Read (Inactive) [ 27/Oct/10 ]

I suspect ldlm_bl_11 is waiting for the same RPC that the invalidate thread is waiting for, so this is probably a symptom.

Comment by Lai Siyao [ 28/Oct/10 ]

What I can tell from the messages is:

  • import is still in EVICTED state.
  • the req in sending_list is an OST_WRITE. Is it always in RQ_PHASE_UNREGISTERING? If so, it means ptlrpc_bulk_desc->bd_network_rw is 1.
  • ldlm_bl_11 is stuck in sync_page() because the above request is not complete, and the page is still locked by that request.

So, as the log message suggests ("Network is sluggish?"), I will continue checking the code to see why ptlrpc_bulk_desc->bd_network_rw is 1 under this condition.
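To illustrate the state the request appears to be stuck in, here is a minimal stand-alone sketch with pared-down, hypothetical structs (the real ptlrpc types are far larger):

/* Minimal stand-alone sketch with hypothetical, pared-down structs:
 * bd_network_rw == 1 means LNet may still be reading/writing the bulk
 * buffers, so the pages stay locked and the request cannot leave the
 * "Unregistering" phase. */
#include <stdio.h>

struct bulk_desc { int bd_network_rw; };          /* 1: LNet still owns buffers */
struct request   { struct bulk_desc *rq_bulk; };

static int bulk_active(const struct request *req)
{
        return req->rq_bulk != NULL && req->rq_bulk->bd_network_rw;
}

int main(void)
{
        struct bulk_desc desc = { .bd_network_rw = 1 };  /* never cleared */
        struct request req = { .rq_bulk = &desc };

        /* While this stays true, ptlrpc_invalidate_import() keeps printing
         * 'RPCs in "Unregistering" phase found (1)'. */
        printf("bulk still active: %d\n", bulk_active(&req));
        return 0;
}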

Comment by Lai Siyao [ 31/Oct/10 ]

I believe this is the same bug as https://bugzilla.lustre.org/show_bug.cgi?id=21760

And https://bugzilla.lustre.org/attachment.cgi?id=30963 is the patch; it seems to work, but it has not landed yet.

Comment by Robert Read (Inactive) [ 02/Nov/10 ]

Lai, that patch has been backed out of the tree (that's what the - flags are for). However, the new attachment looks promising: https://bugzilla.lustre.org/attachment.cgi?id=32032

Comment by Lai Siyao [ 02/Nov/10 ]

Robert, though the original patch is reverted, I think it's correct, and Dmitry will continue discussing it with Johann. As for the patch you mentioned, it's unnecessary and has been discarded according to the latest update on Bugzilla.

Comment by Dan Ferber (Inactive) [ 09/Nov/10 ]

Lai, can you post your test results and any other thoughts to the BZ bug (that would help Dmitry, Oleg, and Cory), and perhaps note in this bug that they've been posted there? Do you still think, as Dmitry does, that the patch in the bug will fix this problem?

Comment by Lai Siyao [ 10/Nov/10 ]

I think the root cause of this bug is not that we forget to unregister bulk, but that we mix the reply-unregistering and bulk-unregistering phases together. Dmitry's patch may mistakenly unregister bulk (see the code just after after_reply() in ptlrpc_check_set(): it only wants to unregister the reply, not the bulk).

The attached patch b_21760.diff adds a new phase, RQ_PHASE_BULK_UNREGISTERING: a request in RQ_PHASE_UNREGISTERING will only wait for the reply to be unregistered, while a request in RQ_PHASE_BULK_UNREGISTERING waits for the bulk to be unregistered.
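Schematically, the intent is that each unregistering phase waits only for its own unlink event. The enum values and fields below are illustrative stand-ins for the sketch, not the exact contents of the patch:

/* Illustrative stand-ins, not the exact b_21760.diff: split "waiting for
 * the reply MD to unlink" from "waiting for the bulk MD to unlink" so the
 * check loop can treat them independently. */
enum rq_phase {
        RQ_PHASE_RPC,
        RQ_PHASE_UNREGISTERING,        /* wait for reply unlink only */
        RQ_PHASE_BULK_UNREGISTERING,   /* wait for bulk unlink only  */
        RQ_PHASE_INTERPRET,
};

struct request {
        enum rq_phase phase;
        int reply_unlinked;            /* reply MD gone from LNet */
        int bulk_unlinked;             /* bulk MD gone from LNet  */
};

/* One pass of a ptlrpc_check_set()-style loop over a single request: each
 * phase advances on its own event, so unregistering the reply can no
 * longer be conflated with unregistering the bulk. */
void check_one(struct request *req)
{
        if (req->phase == RQ_PHASE_UNREGISTERING && req->reply_unlinked)
                req->phase = RQ_PHASE_INTERPRET;
        else if (req->phase == RQ_PHASE_BULK_UNREGISTERING && req->bulk_unlinked)
                req->phase = RQ_PHASE_INTERPRET;
}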

Comment by Dan Ferber (Inactive) [ 12/Nov/10 ]

Lai, are you ready for Chris to test your attached patch?

Comment by Lai Siyao [ 12/Nov/10 ]

Yes, this patch should fix the symptom listed above. As for bug 21760, it may involve other bugs; I will continue looking into that.

Comment by Lai Siyao [ 16/Nov/10 ]

This patch has a problem handling expired requests, and Johann thinks it's too big a change and may be too intrusive; he will propose a patch later.

Comment by Lai Siyao [ 17/Nov/10 ]

The patch I proposed will cause problems on network errors, and Johann said he will provide a less intrusive patch. It's better to wait for Johann's fix and then start testing.

Comment by Lai Siyao [ 19/Nov/10 ]

Johann provided a patch. I think it may be incomplete (he said he will rethink it); however, it fixes the symptom described above, so it's okay to start testing with it now.

Comment by Christopher Morrone [ 01/Dec/10 ]

Johann landed the patch on b1_8 for 1.8.6. I will pull it into the llnl branch.

Comment by Lai Siyao [ 27/Jan/11 ]

Hi Chris, did you see this failure again after the patch landed? If not, can we close this issue?

Comment by Christopher Morrone [ 28/Jan/11 ]

It was landed, but the code hasn't made it onto production clusters yet. It is rolling out with a release now.

Comment by Peter Jones [ 13/Jun/11 ]

This has been running in production for a while, so I think it is safe to mark this as resolved. Please reopen if it reoccurs.
