[LU-10] Client stuck in ptlrpc_invalidate_import() after eviction Created: 26/Oct/10 Updated: 28/Jun/11 Resolved: 13/Jun/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre 1.8.3.0-5chaos |
||
| Attachments: |
|
| Severity: | 4 |
| Rank (Obsolete): | 10250 |
| Description |
|
We had a client get stuck in ptlrpc_invalidate_import() after it was evicted. Info will be limited since it was on the secure network.

On the console, the client is printing this every ten minutes:

ptlrpc_invalidate_import()) ls3-OST01c4_UUID: rc = -110 waiting for callback

and it is the ll_imp_inval thread that appears to be looping indefinitely (it was printing that for well over a month before I was alerted to the problem).

The thread "ldlm_bl_11" was stuck in sync_page(), with the following backtrace:

schedule

Whether that is symptom or cause for the hung import invalidate, I do not know. |
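For readers unfamiliar with the symptom: on Linux, rc = -110 is -ETIMEDOUT, so the repeated console message means that each wait for outstanding requests timed out and was retried. The wait-and-retry pattern can be sketched in Python as below. This is a toy model under stated assumptions, not Lustre code: `invalidate_import`, `inflight_per_wait`, and `max_waits` are hypothetical names introduced for illustration, and the cap on the number of waits exists only so the example terminates (the ticket reports the real loop continuing for over a month).

```python
import errno


def invalidate_import(inflight_per_wait, max_waits):
    """Toy model of waiting for in-flight requests to drain.

    inflight_per_wait: hypothetical list giving how many requests are
        still outstanding at the end of each wait period.
    max_waits: hypothetical cap so the example always terminates.

    Returns (warnings, drained): the messages logged on each timed-out
    wait, and whether the outstanding count ever reached zero.
    """
    warnings = []
    for inflight in inflight_per_wait[:max_waits]:
        if inflight == 0:
            # Everything completed; invalidation can proceed.
            return warnings, True
        # The wait timed out with requests still outstanding.
        # -errno.ETIMEDOUT is -110 on Linux.
        warnings.append(f"rc = {-errno.ETIMEDOUT} waiting for callback")
    return warnings, False


# One request whose callback never fires: every wait times out.
stuck_warnings, stuck_drained = invalidate_import([1] * 5, max_waits=5)

# A request that completes during the third wait.
ok_warnings, ok_drained = invalidate_import([1, 1, 0], max_waits=10)
```

If the outstanding count never reaches zero, as with a request whose callback never fires, the model logs a warning on every wait and never finishes, which matches the behaviour reported above.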
| Comments |
| Comment by Sam Bigger (Inactive) [ 26/Oct/10 ] |
|
FYI, the problem is being actively looked at now. |
| Comment by Robert Read (Inactive) [ 27/Oct/10 ] |
|
Lai, please look into this. Chris, where can we get the source tree for the version being used in production? |
| Comment by Lai Siyao [ 27/Oct/10 ] |
|
Chris, could you get the backtrace of all processes on that machine? I want to know which process may have locked the page to be removed by ldlm_bl_11. |
| Comment by Christopher Morrone [ 27/Oct/10 ] |
|
The source is available here: http://github.com/morrone/lustre I typed in the only backtrace that was interesting. No other processes have a backtrace that explains why the lock is held. Everything else was pretty much in a normal idle state. |
| Comment by Robert Read (Inactive) [ 27/Oct/10 ] |
|
I suspect ldlm_bl_11 is waiting for the same rpc that the invalidate thread is waiting for, so this is probably a symptom. |
| Comment by Lai Siyao [ 28/Oct/10 ] |
|
What I can tell from the messages is:
So, as the log message suggested ("Network is sluggish?"), I will continue checking the code to find out why ptlrpc_bulk_desc->bd_network_rw is 1 under this condition. |
| Comment by Lai Siyao [ 31/Oct/10 ] |
|
I believe this is the same bug as https://bugzilla.lustre.org/show_bug.cgi?id=21760. The patch is https://bugzilla.lustre.org/attachment.cgi?id=30963; it appears to work, but it has not landed yet. |
| Comment by Robert Read (Inactive) [ 02/Nov/10 ] |
|
Lai, that patch has been backed out of the tree (that's what the - flags are for). However, the new attachment looks promising: https://bugzilla.lustre.org/attachment.cgi?id=32032 |
| Comment by Lai Siyao [ 02/Nov/10 ] |
|
Robert, though the original patch was reverted, I think it is correct, and Dmitry will continue discussing it with Johann. As for the patch you mentioned, it is unnecessary and has been discarded according to the latest update on bugzilla. |
| Comment by Dan Ferber (Inactive) [ 09/Nov/10 ] |
|
Lai, can you post your test results and any other thoughts to the BZ bug, as that would help Dmitry, Oleg, and Cory, and maybe note in this bug that they've been posted there? Do you still think, as Dmitry does, that the patch in the bug will fix this problem? |
| Comment by Lai Siyao [ 10/Nov/10 ] |
|
I think the root cause of this bug is not that we forget to unregister bulk, but mix reply This patch b_21760.diff |
| Comment by Dan Ferber (Inactive) [ 12/Nov/10 ] |
|
Lai, are you ready for Chris to test your attached patch? |
| Comment by Lai Siyao [ 12/Nov/10 ] |
|
Yes, this patch should fix the symptom listed above. Bug 21760 may involve other bugs as well; I will continue looking into that. |
| Comment by Lai Siyao [ 16/Nov/10 ] |
|
This patch has a problem handling expired requests, and Johann thinks it is too big a change and possibly too intrusive; he will propose a patch later. |
| Comment by Lai Siyao [ 17/Nov/10 ] |
|
The patch I proposed will cause problems on network errors, and Johann said he will provide a less intrusive patch. It is better to wait for Johann's fix and then start testing. |
| Comment by Lai Siyao [ 19/Nov/10 ] |
|
Johann provided a patch. I think it may be incomplete (he said he will rethink it), but it can fix the symptom described above, so it is okay to start testing with Johann's patch now. |
| Comment by Christopher Morrone [ 01/Dec/10 ] |
|
Johann landed the patch on b1_8 for 1.8.6. I will pull it into the llnl branch. |
| Comment by Lai Siyao [ 27/Jan/11 ] |
|
Hi Chris, did you see this failure again after landing? If not, can we close this issue? |
| Comment by Christopher Morrone [ 28/Jan/11 ] |
|
It was landed, but the code hasn't made it onto production clusters yet. It is rolling out with a release now. |
| Comment by Peter Jones [ 13/Jun/11 ] |
|
This has been running in production for a while, so I think that it is safe to mark it as resolved. Please reopen if this reoccurs. |