[LU-59] call traces on MDS for ldlm_expired_completion_wait() Created: 04/Feb/11  Updated: 28/Jun/11  Resolved: 13/Jun/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File t2s007019.console_log    
Severity: 3
Rank (Obsolete): 10088

 Description   

The call traces happened on MDS for ldlm_expired_completion_wait() and status was changed to LBUG.
This seems to be similar to bug 21967, but there are no patches for lustre-1.8.x right now.
I'm attaching the console log from the MDS. Could you please find out whether this is the same bug as 21967? If so, please backport the patch from 21967 to 1.8.x as well.

Thanks
Ihara



 Comments   
Comment by Peter Jones [ 04/Feb/11 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 07/Feb/11 ]

Hi, Ihara

It looks similar to bug 21967; the difference is that the timeout value in this trace is much longer than in 21967. But I didn't see any patches on bug 21967, so it seems to be an unresolved issue.

BTW: In your test, is the dynamic timeout feature enabled?

Comment by Shuichi Ihara (Inactive) [ 07/Feb/11 ]

Ah, 21967 was closed for some reason. I thought the problem was related to bug 22598, for which there are no patches on the 1.8.x branch.

Does the dynamic timeout feature mean Adaptive Timeout (AT)? If so, yes: I didn't explicitly disable AT, so it should be enabled by default.

Comment by Niu Yawei (Inactive) [ 07/Feb/11 ]

There are some statistics/diagnostic patches and one 'disable COS by default' patch in bug 22598, and I don't think 1.8 has COS, so the patches in 22598 might not be helpful for this issue.

Yes, I meant adaptive timeout. Thank you.

Comment by Niu Yawei (Inactive) [ 08/Feb/11 ]

Hi, Ihara

"The call traces happened on MDS for ldlm_expired_completion_wait() and status was changed to LBUG"
I don't quite understand "status was changed to LBUG"; could you explain further?

The log shows that many server threads were blocked on local lock enqueue for a long time, which triggered the watchdog to dump the stack traces. I suspect that a client holding locks was evicted by the server (perhaps a dead or hung client), and the server's local lock enqueue sent a blocking AST to the evicted client. The blocking AST should have timed out soon, but for some reason it didn't expire in time, which left the server threads waiting for a long time until the watchdog was finally triggered.

Is it easy to reproduce? If it's easy, could you turn off the Adaptive Timeout to see if the problem is gone? Thanks.

Comment by Shuichi Ihara (Inactive) [ 10/Feb/11 ]

Niu, sorry, there have been frequent MDS crashes (or hangs) at the customer site. One of the causes might be LU-27, which caused an LBUG. Also, we might have hit bug 23352 on TCP clients. I saw these problems sometimes happen at the same time, so I thought this case was also an LBUG, but you are right: there is no LBUG in these console messages.

We applied the patches from LU-27 and bug 23352 and are keeping an eye on whether we still see the same problem after applying them. Do you think these error messages could be caused by LU-27 or bug 23352?

Comment by Niu Yawei (Inactive) [ 10/Feb/11 ]

Thank you, Ihara. I don't think the fix in LU-27 helps much with this issue, whereas the at_min fix in b23352 can avoid unnecessary client evictions, which might reduce the chance of seeing such error messages. I think you can try this patch to see if things get better.

What I don't understand is why the lock enqueue didn't time out in time (and triggered the watchdog in the end). I will investigate the Adaptive Timeout further and get back to you later.

Comment by Niu Yawei (Inactive) [ 20/Feb/11 ]

Hi, Ihara

What was the test result after the b23352 fix was applied?

Comment by Shuichi Ihara (Inactive) [ 21/Feb/11 ]

Niu,

We haven't seen the same issue since applying the patch from b23352.
But are there any fixes in the adaptive timeout area?

Comment by Niu Yawei (Inactive) [ 21/Feb/11 ]

With adaptive timeout, the lock callback timeout can be very long (the default maximum is 600 seconds), and consequently the server worker thread might wait in ldlm_cli_enqueue_local() for a very long time, which eventually triggers the watchdog to dump the stack trace. So I think it's not necessarily a bug.
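
The reasoning above can be illustrated with a rough sketch. This is not Lustre code: the function names, the lower bound of 0 seconds, and the 300-second watchdog limit are hypothetical values chosen for illustration; only the 600-second default maximum comes from this ticket. It shows that a timeout capped at 600 seconds can still exceed a shorter watchdog limit, so a stack dump may appear even though nothing is broken.

```python
# Illustrative model only; names and the watchdog limit are assumptions.

def clamp_timeout(estimate_s, at_min_s=0, at_max_s=600):
    """Clamp an adaptive timeout estimate to [at_min_s, at_max_s] seconds.

    at_max_s=600 mirrors the default maximum mentioned in this ticket;
    at_min_s=0 is an assumed lower bound.
    """
    return max(at_min_s, min(estimate_s, at_max_s))


def watchdog_fires(wait_s, watchdog_limit_s):
    """Return True if a thread waiting wait_s seconds exceeds the limit."""
    return wait_s > watchdog_limit_s


WATCHDOG_LIMIT_S = 300  # hypothetical watchdog limit, not taken from Lustre

if __name__ == "__main__":
    for estimate in (30, 200, 900):
        timeout = clamp_timeout(estimate)
        fired = watchdog_fires(timeout, WATCHDOG_LIMIT_S)
        print(f"estimate={estimate}s timeout={timeout}s watchdog_fires={fired}")
```

Under these assumptions, only the largest estimate is capped at the maximum, and that capped wait is the one long enough to trip the hypothetical watchdog.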

Comment by Peter Jones [ 13/Jun/11 ]

Ihara

Do you have any further questions or can we close out this ticket?

Thanks

Peter

Comment by Shuichi Ihara (Inactive) [ 13/Jun/11 ]

Fine. As far as we can see, the problem seems to be fixed by the patch in b23352.

thanks!

Comment by Peter Jones [ 13/Jun/11 ]

Great - thanks Ihara!

Generated at Sat Feb 10 01:03:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.