[LU-59] call traces on MDS for ldlm_expired_completion_wait() - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 1.8.6
Labels: None

Severity: 3
Rank (Obsolete): 10088

Description

The call traces happened on MDS for ldlm_expired_completion_wait() and status was changed to LBUG.
This seems to be similar to bug 21967, but there are no patches for lustre-1.8.x right now.
I'm attaching the console log from the MDS. Could you please check whether this is the same bug as 21967? If so, please backport the patch from 21967 to 1.8.x as well.

Thanks
Ihara

Attachments

t2s007019.console_log
368 kB
04/Feb/11 2:47 AM

Activity

Peter Jones added a comment - 13/Jun/11 7:32 PM

Great - thanks Ihara!

Shuichi Ihara (Inactive) added a comment - 13/Jun/11 7:26 PM

Fine. As far as we can see, the problem seems to be fixed by the patch in b23352.

Thanks!

Peter Jones added a comment - 13/Jun/11 1:55 PM

Ihara

Do you have any further questions or can we close out this ticket?

Thanks

Peter

Niu Yawei (Inactive) added a comment - 21/Feb/11 12:37 AM

With adaptive timeouts, the lock callback timeout can be very long (the default maximum is 600 seconds). Consequently, a server worker thread might wait in ldlm_cli_enqueue_local() for a very long time, which eventually triggers the watchdog to dump the stack trace. So I think it's not necessarily a bug.
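
The interaction described above can be sketched as a toy model. This is only an illustration, not Lustre code: the function names, the `at_min` default of 0, and the watchdog threshold are assumptions made for the example; only the 600-second maximum comes from the comment.

```python
# Illustrative model only. The watchdog threshold and the at_min default
# are hypothetical values, not taken from this ticket or from Lustre source.
AT_MAX_DEFAULT = 600  # seconds; default adaptive-timeout maximum cited above


def effective_timeout(estimate, at_min=0, at_max=AT_MAX_DEFAULT):
    """Clamp an adaptive timeout estimate into the range [at_min, at_max]."""
    return max(at_min, min(estimate, at_max))


def watchdog_fires(wait_seconds, watchdog_threshold):
    """Return True if a thread blocked this long would trigger a stack dump."""
    return wait_seconds > watchdog_threshold


# Hypothetical numbers: a slow system pushes the estimate past at_max,
# so the thread may wait the full 600 seconds, longer than the watchdog allows.
wait = effective_timeout(estimate=900)
print(wait, watchdog_fires(wait, watchdog_threshold=300))
```

Under these assumed numbers the wait is capped at 600 seconds, which is still longer than the assumed watchdog threshold, so a stack dump appears even though nothing has crashed.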

Shuichi Ihara (Inactive) added a comment - 21/Feb/11 12:22 AM

Niu,

We haven't seen the same issue since applying the patch from b23352.
But don't we have any fixes in the adaptive timeout area?

Niu Yawei (Inactive) added a comment - 20/Feb/11 10:03 PM

Hi, Ihara

What are the test results after applying the b23352 fix?

Niu Yawei (Inactive) added a comment - 10/Feb/11 6:42 AM

Thank you, Ihara. I think the fix in LU-27 doesn't help much with this issue, whereas the at_min fix in b23352 can avoid unnecessary client evictions, which might reduce the chance of seeing such error messages. You can try this patch to see whether things improve.

What I don't understand is why the lock enqueue didn't time out in time (and triggered the watchdog in the end). I will investigate adaptive timeouts further and get back to you later.

Shuichi Ihara (Inactive) added a comment - 10/Feb/11 5:58 AM

Niu, sorry, there are frequent MDS crashes (or hangs) at the customer site. One of the reasons might be LU-27, which caused an LBUG. Also, we might be hitting bug 23352 on TCP clients. I saw these problems sometimes happen at the same time, so I thought this case was also an LBUG, but you are right: there is no LBUG in these console messages.

We applied the patches from LU-27 and bug 23352 and are keeping an eye on whether we still see the same problem even after applying them. Do you think these error messages can be caused by LU-27 or bug 23352?
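
A quick way to check whether a console log contains an LBUG at all, as opposed to only watchdog stack traces, is to count the lines that mention each string. The sketch below is a minimal illustration: the helper name `count_markers` is hypothetical, and the sample lines are made-up placeholders, not excerpts from the attached log.

```python
def count_markers(log_text, markers=("LBUG", "ldlm_expired_completion_wait")):
    """Count how many lines of a console log mention each marker string."""
    counts = {marker: 0 for marker in markers}
    for line in log_text.splitlines():
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


# Made-up placeholder lines, not copied from t2s007019.console_log.
sample = "\n".join([
    "placeholder: thread blocked, ldlm_expired_completion_wait() in trace",
    "placeholder: unrelated message",
    "placeholder: ldlm_expired_completion_wait() again",
])
print(count_markers(sample))
```

For a real log, the text could be read from the file and passed in the same way; a zero count for "LBUG" alongside a non-zero count for the wait function would match the situation described in this comment.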

Niu Yawei (Inactive) added a comment - 08/Feb/11 11:38 PM

Hi, Ihara

"The call traces happened on MDS for ldlm_expired_completion_wait() and status was changed to LBUG"
I don't quite understand "status was changed to LBUG"; could you explain further?

The log shows that many server threads were blocked on local lock enqueue for a long time, which triggered the watchdog to dump the stack traces. I suspect the reason is that some client holding locks was evicted by the server (maybe a dead or hung client), and the server's local lock enqueue triggered a blocking AST to the evicted client. The blocking AST should have timed out soon; however, for some reason it didn't expire in time, which left server threads waiting for a long time, and the watchdog was eventually triggered.

Is it easy to reproduce? If so, could you turn off adaptive timeouts to see whether the problem goes away? Thanks.

Niu Yawei (Inactive) added a comment - 07/Feb/11 10:53 PM

There are some statistics/diagnostic patches and one 'disable COS by default' patch in bug 22598, and I don't think 1.8 has COS, so the patches in 22598 might not be helpful for this issue.

Yes, I meant adaptive timeout. Thank you.

People

Assignee: Niu Yawei (Inactive)

Reporter: Shuichi Ihara (Inactive)

Votes: 0

Watchers: 1

Dates

Created: 04/Feb/11 2:47 AM

Updated: 28/Jun/11 3:01 PM

Resolved: 13/Jun/11 7:32 PM