[LU-59] call traces on MDS for ldlm_expired_completion_wait() Created: 04/Feb/11 Updated: 28/Jun/11 Resolved: 13/Jun/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10088 |
| Description |
|
The call traces happened on MDS for ldlm_expired_completion_wait() and status was changed to LBUG. Thanks |
| Comments |
| Comment by Peter Jones [ 04/Feb/11 ] |
|
Niu, could you please look into this one? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 07/Feb/11 ] |
|
Hi Ihara, this looks similar to bug 21967; the difference is that the timeout value in this trace is much longer than in 21967. I didn't see any patches on bug 21967, so it seems to be an unresolved issue. BTW: in your test, is the dynamic timeout feature enabled? |
| Comment by Shuichi Ihara (Inactive) [ 07/Feb/11 ] |
|
Ah, 21967 was closed for some reason. I thought the problem was related to bug 22598, for which there are no patches on the 1.8.x branch. Does the dynamic timeout feature mean Adaptive Timeout? If so, yes: I didn't intentionally disable AT, so it should be enabled by default. |
| Comment by Niu Yawei (Inactive) [ 07/Feb/11 ] |
|
There are some statistics/diagnostic patches and one 'disable COS by default' patch in bug 22598, and I don't think 1.8 has COS, so the patches in 22598 might not be helpful for this issue. Yes, I meant adaptive timeout. Thank you. |
| Comment by Niu Yawei (Inactive) [ 08/Feb/11 ] |
|
Hi Ihara, regarding "The call traces happened on MDS for ldlm_expired_completion_wait() and status was changed to LBUG": the log shows that many server threads were blocked on local lock enqueue for a long time, which triggered the watchdog to dump the stack traces. I suspect that a client holding locks was evicted by the server (maybe a dead or hung client), and the server's local lock enqueue sent a blocking AST to the evicted client. The blocking AST should have timed out soon, but for some reason it didn't expire in time, which left server threads waiting for a long time until the watchdog was eventually triggered. Is it easy to reproduce? If so, could you turn off Adaptive Timeout to see whether the problem goes away? Thanks. |
| Comment by Shuichi Ihara (Inactive) [ 10/Feb/11 ] |
|
Niu, sorry, there are frequent MDS crashes (or hangs) at the customer site. One of the reasons might be We applied the patch in |
| Comment by Niu Yawei (Inactive) [ 10/Feb/11 ] |
|
Thank you, Ihara. I think the fix in What I don't understand is why the lock enqueue didn't time out in time (and eventually triggered the watchdog). I will investigate Adaptive Timeout further and get back to you later. |
| Comment by Niu Yawei (Inactive) [ 20/Feb/11 ] |
|
Hi Ihara, what was the test result after the b23352 fix was applied? |
| Comment by Shuichi Ihara (Inactive) [ 21/Feb/11 ] |
|
Niu, we haven't seen the same issue since applying the patch from b23352. |
| Comment by Niu Yawei (Inactive) [ 21/Feb/11 ] |
|
With adaptive timeout, the lock callback timeout can be very long (the default maximum is 600 seconds), and consequently a server worker thread might wait in ldlm_cli_enqueue_local() for a very long time, which eventually triggers the watchdog to dump the stack trace. So I think it's not necessarily a bug. |
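The reasoning in the comment above can be illustrated with a minimal sketch. This is not Lustre code: the function names, the 300-second watchdog threshold, and the sample estimates are assumptions chosen purely for illustration. Only the 600-second default maximum comes from the comment itself.

```python
# Illustrative model only, not Lustre code.
AT_MAX_DEFAULT = 600        # seconds; default adaptive timeout maximum cited above
WATCHDOG_THRESHOLD = 300    # seconds; hypothetical value, not taken from the ticket


def clamp_callback_timeout(estimate, max_timeout=AT_MAX_DEFAULT):
    """Cap an adaptive timeout estimate at the configured maximum (hypothetical helper)."""
    return min(estimate, max_timeout)


def watchdog_dumps_stack(callback_timeout, watchdog_threshold=WATCHDOG_THRESHOLD):
    """Return True if a thread that waits for the full callback timeout
    stays blocked longer than the watchdog threshold."""
    return callback_timeout > watchdog_threshold


if __name__ == "__main__":
    # Sample adaptive estimates in seconds (made up for illustration).
    for estimate in (100, 450, 900):
        timeout = clamp_callback_timeout(estimate)
        print(estimate, timeout, watchdog_dumps_stack(timeout))
```

Under these assumed numbers, any callback timeout above the watchdog threshold produces a stack dump even though the lock wait itself is still within its allowed limit, which matches the "not necessarily a bug" reading.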
| Comment by Peter Jones [ 13/Jun/11 ] |
|
Ihara, do you have any further questions, or can we close out this ticket? Thanks, Peter |
| Comment by Shuichi Ihara (Inactive) [ 13/Jun/11 ] |
|
Fine. As far as we can see, the problem seems to be fixed by the patch in b23352. Thanks! |
| Comment by Peter Jones [ 13/Jun/11 ] |
|
Great - thanks Ihara! |