call traces on MDS for ldlm_expired_completion_wait()

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 1.8.6
    • None
    • 3
    • 10088

    Description

      The call traces happened on MDS for ldlm_expired_completion_wait() and status was changed to LBUG.
      This seems to be similar to bug 21967, but there are no patches for lustre-1.8.x right now.
      I'm attaching the console log from the MDS. Could you please check whether this is the same bug as 21967? If so, please also backport the patch in 21967 to 1.8.x.

      Thanks
      Ihara

      Attachments

        Activity

          [LU-59] call traces on MDS for ldlm_expired_completion_wait()
          pjones Peter Jones added a comment -

          Great - thanks Ihara!

          ihara Shuichi Ihara (Inactive) added a comment -

          Fine. As far as we can see, the problem seems to be fixed by the patch in b23352.

          Thanks!

          pjones Peter Jones added a comment -

          Ihara

          Do you have any further questions or can we close out this ticket?

          Thanks

          Peter

          niu Niu Yawei (Inactive) added a comment -

          With adaptive timeout, the lock callback timeout can be very long (the default maximum is 600 seconds). Consequently, the server working thread might wait in ldlm_cli_enqueue_local() for a very long time, which eventually triggers the watchdog to dump the stack trace. So I think it's not necessarily a bug.
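
A minimal sketch of the timing argument in the comment above, written as a toy model rather than Lustre code. The 600-second cap is the default maximum mentioned in the comment; the watchdog limit of 300 seconds and all function names are hypothetical, chosen only to show how a long but legitimate wait can still produce a stack dump.

```python
# Toy model of the timing argument; this is not Lustre code.
# AT_MAX_S comes from the comment above (default maximum of 600 seconds).
# WATCHDOG_LIMIT_S is a hypothetical value used only for illustration.

AT_MAX_S = 600
WATCHDOG_LIMIT_S = 300  # hypothetical watchdog threshold


def callback_timeout(estimate_s, at_max_s=AT_MAX_S):
    """Cap an adaptive timeout estimate at the configured maximum."""
    return min(estimate_s, at_max_s)


def worst_case_wait(estimate_s):
    """Worst case: the thread waits out the full callback timeout
    because the client holding the lock never answers."""
    return callback_timeout(estimate_s)


def watchdog_would_fire(wait_s, limit_s=WATCHDOG_LIMIT_S):
    """A thread blocked longer than the watchdog limit gets its stack dumped."""
    return wait_s > limit_s


if __name__ == "__main__":
    # Print estimate, resulting worst-case wait, and whether the watchdog fires.
    for estimate in (30, 200, 450, 900):
        wait = worst_case_wait(estimate)
        print(estimate, wait, watchdog_would_fire(wait))
```

Under these assumptions, any timeout estimate above the watchdog limit yields a stack dump even though the lock wait itself is still within its allowed window, which is the sense in which the trace need not indicate a bug.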

          ihara Shuichi Ihara (Inactive) added a comment -

          Niu,

          We haven't seen the same issue since applying the patch in b23352.
          But don't we have any fixes in the adaptive timeout area?

          niu Niu Yawei (Inactive) added a comment -

          Hi, Ihara

          What were the test results after the b23352 fix was applied?

          niu Niu Yawei (Inactive) added a comment -

          Thank you, Ihara. I think the fix in LU-27 doesn't help much with this issue, whereas the at_min fix in b23352 can avoid unnecessary client evictions, which might reduce the chance of seeing such error messages. I think you can try this patch to see if things get better.

          What I don't understand is why the lock enqueue didn't time out in time (and triggered the watchdog in the end). I will investigate Adaptive Timeout further and get back to you later.

          ihara Shuichi Ihara (Inactive) added a comment -

          Niu, sorry, there have been frequent MDS crashes (or hangs) at the customer site. One of the reasons might be LU-27, which caused an LBUG. Also, we might have hit bug 23352 on TCP clients. I saw these problems sometimes happen at the same time, so I thought this case was also an LBUG, but you are right: there is no LBUG in these console messages.

          We applied the patches in LU-27 and bug 23352 and are watching to see whether the same problem still occurs with these patches applied. Do you think these error messages could be caused by LU-27 or bug 23352?

          niu Niu Yawei (Inactive) added a comment -

          Hi, Ihara

          "The call traces happened on MDS for ldlm_expired_completion_wait() and status was changed to LBUG"
          I don't quite understand "status was changed to LBUG". Could you explain further?

          The log shows that many server threads were blocked on local lock enqueue for a long time, which triggered the watchdog to dump the stack traces. I suspect that a client holding locks was evicted by the server (maybe a dead or hung client), and the server's local lock enqueue sent a blocking AST to the evicted client. The blocking AST should have timed out soon, but for some reason it didn't expire in time, which left server threads waiting for a long time until the watchdog was triggered.

          Is it easy to reproduce? If so, could you turn off Adaptive Timeout to see if the problem goes away? Thanks.

          niu Niu Yawei (Inactive) added a comment -

          There are some statistic/diagnostic patches and one 'disable COS by default' patch in bug 22598, and I don't think 1.8 has COS, so the patches in 22598 might not be helpful for this issue.

          Yes, I meant adaptive timeout. Thank you.

          People

            niu Niu Yawei (Inactive)
            ihara Shuichi Ihara (Inactive)