[LU-4309] mds_intent_policy ASSERTION(new_lock != NULL) failed: op 0x8 lockh 0x0

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 1.8.9
    • Environment: RHEL 5.9/distro IB
    • Severity: 2

    Description

      An MDT thread hit an assertion in mds_intent_policy in what otherwise appeared to be normal operation.

      I'm attaching the kernel log messages after the LBUG. These are from the console. We have a crash dump from the node, but no lustre log files.

      Lustre build:
      Nov 18 12:46:55 widow-mds2 kernel: [ 387.597792] Lustre: Build Version: v1_8_9_WC1--CHANGED-2.6.18-348.3.1.el5.widow
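
      A note on the assertion message itself: the "op" value is the LDLM intent opcode carried by the client request. If memory serves, the Lustre 1.8 headers define the intent opcodes as small bit flags with 0x8 being IT_GETATTR, which is why the discussion below centers on a client getattr. The flag values in the sketch below are an assumption based on that recollection, not a verified excerpt of the source; the decoder itself is only an illustration.

      #include <stdio.h>

      /* Hypothetical decoder for the "op 0x8" value in the assertion.
       * The flag values are assumed from the Lustre 1.8 intent opcode
       * definitions (IT_OPEN, IT_CREAT, ...) and should be checked
       * against the actual headers. */
      #define IT_OPEN     0x0001
      #define IT_CREAT    0x0002
      #define IT_READDIR  0x0004
      #define IT_GETATTR  0x0008
      #define IT_LOOKUP   0x0010

      static void decode_intent(unsigned int op)
      {
          printf("op 0x%x:", op);
          if (op & IT_OPEN)    printf(" IT_OPEN");
          if (op & IT_CREAT)   printf(" IT_CREAT");
          if (op & IT_READDIR) printf(" IT_READDIR");
          if (op & IT_GETATTR) printf(" IT_GETATTR");
          if (op & IT_LOOKUP)  printf(" IT_LOOKUP");
          printf("\n");
      }

      int main(void)
      {
          decode_intent(0x8);    /* the value from the LBUG message */
          return 0;
      }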

      Attachments

        Activity

          jamesanunez James Nunez (Inactive) added a comment - Thank you for the update. I will close this ticket.

          blakecaldwell Blake Caldwell added a comment - Let's close it.

          hilljjornl Jason Hill (Inactive) added a comment -

          So this filesystem is out of production (in a hold state before decommissioning); my assertion is that we should go ahead and close this issue. Even if we integrated the patch and ran the storage system with it for a while, it would never get any client access and likely would not exercise the code path the patch covers. Any objections?

          -Jason

          jamesanunez James Nunez (Inactive) added a comment - Blake, Thanks for the update.

          blakecaldwell Blake Caldwell added a comment - I haven't been able to apply this debug patch yet. The system has been stable, and as a result we haven't had an unscheduled outage to apply that patch. So nothing to report at this time. I will try applying the debug patch to another system that we can take down sooner.

          jamesanunez James Nunez (Inactive) added a comment -

          Blake,

          Are you still seeing this assertion on your systems? If so, were you able to apply the patch to collect more information?

          Thanks,
          James
          laisiyao Lai Siyao added a comment -

          I am not able to find the problem in the code, so I composed a debug patch to dump the request before this assert. Could you apply it and gather more information if this failure occurs again?
          laisiyao Lai Siyao added a comment -

          Hmm, there is not much we can do in this case IMO, since an MDS crash will hang the whole system and it's hard to trace back to the client. I'll review the related code further to understand this assert.

          blakecaldwell Blake Caldwell added a comment -

          Without the lustre logs in /tmp, I won't be able to track down the client. Even if the client could be identified from the crash dump, there is still the problem of identifying what it was doing at the time.

          I see that the dmesg output is not very helpful, but that's all I have other than a crash dump.

          So that we are better prepared for these cases in the future, what information can be collected on the server side besides /tmp/lustre.*? Collecting client debug logs is very difficult due to the number of clients. Would an ldlm_namespace_dump be helpful? If the LBUG has already occurred, are there any debug flags for /proc/sys/lnet/debug that would provide useful information? Since the offending request has already been made, does capturing +net +dlmtrace +rpctrace do any good?
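
          A note on the flag syntax in the question above: the /proc/sys/lnet/debug mask accepts "+flag"/"-flag" updates, so the extra tracing can be added without clobbering the existing mask. The following is a minimal sketch, assuming the 1.8-era /proc/sys/lnet/debug interface Blake references; it is equivalent to an echo from a shell and is only an illustration.

          #include <stdio.h>

          /* Minimal sketch: append the debug flags mentioned above to the
           * current LNET debug mask.  Assumes the /proc/sys/lnet/debug
           * interface referenced in the comment; equivalent to
           *   echo "+net +dlmtrace +rpctrace" > /proc/sys/lnet/debug
           */
          int main(void)
          {
              FILE *f = fopen("/proc/sys/lnet/debug", "w");

              if (f == NULL) {
                  perror("fopen /proc/sys/lnet/debug");
                  return 1;
              }
              /* "+flag" adds to the mask, "-flag" removes */
              fprintf(f, "+net +dlmtrace +rpctrace\n");
              fclose(f);
              return 0;
          }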
          laisiyao Lai Siyao added a comment -

          Do you know which client's getattr caused this ASSERT? If so, can you check the backtrace of the process on that client that is doing the getattr?
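
          On the backtrace question above: a RHEL 5-era client (2.6.18 kernel) has no /proc/<pid>/stack, so the usual options are a SysRq 't' dump to the console or a quick scan of each task's wait channel to spot the process stuck in the getattr path. The sketch below does the latter; it is a hypothetical helper using only standard procfs files, not anything from this ticket.

          #include <ctype.h>
          #include <dirent.h>
          #include <stdio.h>

          /* Hypothetical helper: list every task's pid, state, command name
           * and wait channel (the kernel function it is sleeping in), which
           * is often enough to spot a process stuck in a getattr path on a
           * 2.6.18-era client.  Uses only standard procfs files. */
          int main(void)
          {
              DIR *proc = opendir("/proc");
              struct dirent *de;
              char path[256], comm[64], wchan[128];
              char state;
              int pid;
              FILE *f;

              if (proc == NULL) {
                  perror("opendir /proc");
                  return 1;
              }
              while ((de = readdir(proc)) != NULL) {
                  if (!isdigit((unsigned char)de->d_name[0]))
                      continue;

                  /* /proc/<pid>/stat starts with "pid (comm) state ..." */
                  snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
                  f = fopen(path, "r");
                  if (f == NULL)
                      continue;
                  if (fscanf(f, "%d (%63[^)]) %c", &pid, comm, &state) != 3) {
                      fclose(f);
                      continue;
                  }
                  fclose(f);

                  /* /proc/<pid>/wchan holds the symbol the task sleeps in */
                  snprintf(path, sizeof(path), "/proc/%s/wchan", de->d_name);
                  wchan[0] = '\0';
                  f = fopen(path, "r");
                  if (f != NULL) {
                      if (fgets(wchan, sizeof(wchan), f) == NULL)
                          wchan[0] = '\0';
                      fclose(f);
                  }

                  printf("%6d %c %-16s %s\n", pid, state, comm, wchan);
              }
              closedir(proc);
              return 0;
          }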

          People

            Assignee: laisiyao Lai Siyao
            Reporter: blakecaldwell Blake Caldwell
            Votes: 0
            Watchers: 6
