Lustre / LU-5525

ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Environment: MDS node, Lustre 2.4.2-14chaos, ZFS OBD

    Description

      After upgrading to Lustre 2.4.2-14chaos (see github.com/chaos/lustre), we soon hit the following assertion on one of our MDS nodes:

      mdt_handler.c:3652:mdt_intent_lock_replace()) ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed
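      The failed check asserts that the replacement lock carries no granted read or write references at the moment it replaces the original intent lock. A minimal, self-contained C sketch of that invariant (the struct and function here are simplified stand-ins, not Lustre's actual `struct ldlm_lock` or `mdt_intent_lock_replace()`):

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Hypothetical, simplified stand-in for Lustre's struct ldlm_lock:
       * only the two reference counts the assertion inspects. */
      struct ldlm_lock {
              int l_readers;  /* granted read-mode references */
              int l_writers;  /* granted write-mode references */
      };

      /* Sketch of the invariant: the new lock must be unreferenced
       * before it can replace the original lock. */
      static void intent_lock_replace(struct ldlm_lock *new_lock)
      {
              assert(new_lock->l_readers + new_lock->l_writers == 0);
              /* ... actual replacement logic would follow ... */
      }

      int main(void)
      {
              struct ldlm_lock ok = { 0, 0 };

              intent_lock_replace(&ok);  /* passes: no active references */
              printf("assertion holds for an unreferenced lock\n");
              return 0;
      }
      ```

      In the crash reported here, some path evidently handed `mdt_intent_lock_replace()` a lock that still held at least one reader or writer reference, which trips the LASSERT and panics the MDS.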

      Perhaps most significantly, this tag of our Lustre tree includes the patch entitled:

      LU-4584 mdt: ensure orig lock is found in hash upon resend

      James Simmons reported this assertion when he tested the LU-4584 patch, but Bruno's evaluation was that the assertion was unrelated to the patch.

      Whether it is related or not, we need to fix the problem.

      Attachments

        Issue Links

          Activity

            [LU-5525] ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

            This is not a machine for which we can provide logs or crash dumps. No, we do not know the details of the various workloads running at the time.

            The filesystem has 768 OSTs on 768 OSS nodes. The default stripe count is 1. Some users do use stripe counts of various widths, some over 700.

            I don't know what your notes say, but I checked and we crashed the MDS fifteen times during the three days that we were running with the LU-4584 patch. We then rebooted the MDS onto Lustre version 2.4.2-14.1chaos (which does nothing but revert the LU-4584 patch), and we have not seen this ticket's assertion in over three days.

            While not conclusive, the circumstantial evidence points strongly at the LU-4584 patch either introducing the bug or making it much easier to hit.

            Here is the list of 2.4.2-14chaos to 2.4.2-14.1chaos changes:

            $ git log --oneline 2.4.2-14chaos..2.4.2-14.1chaos
            f28b8cc Revert "LU-4584 mdt: ensure orig lock is found in hash upon resend"
            

            You can find those tags at github.com/chaos/lustre.

            morrone Christopher Morrone (Inactive) added a comment (edited)

            Hello Chris,

            I checked my notes for LU-4584, and this LBUG occurred without my patch being applied.

            I suspect this will not be possible for you, but just in case: did you run with any debug levels enabled in the Lustre trace at the time of these 3 crashes? If so, could you extract and provide at least one of the traces? Or better, could you provide a crash dump?

            Also, and I know it is not an easy question, but do you have any idea what workload may have caused this?

            I am trying to set up a platform to reproduce this in-house, so any details about your configuration are of interest (number of MDSs/MDTs, DNE?, number of OSSs/OSTs, OST indexing, default striping, ...).

            Thanks again and in advance for your help.

            bfaccini Bruno Faccini (Inactive) added a comment

            We have hit this assertion 3 times in just over 24 hours on the MDS of our largest production filesystem. I reverted the patch for LU-4584 in the hope of getting us back into a reasonably operational state again. We should probably know in a day or two if the problem is gone.

            morrone Christopher Morrone (Inactive) added a comment

            Yes, I am investigating the long history around this issue ...

            bfaccini Bruno Faccini (Inactive) added a comment
            pjones Peter Jones added a comment -

            Bruno

            Can you please advise on this ticket?

            Thanks

            Peter


            Here is the backtrace for the mdt01_006 thread that hit the assertion:

            mdt_intent_lock_replace
            mdt_intent_reint
            mdt_intent_policy
            ldlm_lock_enqueue
            ldlm_handle_enqueue0
            mdt_enqueue
            mdt_handle_common
            mds_regular_handle
            ptlrpc_server_handle_request
            ptlrpc_main
            
            
            morrone Christopher Morrone (Inactive) added a comment (edited)

            People

              bfaccini Bruno Faccini (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 7
