ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • None
    • Environment: MDS node, Lustre 2.4.2-14chaos, ZFS OBD
    • 3
    • 15383

    Description

      After upgrading to Lustre 2.4.2-14chaos (see github.com/chaos/lustre), we soon hit the following assertion on one of our MDS nodes:

      mdt_handler.c:3652:mdt_intent_lock_replace()) ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

      Perhaps most significantly, this tag of our Lustre tree includes the patch entitled:

      LU-4584 mdt: ensure orig lock is found in hash upon resend

      James Simmons reported this assertion when he tested the LU-4584 patch, but Bruno's evaluation was that the assertion was unrelated to the patch.

      Whether it is related or not, we need to fix the problem.

          Activity

            [LU-5525] ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

            morrone Christopher Morrone (Inactive) added a comment -

            "But now I am encountering issues during the make step. Chris, do you use any special tricks to build from your source tree?"

            Maybe? Could you be more specific about the commands you issued and the problem that you saw? If you had trouble in the liblustre part, try adding --disable-liblustre to the configure command line.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Chris,
            Could you please answer my previous request and describe any specific procedure needed to build from the LLNL source tree/git?

            bfaccini Bruno Faccini (Inactive) added a comment -

            James,
            It's unfortunate that at the time of this crash you were running with a Lustre debug mask of only "D_IOCTL+D_NETERROR+D_WARNING+D_ERROR+D_EMERG+D_HA+D_CONFIG+D_CONSOLE", which did not also include at least "D_RPCTRACE+D_DLMTRACE". Those flags would have greatly helped in navigating the crash dump.
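To make the point about the debug mask concrete, here is a minimal sketch that checks a '+'-separated mask string for the two flags Bruno asks for. The helper name `missing_debug_flags` and the constant `REQUIRED_FLAGS` are hypothetical and exist only for this illustration; they are not part of Lustre or its tools.

```python
# Hypothetical helper for illustration only; not part of Lustre.
# Flags that the comment above says would have helped with the crash dump.
REQUIRED_FLAGS = {"D_RPCTRACE", "D_DLMTRACE"}


def missing_debug_flags(mask, required=REQUIRED_FLAGS):
    """Return the required flags that are absent from a '+'-separated mask."""
    present = {flag.strip() for flag in mask.split("+") if flag.strip()}
    return sorted(set(required) - present)


# The mask quoted in the comment above.
mask = "D_IOCTL+D_NETERROR+D_WARNING+D_ERROR+D_EMERG+D_HA+D_CONFIG+D_CONSOLE"
print(missing_debug_flags(mask))
```

For the mask quoted in the comment, both flags are reported as missing, which is the gap being pointed out.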

            bfaccini Bruno Faccini (Inactive) added a comment -

            But now I am encountering issues during the make step. Chris, do you use any special tricks to build from your source tree?

            bfaccini Bruno Faccini (Inactive) added a comment -

            OK, I found that the 2.6.32-431.3.1 kernel is compatible with the 2.4.2-14chaos kernel patches ...

            morrone Christopher Morrone (Inactive) added a comment -

            "I wanted to ask whether you have already tried running your LU-4584 reproducer against it?"

            Yes. The LU-4584 patch did appear to help with the evictions on unlink.

            simmonsja James A Simmons added a comment -

            I can report that I don't see this issue with 2.5 servers. I see another problem instead.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Chris, can you tell me which kernel version you build 2.4.2-14chaos against? I am having problems patching the 2.6.32-358.23.2 kernel version that we use to build our latest b2_4/2.4.3-RC1 branch/tag.

            simmonsja James A Simmons added a comment -

            Oops, you are right. I saw this problem with our 2.4 image which did contain the LU-4584 patch. Sorry for the confusion. I pushed the proper debug rpms this time.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Chris, I am really sorry that my b2_4 patch for LU-4584 seems to be causing you so much trouble.
            This LBUG did not occur when running auto-tests and your LU-4584 reproducer against the Jenkins build for patch-set #7 of Gerrit change #9488, and I am still unable to reproduce it with a local build made of tag 2.4.3-RC1 + the LU-4584 patch (i.e., patch-set #7 of Gerrit change #9488).
            I am currently building a Lustre version based on 2.4.2-14chaos to see if I can reproduce it in-house, but I wanted to ask whether you have already tried running your LU-4584 reproducer against it?

            bfaccini Bruno Faccini (Inactive) added a comment -

            James, the "crash" tool reports "WARNING: kernel version inconsistency between vmlinux and dumpfile" and gives up due to further errors ...
            Are you sure that at the time of the crash you were running the "2.6.32-431.17.1.el6.wc" kernel version for which you provided the debuginfo RPMs? BTW, the strings output from vmcore and vmcore-dmesg.txt reports "2.6.32-358.23.2.el6.atlas"...
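The mismatch described above can be checked before loading a dump into "crash". The sketch below is only an illustration under stated assumptions: the regular expression assumes RHEL 6 style release strings of the form seen in this ticket, the function names are hypothetical, and the sample banner line is made up for the example rather than taken from the actual vmcore.

```python
import re

# Assumption: release strings look like "2.6.32-<numbers>.el6.<site tag>",
# as in the two versions quoted in the comment above.
RELEASE_RE = re.compile(r"\b2\.6\.32-[0-9.]+\.el6\.[A-Za-z0-9_]+")


def releases_in(text):
    """Collect every kernel release string found in the given text."""
    return set(RELEASE_RE.findall(text))


def matches_debuginfo(dump_text, debuginfo_release):
    """Return (is_match, found_releases) for text extracted from a dump."""
    found = releases_in(dump_text)
    return debuginfo_release in found, found


# Made-up sample standing in for text pulled from vmcore-dmesg.txt.
sample = "Linux version 2.6.32-358.23.2.el6.atlas (illustrative banner line)"
ok, found = matches_debuginfo(sample, "2.6.32-431.17.1.el6.wc")
print(ok, found)
```

A false result here means the debuginfo RPMs do not correspond to the kernel that actually crashed, which is the situation Bruno describes.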

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)