LU-5525: ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Environment: MDS node, Lustre 2.4.2-14chaos, ZFS OBD
    • Severity: 3

    Description

      After upgrading to Lustre 2.4.2-14chaos (see github.com/chaos/lustre), we soon hit the following assertion on one of our MDS nodes:

      mdt_handler.c:3652:mdt_intent_lock_replace()) ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

      Perhaps most significantly, this tag of our Lustre tree includes the patch entitled:

      LU-4584 mdt: ensure orig lock is found in hash upon resend

      James Simmons reported this assertion when he tested the LU-4584 patch, but Bruno's evaluation was that the assertion was unrelated to the patch.

      Whether it is related or not, we need to fix the problem.
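
      For context, the failed check is an LASSERT in mdt_intent_lock_replace(), which swaps the freshly created lock for an already-granted one (for example on a resent intent); the new lock must not be holding any read or write references at the point where it is given up. Below is a minimal, self-contained sketch of that invariant only; demo_lock and demo_lock_replace are made-up names, and only the l_readers/l_writers fields and the asserted expression come from this ticket, not the actual mdt_handler.c code.

      #include <assert.h>
      #include <stdio.h>

      /* Simplified stand-in for an LDLM lock: only the two reference
       * counts named in the failed assertion are modelled. */
      struct demo_lock {
          int l_readers;   /* read-mode references held on the lock */
          int l_writers;   /* write-mode references held on the lock */
      };

      /* Hypothetical illustration of the invariant checked in
       * mdt_intent_lock_replace(): before the freshly created lock is
       * dropped in favour of the original granted lock, it must not be
       * pinned by any reader or writer reference. */
      static void demo_lock_replace(struct demo_lock *new_lock,
                                    struct demo_lock *orig_lock)
      {
          /* Counterpart of:
           * LASSERT(new_lock->l_readers + new_lock->l_writers == 0); */
          assert(new_lock->l_readers + new_lock->l_writers == 0);

          /* ... here the reply would be fixed up to point at orig_lock ... */
          (void)orig_lock;
      }

      int main(void)
      {
          struct demo_lock orig  = { .l_readers = 1, .l_writers = 0 };
          struct demo_lock fresh = { .l_readers = 0, .l_writers = 0 };

          demo_lock_replace(&fresh, &orig);   /* passes */

          fresh.l_readers = 1;                /* a leaked reference ...       */
          demo_lock_replace(&fresh, &orig);   /* ... trips the assertion here */

          printf("not reached\n");
          return 0;
      }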

    Attachments

    Issue Links

    Activity

            [LU-5525] ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

            bfaccini Bruno Faccini (Inactive) added a comment:
            Ok, found that 2.6.32-431.3.1 kernel is compatible with 2.4.2-14chaos kernel patches ...

            morrone Christopher Morrone (Inactive) added a comment:
            "I wanted to ask you if you tried to run your LU-4584 reproducer against it already?"
            Yes. The LU-4584 patch did appear to help with the evictions on unlink.

            simmonsja James A Simmons added a comment:
            I can report I don't see this issue with 2.5 servers. I see another problem instead.

            bfaccini Bruno Faccini (Inactive) added a comment:
            Chris, can you tell me which kernel version you use/build with 2.4.2-14chaos? I am having problems patching the 2.6.32-358.23.2 kernel version we use to build our latest b2_4/2.4.3-RC1 branch/tag.

            simmonsja James A Simmons added a comment:
            Oops, you are right. I saw this problem with our 2.4 image, which did contain the LU-4584 patch. Sorry for the confusion. I pushed the proper debug RPMs this time.

            bfaccini Bruno Faccini (Inactive) added a comment:
            Chris, I am really sorry that my b2_4 patch for LU-4584 seems to be causing you so much trouble.
            This LBUG did not occur when running auto-tests and your LU-4584 reproducer against the Jenkins build for patch-set #7 of Gerrit change #9488, and I am still unable to reproduce it when running a local build made from tag 2.4.3-RC1 plus the LU-4584 patch (i.e., patch-set #7 of Gerrit change #9488).
            I am currently building a Lustre version based on 2.4.2-14chaos to see if I can reproduce it in-house, but I wanted to ask whether you have already tried running your LU-4584 reproducer against it.

            James, "crash" tool claims "WARNING: kernel version inconsistency between vmlinux and dumpfile" and gives up due to further errors ...
            Are you sure that at the time of crash you were running with this "2.6.32-431.17.1.el6.wc" kernel version for which you provided the debuginfo RPMs ? BTW, strings out from vmcore and vmcore-dmesg.txt report "2.6.32-358.23.2.el6.atlas"...

            bfaccini Bruno Faccini (Inactive) added a comment - James, "crash" tool claims "WARNING: kernel version inconsistency between vmlinux and dumpfile" and gives up due to further errors ... Are you sure that at the time of crash you were running with this "2.6.32-431.17.1.el6.wc" kernel version for which you provided the debuginfo RPMs ? BTW, strings out from vmcore and vmcore-dmesg.txt report "2.6.32-358.23.2.el6.atlas"...
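
            (As an aside, which kernel actually produced a dump can be checked without the crash tool at all, e.g. with strings vmcore | grep "Linux version". Below is a minimal, hypothetical C sketch of that same scan; the program and its logic are illustrative only and not part of any Lustre or crash tooling, assuming only that the dump embeds the kernel's "Linux version " banner.)

            #include <stdio.h>
            #include <string.h>

            /* Hypothetical sketch (not part of the crash tool): scan a dump
             * file for the kernel's "Linux version " banner, roughly what
             * `strings vmcore | grep "Linux version"` reports. */
            int main(int argc, char **argv)
            {
                static const char needle[] = "Linux version ";
                const size_t nlen = sizeof(needle) - 1;
                static char buf[1 << 16];
                size_t keep = 0;
                FILE *fp;

                if (argc != 2) {
                    fprintf(stderr, "usage: %s <vmcore>\n", argv[0]);
                    return 1;
                }
                fp = fopen(argv[1], "rb");
                if (fp == NULL) {
                    perror(argv[1]);
                    return 1;
                }

                for (;;) {
                    size_t got = fread(buf + keep, 1, sizeof(buf) - keep, fp);
                    size_t len = keep + got;
                    size_t i;

                    if (len < nlen)
                        break;

                    for (i = 0; i + nlen <= len; i++) {
                        if (memcmp(buf + i, needle, nlen) == 0) {
                            /* Print the banner up to the next unprintable byte. */
                            size_t j = i;
                            while (j < len && buf[j] >= ' ' && buf[j] < 127)
                                putchar(buf[j++]);
                            putchar('\n');
                        }
                    }

                    if (got == 0)
                        break;
                    /* Keep a small tail so a banner split across two reads
                     * is still found on the next pass. */
                    keep = nlen - 1;
                    memmove(buf, buf + len - keep, keep);
                }

                fclose(fp);
                return 0;
            }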

            simmonsja James A Simmons added a comment:
            Sorry, I forgot the debug RPMs. I just uploaded them to the same ftp spot. ORNL would really like to go to 2.5 ASAP.
            Yes, I built b2_5 at commit a43e0e4ce4b57240540e8d35a8ba44e203c70ae1 with some additional patches. We tag our kernels to keep our automated rpm update system from stomping on them. They are basically the RHEL Lustre-patched kernels. We also build our own kernel RPMs to enable certain things like timestamps in printk.

            bfaccini Bruno Faccini (Inactive) added a comment (edited):
            Hello James, thanks for the crash-dump! But I also need the corresponding vmlinux and Lustre modules to allow the crash tool to run on it. By "the latest b2_5 branch with the LU-2827 patch", do you mean you used one of our recent Jenkins builds to get this crash? If yes, can you point me to which one? And if not, can you also transfer the two kernel-debuginfo[-common] RPMs and the lustre-modules (or lustre-debuginfo) RPM?
            BTW, I checked the vmcore-dmesg.txt file you provided with the crash-dump, and its content seems to indicate that you are running with both a kernel and a Lustre distro you built locally, right?

            simmonsja James A Simmons added a comment:
            I hit this bug using the latest b2_5 branch with the LU-2827 patch. Good news for you is that I can share my crash dumps. A simple simul run on our Cray test bed produced this Oops. I uploaded the vmcore and dmesg to ftp.whamcloud.com/uploads/LU-5525.

            morrone Christopher Morrone (Inactive) added a comment (edited):
            This is not a machine for which we can provide logs or crash dumps. No, we don't know the details of the various workloads going on at the time.

            The filesystem has 768 OSTs on 768 OSS nodes. The default stripe count is 1. Some users do use stripe counts at various widths, some over 700.

            I don't know what your notes say, but I checked and we crashed the MDS fifteen times during the three days that we were running with the LU-4584 patch. We then rebooted the MDS onto Lustre version 2.4.2-14.1chaos (which does nothing but revert the LU-4584 patch), and we have not seen this ticket's assertion in over three days.

            While not conclusive, the circumstantial evidence points strongly at the LU-4584 patch either introducing the bug or making it much easier to hit.

            Here is the list of 2.4.2-14chaos to 2.4.2-14.1chaos changes:

            $ git log --oneline 2.4.2-14chaos..2.4.2-14.1chaos
            f28b8cc Revert "LU-4584 mdt: ensure orig lock is found in hash upon resend"

            You can find those tags at github.com/chaos/lustre.

            People

              bfaccini Bruno Faccini (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 7
