Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5525

ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • None
    • MDS node, Lustre 2.4.2-14chaos, ZFS OBD
    • 3
    • 15383

    Description

      After upgrading to lustre 2.4.2-14chaos (see github.com/chaos/lustre), we soon hit the following assertion on one of our MDS nodes:

      mdt_handler.c:3652:mdt_intent_lock_replace()) ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

      Perhaps most significantly, this tag of our lustre tree includes the patch entitled:

      LU-4584 mdt: ensure orig lock is found in hash upon resend

      James Simmons reported this assertion when he tested the LU-4584 patch, but the Bruno made the evaluation that the assertion was unrelated to the patch.

      Whether it is related or not, we need to fix the problem.

      Attachments

        Issue Links

          Activity

            [LU-5525] ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed

            Looks like a bug. You can work around it by using --with-downstream-release:

            $ sh autogen.sh
            $ ./configure --with-downstream-release=14chaos --disable-liblustre
            $ make rpms
            

            If you are trying to reproduce what we see, we use zfs not ldiskfs.

            Also, is reproduction likely to be an effective approach to this problem? What do we know about the code, and how do we intend to reproduce the issue?

            morrone Christopher Morrone (Inactive) added a comment - Looks like a bug. You can work around it by using --with-downstream-release: $ sh autogen.sh $ ./configure --with-downstream-release=14chaos --disable-liblustre $ make rpms If you are trying to reproduce what we see, we use zfs not ldiskfs. Also, is reproduction likely to be an effective approach to this problem? What do we know about the code, and how do we intend to reproduce the issue?

            I 1st cloned the chaos git, then selected the 2.4.2-14chaos tag.
            And then I applied the/our usual procedure described at "https://wiki.hpdd.intel.com/pages/viewpage.action?pageId=8126821", to generate a ldiskfs-based (sorry I forgot to mention this ...) Server build, and get these error during the "make rpms" step :

            ...............
            
            Wrote: /home/bruno.faccini/kernel/rpmbuild/SRPMS/lustre-iokit-1.4.0-1.src.rpm
            make[1]: Leaving directory `/home/bruno.faccini/lustre-chaos/lustre-iokit'
            rpmbuilddir=`mktemp -t -d lustre-build-$USER-XXXXXXXX`; \
                    make  \
                            rpmbuilddir="$rpmbuilddir" \
                            rpm-local || exit 1; \
                    rpmbuild \
                            --define "_tmppath $rpmbuilddir/TMP" \
                            --define "_topdir $rpmbuilddir" \
                            --define "build_src_rpm 1" \
                            --define "dist %{nil}" \
                    -ts lustre-2.4.2.tar.gz || exit 1; \
                    cp $rpmbuilddir/SRPMS/*.src.rpm . || exit 1; \
                    rm -f -R $rpmbuilddir
            make[1]: Entering directory `/home/bruno.faccini/lustre-chaos'
            make[1]: Leaving directory `/home/bruno.faccini/lustre-chaos'
            error: Macro %rel_down has empty body
            error: Macro %rel_build has empty body
            error: line 96: Empty tag: Release:
            make: *** [srpm] Error 1
            

            So I suspect I don't use the right procedure for your tree since we don't have such "rpmbuilddir" variable/macro in our build infrastructure, which plays with/changes "_topdir" setting ...

            bfaccini Bruno Faccini (Inactive) added a comment - I 1st cloned the chaos git, then selected the 2.4.2-14chaos tag. And then I applied the/our usual procedure described at "https://wiki.hpdd.intel.com/pages/viewpage.action?pageId=8126821", to generate a ldiskfs-based (sorry I forgot to mention this ...) Server build, and get these error during the "make rpms" step : ............... Wrote: /home/bruno.faccini/kernel/rpmbuild/SRPMS/lustre-iokit-1.4.0-1.src.rpm make[1]: Leaving directory `/home/bruno.faccini/lustre-chaos/lustre-iokit' rpmbuilddir=`mktemp -t -d lustre-build-$USER-XXXXXXXX`; \ make \ rpmbuilddir="$rpmbuilddir" \ rpm-local || exit 1; \ rpmbuild \ --define "_tmppath $rpmbuilddir/TMP" \ --define "_topdir $rpmbuilddir" \ --define "build_src_rpm 1" \ --define "dist %{nil}" \ -ts lustre-2.4.2.tar.gz || exit 1; \ cp $rpmbuilddir/SRPMS/*.src.rpm . || exit 1; \ rm -f -R $rpmbuilddir make[1]: Entering directory `/home/bruno.faccini/lustre-chaos' make[1]: Leaving directory `/home/bruno.faccini/lustre-chaos' error: Macro %rel_down has empty body error: Macro %rel_build has empty body error: line 96: Empty tag: Release: make: *** [srpm] Error 1 So I suspect I don't use the right procedure for your tree since we don't have such "rpmbuilddir" variable/macro in our build infrastructure, which plays with/changes "_topdir" setting ...

            But now I encounter issues during the make step, Chris, do you use special tricks to build from your source tree ??

            Maybe? Could you be more specific about the commands you issued and the problem that you saw? If you had trouble in the liblustre part, try adding --disable-liblustre to the configure command line.

            morrone Christopher Morrone (Inactive) added a comment - But now I encounter issues during the make step, Chris, do you use special tricks to build from your source tree ?? Maybe? Could you be more specific about the commands you issued and the problem that you saw? If you had trouble in the liblustre part, try adding --disable-liblustre to the configure command line.

            Chris,
            Please, can you help+answer me about my previous request and detail any specific/procedure to build from LLNL source tree/git ?

            bfaccini Bruno Faccini (Inactive) added a comment - Chris, Please, can you help+answer me about my previous request and detail any specific/procedure to build from LLNL source tree/git ?

            James,
            It's too bad that at the time of this crash you ran with a Lustre debug mask made of only "D_IOCTL+D_NETERROR+D_WANING+D_ERROR+D_EMERG+D_HA+D_CONFIG+D_CONSOLE" and which did not contain at least "D_RPCTRACE+D_DLMTRACE" in addition ... That would have greatly helped to navigate in the crash-dump.

            bfaccini Bruno Faccini (Inactive) added a comment - James, It's too bad that at the time of this crash you ran with a Lustre debug mask made of only "D_IOCTL+D_NETERROR+D_WANING+D_ERROR+D_EMERG+D_HA+D_CONFIG+D_CONSOLE" and which did not contain at least "D_RPCTRACE+D_DLMTRACE" in addition ... That would have greatly helped to navigate in the crash-dump.

            But now I encounter issues during the make step, Chris, do you use special tricks to build from your source tree ??

            bfaccini Bruno Faccini (Inactive) added a comment - But now I encounter issues during the make step, Chris, do you use special tricks to build from your source tree ??

            Ok, found that 2.6.32-431.3.1 kernel is compatible with 2.4.2-14chaos kernel patches ...

            bfaccini Bruno Faccini (Inactive) added a comment - Ok, found that 2.6.32-431.3.1 kernel is compatible with 2.4.2-14chaos kernel patches ...

            I wanted to ask you if you tried to run your LU-4584 reproducer against it already ?

            Yes. The LU-4584 patch did appear to help with the evictions on unlink.

            morrone Christopher Morrone (Inactive) added a comment - I wanted to ask you if you tried to run your LU-4584 reproducer against it already ? Yes. The LU-4584 patch did appear to help with the evictions on unlink.

            I can report I don't see this issue with 2.5 servers. I see another problem instead.

            simmonsja James A Simmons added a comment - I can report I don't see this issue with 2.5 servers. I see another problem instead.

            Chris, can you help and tell me the kernel version you use/build with 2.4.2-14chaos, I am having problems when patching the 2.6.32-358.23.2 kernel version we use to build our latest b2_4/2.4.3-RC1 branch/tag.

            bfaccini Bruno Faccini (Inactive) added a comment - Chris, can you help and tell me the kernel version you use/build with 2.4.2-14chaos, I am having problems when patching the 2.6.32-358.23.2 kernel version we use to build our latest b2_4/2.4.3-RC1 branch/tag.

            Oops, you are right. I saw this problem with our 2.4 image which did contain the LU-4584 patch. Sorry for the confusion. I pushed the proper debug rpms this time.

            simmonsja James A Simmons added a comment - Oops, you are right. I saw this problem with our 2.4 image which did contain the LU-4584 patch. Sorry for the confusion. I pushed the proper debug rpms this time.

            People

              bfaccini Bruno Faccini (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: