Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-16589

sanityn test_55d: FAIL: (2) mv succeeded

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0, Lustre 2.15.3
    • Lustre 2.16.0, Lustre 2.15.3
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5fae0359-520f-440f-9515-09d401c4ae9c

      test_55d failed with the following error:

      == sanityn test 55d: rename file vs link ================= 17:25:02 (1677173102)
      CMD: onyx-65vm4 /usr/sbin/lctl set_param fail_loc=0x155
      fail_loc=0x155
      ln: failed to create hard link '/mnt/lustre2/d55d.sanityn/d55d.sanityn/' => '/mnt/lustre2/d55d.sanityn/f1': No such file or directory
       sanityn test_55d: @@@@@@ FAIL: (2) mv succeeded
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4392 - 5.14.21-150400.24.28-default
      servers: https://build.whamcloud.com/job/lustre-master/4392 - 4.18.0-425.10.1.el8_lustre.x86_64

      <<Please provide additional information about the failure here>>

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanityn test_55d - (2) mv succeeded

      Attachments

        Issue Links

          Activity

            [LU-16589] sanityn test_55d: FAIL: (2) mv succeeded

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50127/
            Subject: LU-16589 tests: fix hard-link failure in sanityn/55d
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 25c6b7ad2859729197c3cc6e6dcf0621e4bda6fa

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50127/ Subject: LU-16589 tests: fix hard-link failure in sanityn/55d Project: fs/lustre-release Branch: master Current Patch Set: Commit: 25c6b7ad2859729197c3cc6e6dcf0621e4bda6fa

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50348
            Subject: LU-16589 tests: fix hard-link failure in sanityn/55d
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 5b3526d01808b0dd270ae0655918fa3b7fc4f941

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50348 Subject: LU-16589 tests: fix hard-link failure in sanityn/55d Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 5b3526d01808b0dd270ae0655918fa3b7fc4f941

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50265
            Subject: LU-16589 tests: add sanity/31l to test ln command
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8b6e20e63be8edc4fda22f2318fa8df350c7297b

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50265 Subject: LU-16589 tests: add sanity/31l to test ln command Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 8b6e20e63be8edc4fda22f2318fa8df350c7297b
            yujian Jian Yu added a comment -

            Sure, Andreas. I just updated https://review.whamcloud.com/50127 to fix sanityn/55d. I'm working on a separate patch to add a new subtest for "ln". BTW, I just found the actual version of coreutils that started containing the "ln" change is 8.31.

            yujian Jian Yu added a comment - Sure, Andreas. I just updated https://review.whamcloud.com/50127 to fix sanityn/55d. I'm working on a separate patch to add a new subtest for " ln ". BTW, I just found the actual version of coreutils that started containing the " ln " change is 8.31.

            Jian, then it seems this bug has existed for so long and could even be considered an issue in the new coreutils? At least it is fixed in newer kernels, and it seems like a very uncommon use case, so it isn't clear that putting a ton of effort into fixing it in Lustre is worthwhile. Neil can work on backporting the fix to SLES15 (if needed), and James or someone can file a ticket to fix it in RHEL8/9.

            It seems the right thing for now is to fix the test (either add "/f1" at the end or remove the trailing "/") so that it is executing the original intent of the patch that added it.

            A separate patch should be made with a new subtest for only this "ln" case. This new subtest should be skipped for new coreutils and old kernels:

                    local ln_ver=$(ln --version | awk '/coreutils/ { print $4 }')
            
                    (( $(version_code $ln_ver) < $(version code 8.32) )) ||
                    (( $(version_code $(uname -r)) >= $(version_code 5.18) )) ||
                            skip "need coreutils < 8.32 or kernel >= 5.18 for ln"
            
                    touch $DIR/$tfile || error "create failed"
                    mkdir $DIR/$tdir || error "mkdir failed"
                    ln $DIR/$tfile $DIR/tdir/ || error "ln to '$tdir/' failed"
                    

            This will at least add test coverage for newer kernels and keep it for old coreutils, so that it doesn't regress in the future.  Additional checks can be added to allow running the test when we know particular distro kernels are fixed. 

            adilger Andreas Dilger added a comment - Jian, then it seems this bug has existed for so long and could even be considered an issue in the new coreutils? At least it is fixed in newer kernels, and it seems like a very uncommon use case, so it isn't clear that putting a ton of effort into fixing it in Lustre is worthwhile. Neil can work on backporting the fix to SLES15 (if needed), and James or someone can file a ticket to fix it in RHEL8/9. It seems the right thing for now is to fix the test (either add "/f1" at the end or remove the trailing "/") so that it is executing the original intent of the patch that added it. A separate patch should be made with a new subtest for only this "ln" case. This new subtest should be skipped for new coreutils and old kernels: local ln_ver=$(ln --version | awk '/coreutils/ { print $4 }' ) (( $(version_code $ln_ver) < $(version code 8.32) )) || (( $(version_code $(uname -r)) >= $(version_code 5.18) )) || skip "need coreutils < 8.32 or kernel >= 5.18 for ln" touch $DIR/$tfile || error "create failed" mkdir $DIR/$tdir || error "mkdir failed" ln $DIR/$tfile $DIR/tdir/ || error "ln to '$tdir/' failed" This will at least add test coverage for newer kernels and keep it for old coreutils, so that it doesn't regress in the future.  Additional checks can be added to allow running the test when we know particular distro kernels are fixed. 
            yujian Jian Yu added a comment -

            I'm not totally sure what is the oldest version affected

            Even on RHEL 7.9 (with kernel 3.10.0-1160), after changing ln from version 8.22 to 8.32, the "ln /mnt/lustre/tfile /mnt/lustre/tdir/" command also failed with -ENOENT.
            This means the kernel issue exists in all of the vendor kernels we support now.

            yujian Jian Yu added a comment - I'm not totally sure what is the oldest version affected Even on RHEL 7.9 (with kernel 3.10.0-1160), after changing ln from version 8.22 to 8.32, the " ln /mnt/lustre/tfile /mnt/lustre/tdir/ " command also failed with -ENOENT . This means the kernel issue exists in all of the vendor kernels we support now.
            neilb Neil Brown added a comment -

            > Would it work to change ll_lookup_nd() to skip the optimization if flags contains LOOKUP_CREATE | LOOKUP_EXCL and LOOKUP_CREATE + !LOOKUP_OPEN + !LOOKUP_EXCL still use the optimization?

            I don't think ->lookup is ever called with LOOKUP_CREATE but not LOOKUP_EXCL.  And if it is, it is every likely to include LOOKUP_OPEN.  Most syscalls that create names give an error (EEXIST) if the name exists.  The only exception is open(O_CREATE).

            Do we really need to fix this is lustre?  It is clearly not a lustre bug.  If we want test_55d to succeed more reliably we can just change the ln command to "ln $DIR2/$tdir/f1 $DIR2/$tdir/$tdir/f1" as Andreas suggested.  That performs exactly the same effective test, but avoids the kernel bug.

             

            neilb Neil Brown added a comment - > Would it work to change ll_lookup_nd() to skip the optimization if flags contains LOOKUP_CREATE | LOOKUP_EXCL and LOOKUP_CREATE + !LOOKUP_OPEN + !LOOKUP_EXCL still use the optimization? I don't think ->lookup is ever called with LOOKUP_CREATE but not LOOKUP_EXCL.  And if it is, it is every likely to include LOOKUP_OPEN.  Most syscalls that create names give an error (EEXIST) if the name exists.  The only exception is open(O_CREATE). Do we really need to fix this is lustre?  It is clearly not a lustre bug.  If we want test_55d to succeed more reliably we can just change the ln command to " ln $DIR2/$tdir/f1 $DIR2/$tdir/$tdir/f1 " as Andreas suggested.  That performs exactly the same effective test, but avoids the kernel bug.  

            This workaround should go under an #ifdef so that the optimization is only disabled for kernels that are affected by it. It definitely is not needed for kernels newer than 5.18, but I'm not totally sure what is the oldest version affected, or if there is a #define we can check that was introduced in the original patch. This is also complicated by vendor backports, so there may need to be both a vanilla kernel version check and a vendor version check. 

            adilger Andreas Dilger added a comment - This workaround should go under an #ifdef so that the optimization is only disabled for kernels that are affected by it. It definitely is not needed for kernels newer than 5.18, but I'm not totally sure what is the oldest version affected, or if there is a #define we can check that was introduced in the original patch. This is also complicated by vendor backports, so there may need to be both a vanilla kernel version check and a vendor version check. 
            bobijam Zhenyu Xu added a comment -

            Would it work to change ll_lookup_nd() to skip the optimization if flags contains LOOKUP_CREATE | LOOKUP_EXCL and LOOKUP_CREATE + !LOOKUP_OPEN + !LOOKUP_EXCL still use the optimization?

            bobijam Zhenyu Xu added a comment - Would it work to change ll_lookup_nd() to skip the optimization if flags contains LOOKUP_CREATE | LOOKUP_EXCL and LOOKUP_CREATE + !LOOKUP_OPEN + !LOOKUP_EXCL still use the optimization?
            yujian Jian Yu added a comment -

            The only way I can think of to work around the problem in lustre is to skip the optimization in ll_lookup_nd() for a CREATE that isn't an OPEN.

            The optimization was added by https://review.whamcloud.com/8257 ("LU-4185 llite: Revise create with no open optimization"):

            commit a2d5b2e83c0a512a3ea59698e8481621ab5856c2
            Author:     Bobi Jam <bobijam@whamcloud.com>
            AuthorDate: Wed Nov 13 15:56:25 2013 +0800
            Commit:     Oleg Drokin <green@whamcloud.com>
            CommitDate: Wed Oct 5 03:51:18 2016 +0000
            
                LU-4185 llite: Revise create with no open optimization
                
                Currently ll_lookup_nd just returns a negative when we are trying
                to create something with no open (read mkdir). This is all fine most
                of the cases, except if the directory where we are trying to do is
                not writeable by us. In that case vfs_create would return EPERM
                seeing as how a negative dentry means the create cannot proceed.
                But in reality if there is an existing name there that we just did
                not have cached, the proper return is EEXIST that could only happen
                if we did do the lookup.
                
                So amend the optimization to only take place if the directory is
                writeable by us, otherwise do the full lookup.
            
            yujian Jian Yu added a comment - The only way I can think of to work around the problem in lustre is to skip the optimization in ll_lookup_nd() for a CREATE that isn't an OPEN. The optimization was added by https://review.whamcloud.com/8257 (" LU-4185 llite: Revise create with no open optimization"): commit a2d5b2e83c0a512a3ea59698e8481621ab5856c2 Author: Bobi Jam <bobijam@whamcloud.com> AuthorDate: Wed Nov 13 15:56:25 2013 +0800 Commit: Oleg Drokin <green@whamcloud.com> CommitDate: Wed Oct 5 03:51:18 2016 +0000 LU-4185 llite: Revise create with no open optimization Currently ll_lookup_nd just returns a negative when we are trying to create something with no open (read mkdir). This is all fine most of the cases, except if the directory where we are trying to do is not writeable by us. In that case vfs_create would return EPERM seeing as how a negative dentry means the create cannot proceed. But in reality if there is an existing name there that we just did not have cached, the proper return is EEXIST that could only happen if we did do the lookup. So amend the optimization to only take place if the directory is writeable by us, otherwise do the full lookup.
            yujian Jian Yu added a comment -

            Thank you Andreas and Neil for the detailed analysis.

            It would be interesting to see what the test does with "ln $DIR2/$tdir/f1 $DIR2/$tdir/$tdir/f1" (i.e. with the target filename specified)?

            Test passed on SLES 15 SP4 client (with kernel 5.14.21-150400.24.28-default):

            == sanityn test 55d: rename file vs link ============================================================= 05:52:19 (1678254739)
            CMD: trevis-86vm8 /usr/sbin/lctl set_param fail_loc=0x155
            fail_loc=0x155
            mv: '/mnt/lustre/d55d.sanityn/f1' and '/mnt/lustre/d55d.sanityn/d55d.sanityn/f1' are the same file
            Resetting fail_loc on all nodes...CMD: trevis-86vm7.trevis.whamcloud.com,trevis-86vm8,trevis-87vm7 lctl set_param -n fail_loc=0             fail_val=0 2>/dev/null
            done.
            CMD: trevis-86vm7.trevis.whamcloud.com /usr/sbin/lctl get_param catastrophe 2>&1
            CMD: trevis-86vm8 /usr/sbin/lctl get_param catastrophe 2>&1
            CMD: trevis-87vm7 /usr/sbin/lctl get_param catastrophe 2>&1
            CMD: trevis-86vm7.trevis.whamcloud.com,trevis-86vm8,trevis-87vm7 dmesg
            PASS 55d (9s)
            
            yujian Jian Yu added a comment - Thank you Andreas and Neil for the detailed analysis. It would be interesting to see what the test does with "ln $DIR2/$tdir/f1 $DIR2/$tdir/$tdir/f1" (i.e. with the target filename specified)? Test passed on SLES 15 SP4 client (with kernel 5.14.21-150400.24.28-default): == sanityn test 55d: rename file vs link ============================================================= 05:52:19 (1678254739) CMD: trevis-86vm8 /usr/sbin/lctl set_param fail_loc=0x155 fail_loc=0x155 mv: '/mnt/lustre/d55d.sanityn/f1' and '/mnt/lustre/d55d.sanityn/d55d.sanityn/f1' are the same file Resetting fail_loc on all nodes...CMD: trevis-86vm7.trevis.whamcloud.com,trevis-86vm8,trevis-87vm7 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null done. CMD: trevis-86vm7.trevis.whamcloud.com /usr/sbin/lctl get_param catastrophe 2>&1 CMD: trevis-86vm8 /usr/sbin/lctl get_param catastrophe 2>&1 CMD: trevis-87vm7 /usr/sbin/lctl get_param catastrophe 2>&1 CMD: trevis-86vm7.trevis.whamcloud.com,trevis-86vm8,trevis-87vm7 dmesg PASS 55d (9s)

            People

              yujian Jian Yu
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: