Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.12.5, Lustre 2.15.1
    • Labels: None
    • Environment: Affected Client OSes: CentOS 7.8.2003, Rocky Linux release 9.1
      Kernels: 5.14.0-162.12.1.el9_1.0.2.x86_64, 3.10.0-1127.8.2.el7.x86_64
    • Severity: 3

    Description

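      The following sequence fails on some clients but not on others. After touching an existing directory, creating a symlink inside it with ln -svf fails with "cannot overwrite directory":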

      sesser@hercules-login-1 sesser$touch a; mkdir test; touch test; ln -svf $(pwd)/a test/
      ln: test/: cannot overwrite directory
      sesser@hercules-login-1 sesser$ln -svf $(pwd)/a test/
      'test/a' -> '/work2/hpc/users/sesser/a'
      sesser@hercules-login-1 sesser$ln -svf $(pwd)/a test/
      'test/a' -> '/work2/hpc/users/sesser/a'
      sesser@hercules-login-1 sesser$touch test; ln -svf $(pwd)/a test/
      ln: test/: cannot overwrite directory
      sesser@hercules-login-1 sesser$touch test; ln -svf $(pwd)/a test/
      ln: test/: cannot overwrite directory
      sesser@hercules-login-1 sesser$touch test; ls -l; ln -svf $(pwd)/a test/
      total 16
      -rw-r----- 1 sesser admin 0 Jan 5 16:48 a
      drwxr-x--- 2 sesser admin 16384 Jan 5 16:48 test
      'test/a' -> '/work2/hpc/users/sesser/a'

      Running the failing command under strace gives the following output:

      touch test; strace ln -svf $(pwd)/a test/

      symlinkat("/work2/hpc/users/sesser/a", AT_FDCWD, "test/") = -1 ENOENT (No such file or directory)
      newfstatat(AT_FDCWD, "test/", {st_mode=S_IFDIR|0750, st_size=16384, ...}, AT_SYMLINK_NOFOLLOW) = 0
      openat(AT_FDCWD, "/usr/share/locale/C.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
      openat(AT_FDCWD, "/usr/share/locale/C.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
      openat(AT_FDCWD, "/usr/share/locale/C/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
      write(2, "ln: ", 4ln: ) = 4
      write(2, "test/: cannot overwrite director"..., 33test/: cannot overwrite directory) = 33
      write(2, "\n", 1
      ) = 1
      lseek(0, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
      close(0) = 0
      close(1) = 0
      close(2) = 0
      exit_group(1) = ?
      +++ exited with 1 +++
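
When comparing clients, it can help to pull just the first symlinkat() result for the trailing-slash path out of a saved strace log. The following is a minimal sketch, not part of any existing tool: the function name classify_symlinkat is made up for illustration, and it assumes the log has the plain one-call-per-line layout shown above (no pid prefixes).

```shell
#!/bin/bash
# Hypothetical helper: classify the first symlinkat() call whose link path
# ends in '/'. Reads strace output on stdin.
classify_symlinkat() {
    local line
    # Matches lines such as:
    #   symlinkat("/x/a", AT_FDCWD, "test/") = -1 ENOENT (No such file ...)
    line=$(grep -m1 -E '^[[:space:]]*symlinkat\("[^"]*", [A-Z_0-9]+, "[^"]*/"\)') || {
        echo "none"
        return 2
    }
    case "$line" in
        *"= -1 ENOENT"*) echo "ENOENT" ;;   # result seen on the failing clients
        *"= -1 EEXIST"*) echo "EEXIST" ;;   # result seen in the local-filesystem trace
        *)               echo "other" ;;
    esac
}
```

Usage would be along the lines of `strace -o ln.trace ln -svf "$(pwd)/a" test/; classify_symlinkat < ln.trace`, where ln.trace is an arbitrary file name.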

      This problem is vendor-agnostic: we tested on another system from a different vendor and got the same results. Some clients do behave as expected, though.

      Client Details that Work Correctly:
      Client Type 1:

      • OS: CentOS 7.6.1810
      • Kernel: 3.10.0-957.27.2.el7.x86_64
      • Lustre Version: 2.12.8_ddn9
      • Mount Options: defaults, _netdev, user_xattr, flock

      Client Type 2:

      • OS: CentOS 7.8.2003
      • Kernel: 3.10.0-1127.8.2.el7.x86_64
      • Lustre Version: 2.12.5
      • Mount Options: defaults, _netdev, user_xattr, flock

      Client Details that do not Work Correctly:
      Client Type 3:

      • OS: CentOS 7.8.2003
      • Kernel: 3.10.0-1127.8.2.el7.x86_64
      • Lustre Version: 2.15.6
      • Mount Options: defaults, _netdev, user_xattr, flock

      Client Type 4:

      • OS: Rocky Linux 9.1
      • Kernel: 5.14.0-162.12.1.el9_1.0.2.x86_64
      • Lustre Version: 2.15.1
      • Mount Options: defaults, _netdev, user_xattr, flock

      All clients were built using the following commands:

      ./configure --disable-server --enable-quota --enable-mpitests=no
      make
      make check
      make rpms
      yum -y install *.rpm

        Activity

          [LU-17660] Symlink Bug with Lustre Client
          jbradley John Bradley added a comment -

          We do not have a contract with RHEL - I'll see if we have one with CIQ, as we use Rocky in our environment.

          adilger Andreas Dilger added a comment -

          jbradley, do you have a support contract with RHEL? If yes, it would be useful to raise a ticket in their bugzilla about this, referencing the patch commit v6.9-rc4-39-gbb32cded3be2, to see if they will backport the fix into their kernel.

          adilger Andreas Dilger added a comment -

          It looks like that commit was landed in kernel v6.9-rc4-39-gbb32cded3be2.
          flei Feng Lei added a comment -

          The bug appears only on el9.x series clients.
          flei Feng Lei added a comment - - edited

          To work around this, stat the target directory instead of touching it, then create the link under the directory:

          # touch a
          # mkdir test
          # stat test
          # ln -sf a test/
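
The workaround above could be wrapped in a small shell function so scripts do not have to remember the extra stat. This is only a sketch: the name ln_into_dir is hypothetical, and the idea that stat helps because it looks the directory up before ln runs is an inference from the kernel commit message quoted in this ticket, not something verified here.

```shell
#!/bin/bash
# Hypothetical wrapper around the workaround: stat the directory first,
# then create (or replace) the symlink inside it.
ln_into_dir() {
    local target=$1 dir=$2
    # Plain lookup of the directory before ln is invoked.
    stat -- "$dir" >/dev/null || return 1
    # Trailing slash kept on purpose: it is the form that fails without
    # the preceding stat on the affected clients.
    ln -sf -- "$target" "$dir/"
}
```

For example, `ln_into_dir "$(pwd)/a" test` corresponds to the last two commands of the workaround.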
          
          flei Feng Lei added a comment -

          This appears to be a kernel bug that was fixed later in the kernel source:

          commit b3d4650d82c71b9c9a8184de9e8bb656012b289e
          Author: NeilBrown <neilb@suse.de>
          Date:   Thu Apr 14 13:57:35 2022 +1000
              VFS: filename_create(): fix incorrect intent.
              
              When asked to create a path ending '/', but which is not to be a
              directory (LOOKUP_DIRECTORY not set), filename_create() will never try
              to create the file.  If it doesn't exist, -ENOENT is reported.
              
              However, it still passes LOOKUP_CREATE|LOOKUP_EXCL to the filesystems
              ->lookup() function, even though there is no intent to create.  This is
              misleading and can cause incorrect behaviour.
              
              If you try
              
                 ln -s foo /path/dir/
              
              where 'dir' is a directory on an NFS filesystem which is not currently
              known in the dcache, this will fail with ENOENT.
              
              But as the name is not in the dcache, nfs_lookup gets called with
              LOOKUP_CREATE|LOOKUP_EXCL and so it returns NULL without performing any
              lookup, with the expectation that a subsequent call to create the target
              will be made, and the lookup can be combined with the creation.  In the
              case with a trailing '/' and no LOOKUP_DIRECTORY, that call is never
              made.  Instead filename_create() sees that the dentry is not (yet)
              positive and returns -ENOENT - even though the directory actually
              exists.
              
              So only set LOOKUP_CREATE|LOOKUP_EXCL if there really is an intent to
              create, and use the absence of these flags to decide if -ENOENT should
              be returned.
              
              Note that filename_parentat() is only interested in LOOKUP_REVAL, so we
              split that out and store it in 'reval_flag'.  __lookup_hash() then gets
              reval_flag combined with whatever create flags were determined to be
              needed.
           

          gerrit Gerrit Updater added a comment -

          "Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56639
          Subject: LU-17660 tests: test symlink file to existing dir
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 4972941aaeaf495706191290a402a3ba18a3630b

          adilger Andreas Dilger added a comment -

          John, it would be useful if you submitted a test patch yourself, both because we don't have anyone available right now that can work on this issue, and also because it is good to have more external contributors to the project. Having a good test case to reproduce the problem is the first step in fixing the issue.

          It looks like sanity.sh test_17* have the symlink tests, and the next one would be test_17p.
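
As a starting point for such a test case, the reproducer from the description could be turned into a self-checking function roughly like the one below. This is a standalone sketch and not the patch that was eventually uploaded: the function name check_symlink_into_touched_dir is hypothetical, and it deliberately uses only plain shell rather than the sanity.sh framework helpers, so it would need to be adapted to that file's conventions.

```shell
#!/bin/bash
# Standalone sketch of a regression check (hypothetical name).
# Usage: check_symlink_into_touched_dir /path/on/filesystem/under/test
check_symlink_into_touched_dir() {
    local base=$1 work rc=0
    work=$(mktemp -d "$base/symlink-test.XXXXXX") || return 2
    touch "$work/a"
    mkdir "$work/test"
    # Touching the directory right before ln is the step that precedes
    # every failure in the report.
    touch "$work/test"
    if ! ln -sf "$work/a" "$work/test/"; then
        echo "FAIL: ln into existing directory failed" >&2
        rc=1
    elif [ "$(readlink "$work/test/a")" != "$work/a" ]; then
        echo "FAIL: test/a does not point at $work/a" >&2
        rc=1
    fi
    rm -rf "$work"
    return $rc
}
```

Pointing it at a directory on the Lustre mount and at a local directory on the same client gives a quick side-by-side comparison.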
          jbradley John Bradley added a comment -

          Is this something I need to do, or a general comment?

          adilger Andreas Dilger added a comment -

          It would be useful to submit a patch with this in a new test case in lustre/tests/sanity.sh and then it could be run through autotest on all the different distros to see which ones are failing, and capture debug logs from the client(s) to see what is going wrong.  That will also provide a regression test for the eventual code fix.
          jbradley John Bradley added a comment -

          This is tested against a local filesystem on the system above:

          [root@hercules-01-06 ~]# pwd
          /root
          [root@hercules-01-06 ~]# rm -rf a test; touch a; mkdir test; touch test; ln -svf $(pwd)/a test/; ln -svf $(pwd)/a test/
          'test/a' -> '/root/a'
          'test/a' -> '/root/a'
          [root@hercules-01-06 ~]# strace ln -svf $(pwd)/a test/                                                              
          execve("/usr/bin/ln", ["ln", "-svf", "/root/a", "test/"], 0x7fff15494948 /* 45 vars */) = 0
          :
          symlinkat("/root/a", AT_FDCWD, "test/") = -1 EEXIST (File exists)
          openat(AT_FDCWD, "test/", O_RDONLY|O_PATH|O_DIRECTORY) = 3
          symlinkat("/root/a", 3, "a")            = -1 EEXIST (File exists)
          newfstatat(3, "a", {st_mode=S_IFLNK|0777, st_size=7, ...}, AT_SYMLINK_NOFOLLOW) = 0
          newfstatat(AT_FDCWD, "/root/a", {st_mode=S_IFREG|0644, st_size=0, ...}, 0) = 0
          openat(AT_FDCWD, "/dev/urandom", O_RDONLY) = 4
          read(4, "C\36\31\24v\252", 6)           = 6
          close(4)                                = 0
          getpid()                                = 4044
          getppid()                               = 4041
          getuid()                                = 0
          getgid()                                = 0
          symlinkat("/root/a", 3, "CuvA6JTS")     = 0
          renameat(3, "CuvA6JTS", 3, "a")         = 0
          newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x2), ...}, AT_EMPTY_PATH) = 0
          write(1, "'test/a' -> '/root/a'\n", 22'test/a' -> '/root/a'
          ) = 22
          lseek(0, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
          close(0)                                = 0
          close(1)                                = 0
          close(2)                                = 0
          exit_group(0)                           = ?
          +++ exited with 0 +++
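
The successful trace above shows how ln -f replaces an existing link: it creates the symlink under a random temporary name in the directory and then renames it over the final name. A rough shell equivalent of those two steps is sketched below. The helper name replace_symlink is hypothetical, mv -T is assumed to be the GNU coreutils version, and this only mirrors the call sequence visible in the trace rather than ln's actual C implementation.

```shell
#!/bin/bash
# Hypothetical helper: replace dir/name with a symlink to target by linking
# under a temporary name and renaming it into place, mirroring the
# symlinkat() + renameat() pair in the trace.
replace_symlink() {
    local target=$1 dir=$2 name=$3 tmp
    # -u only generates an unused name; nothing is created yet.
    tmp=$(mktemp -u "$dir/.ln.XXXXXX") || return 1
    ln -s -- "$target" "$tmp" || return 1
    # -T treats the destination as a plain name, so an existing symlink
    # (even one pointing at a directory) is replaced, not entered.
    mv -T -- "$tmp" "$dir/$name" || { rm -f -- "$tmp"; return 1; }
}
```

Because the final step is a rename within one directory, readers of dir/name see either the old link or the new one, never a missing entry.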
          

          People

            Assignee: flei Feng Lei
            Reporter: jbradley John Bradley