Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.12.5, Lustre 2.15.1
    • Labels: None
    • Environment: Affected Client OSes: CentOS 7.8.2003, Rocky Linux release 9.1
      Kernels: 5.14.0-162.12.1.el9_1.0.2.x86_64, 3.10.0-1127.8.2.el7.x86_64
    • Severity: 3
    • Rank: 9223372036854775807

    Description

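      The following sequence fails on some clients but not others. Running ln -svf immediately after touch test reports "cannot overwrite directory", while repeating the ln alone (or running ls -l first) succeeds: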

      sesser@hercules-login-1 sesser$ touch a; mkdir test; touch test; ln -svf $(pwd)/a test/
      ln: test/: cannot overwrite directory
      sesser@hercules-login-1 sesser$ ln -svf $(pwd)/a test/
      'test/a' -> '/work2/hpc/users/sesser/a'
      sesser@hercules-login-1 sesser$ ln -svf $(pwd)/a test/
      'test/a' -> '/work2/hpc/users/sesser/a'
      sesser@hercules-login-1 sesser$ touch test; ln -svf $(pwd)/a test/
      ln: test/: cannot overwrite directory
      sesser@hercules-login-1 sesser$ touch test; ln -svf $(pwd)/a test/
      ln: test/: cannot overwrite directory
      sesser@hercules-login-1 sesser$ touch test; ls -l; ln -svf $(pwd)/a test/
      total 16
      -rw-r----- 1 sesser admin 0 Jan 5 16:48 a
      drwxr-x--- 2 sesser admin 16384 Jan 5 16:48 test
      'test/a' -> '/work2/hpc/users/sesser/a'

      Running the failing command under strace produces the following output:

      touch test; strace ln -svf $(pwd)/a test/

      symlinkat("/work2/hpc/users/sesser/a", AT_FDCWD, "test/") = -1 ENOENT (No such file or directory)
      newfstatat(AT_FDCWD, "test/", {st_mode=S_IFDIR|0750, st_size=16384, ...}, AT_SYMLINK_NOFOLLOW) = 0
      openat(AT_FDCWD, "/usr/share/locale/C.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
      openat(AT_FDCWD, "/usr/share/locale/C.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
      openat(AT_FDCWD, "/usr/share/locale/C/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
      write(2, "ln: ", 4ln: ) = 4
      write(2, "test/: cannot overwrite director"..., 33test/: cannot overwrite directory) = 33
      write(2, "\n", 1
      ) = 1
      lseek(0, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
      close(0) = 0
      close(1) = 0
      close(2) = 0
      exit_group(1) = ?
      +++ exited with 1 +++

      The problem is vendor-agnostic: we tested on another system from a different vendor and got the same results. Some clients do behave as expected, though.
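
Because the failure depends on the client and on what ran just before the ln, one way to compare clients is to loop the sequence in a scratch directory and count how often the first ln fails. The following is a minimal sketch, not part of the original report: the function name repro_symlink, the scratch-directory naming, and the iteration count are all hypothetical. The base directory should point at the Lustre mount under test; it defaults to TMPDIR here only so that the script runs anywhere.

```shell
#!/bin/sh
# Sketch only: repro_symlink and its arguments are hypothetical helpers,
# not part of Lustre or coreutils.
# Usage: repro_symlink BASE_DIR [ITERATIONS]
# Prints the number of iterations in which the first "ln -sf" failed.
repro_symlink() {
    base=$1
    iterations=${2:-10}
    failures=0
    i=0
    workdir=$(mktemp -d "$base/symlink-repro.XXXXXX") || return 2
    while [ "$i" -lt "$iterations" ]; do
        rm -rf "$workdir/a" "$workdir/test"
        touch "$workdir/a"
        mkdir "$workdir/test"
        # Same step as in the report: touch the directory right before
        # creating the link inside it.
        touch "$workdir/test"
        if ! (cd "$workdir" && ln -sf "$workdir/a" test/) 2>/dev/null; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    rm -rf "$workdir"
    echo "$failures"
}

# Point the first argument at a directory on the Lustre mount under test.
failed=$(repro_symlink "${TMPDIR:-/tmp}" 5)
echo "first ln failed in $failed of 5 iterations"
```

On a local filesystem the count is expected to stay at zero, which matches the local-filesystem run reported in the comments; a nonzero count on a Lustre mount would correspond to the behaviour described above.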

      Client Details that Work Correctly:
      Client Type 1:

      • OS: CentOS 7.6.1810
      • Kernel: 3.10.0-957.27.2.el7.x86_64
      • Lustre Version: 2.12.8_ddn9
      • Mount Options: defaults, _netdev, user_xattr, flock

      Client Type 2:

      • OS: CentOS 7.8.2003
      • Kernel: 3.10.0-1127.8.2.el7.x86_64
      • Lustre Version: 2.12.5
      • Mount Options: defaults, _netdev, user_xattr, flock

      Client Details that do not Work Correctly:
      Client Type 3:

      • OS: CentOS 7.8.2003
      • Kernel: 3.10.0-1127.8.2.el7.x86_64
      • Lustre Version: 2.15.6
      • Mount Options: defaults, _netdev, user_xattr, flock

      Client Type 4:

      • OS: Rocky 9.1
      • Kernel: 5.14.0-162.12.1.el9_1.0.2.x86_64
      • Lustre Version: 12.15.1
      • Mount Options: defaults, _netdev, user_xattr, flock

      All clients were built using the following commands:

      ./configure --disable-server --enable-quota --enable-mpitests=no
      make
      make check
      make rpms
      yum -y install *.rpm

        Activity

          [LU-17660] Symlink Bug with Lustre Client

          adilger Andreas Dilger added a comment -

          John, it would be useful if you submitted a test patch yourself, both because we don't have anyone available right now that can work on this issue, and also because it is good to have more external contributors to the project. Having a good test case to reproduce the problem is the first step in fixing the issue.

          It looks like sanity.sh test_17* have the symlink tests, and the next one would be test_17p.
          jbradley John Bradley added a comment -

          Is this something I need to do, or a general comment?


          adilger Andreas Dilger added a comment -

          It would be useful to submit a patch with this in a new test case in lustre/tests/sanity.sh. It could then be run through autotest on all the different distros to see which ones are failing, and to capture debug logs from the client(s) to see what is going wrong. That will also provide a regression test for the eventual code fix.
          jbradley John Bradley added a comment -

          This is tested against a local filesystem on the system above:

          [root@hercules-01-06 ~]# pwd
          /root
          [root@hercules-01-06 ~]# rm -rf a test; touch a; mkdir test; touch test; ln -svf $(pwd)/a test/; ln -svf $(pwd)/a test/
          'test/a' -> '/root/a'
          'test/a' -> '/root/a'
          [root@hercules-01-06 ~]# strace ln -svf $(pwd)/a test/                                                              
          execve("/usr/bin/ln", ["ln", "-svf", "/root/a", "test/"], 0x7fff15494948 /* 45 vars */) = 0
          :
          symlinkat("/root/a", AT_FDCWD, "test/") = -1 EEXIST (File exists)
          openat(AT_FDCWD, "test/", O_RDONLY|O_PATH|O_DIRECTORY) = 3
          symlinkat("/root/a", 3, "a")            = -1 EEXIST (File exists)
          newfstatat(3, "a", {st_mode=S_IFLNK|0777, st_size=7, ...}, AT_SYMLINK_NOFOLLOW) = 0
          newfstatat(AT_FDCWD, "/root/a", {st_mode=S_IFREG|0644, st_size=0, ...}, 0) = 0
          openat(AT_FDCWD, "/dev/urandom", O_RDONLY) = 4
          read(4, "C\36\31\24v\252", 6)           = 6
          close(4)                                = 0
          getpid()                                = 4044
          getppid()                               = 4041
          getuid()                                = 0
          getgid()                                = 0
          symlinkat("/root/a", 3, "CuvA6JTS")     = 0
          renameat(3, "CuvA6JTS", 3, "a")         = 0
          newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x2), ...}, AT_EMPTY_PATH) = 0
          write(1, "'test/a' -> '/root/a'\n", 22'test/a' -> '/root/a'
          ) = 22
          lseek(0, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
          close(0)                                = 0
          close(1)                                = 0
          close(2)                                = 0
          exit_group(0)                           = ?
          +++ exited with 0 +++
          
          adilger Andreas Dilger added a comment - - edited

          John, I edited your strace output to remove all the library loading, which was distracting from the core filesystem operations that are causing the issue.
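
Much of that trimming can also be done at capture time. The following is a sketch that assumes strace is installed and that tracing is permitted on the client; the wrapper name trace_ln and the scratch paths are hypothetical. The -e trace=file filter keeps only system calls that take a file name (so the loader's file opens still appear, but the rest of the noise is dropped), and -o writes the trace to a file so it does not interleave with the messages ln prints.

```shell
#!/bin/sh
# Sketch only: trace_ln is a hypothetical wrapper, not an existing tool.
# Usage: trace_ln TARGET DEST LOGFILE
trace_ln() {
    if command -v strace >/dev/null 2>&1; then
        # -e trace=file: only system calls that take a file name.
        # -o LOGFILE: keep the trace separate from ln's own output.
        strace -e trace=file -o "$3" ln -svf "$1" "$2"
    else
        echo "strace not found; running ln without tracing" >&2
        ln -svf "$1" "$2"
    fi
}

# Recreate the sequence from the description in a scratch directory.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/ln-trace.XXXXXX") || exit 1
cd "$scratch" || exit 1
touch a
mkdir test
touch test
if trace_ln "$scratch/a" test/ "$scratch/ln.trace"; then
    echo "ln succeeded"
else
    echo "ln (or strace) failed; check $scratch/ln.trace if it exists"
fi
```

To capture the trace on an affected client, the scratch directory would need to be created on the Lustre mount instead of under TMPDIR.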

          What is strange is that the initial failure seems to be doing the right thing, since it isn't possible to symlink on top of an existing directory. On a working system (el8.8), it looks like ln checks for the existence of the target first:

          $ cat /etc/redhat-release 
          AlmaLinux release 8.8 (Sapphire Caracal)
          $ uname -r
          4.18.0-477.21.1.el8_8.x86_64
          stat("test", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
          lstat("test/a", 0x7fff5bcc9af0)         = -1 ENOENT (No such file or directory)
          symlinkat("/home/adilger/tmp/a", AT_FDCWD, "test/a") = 0
          

          but I don't see this in your strace (and I double checked that I didn't delete it).

          It is strange that the symlinkat() call returns -ENOENT instead of -EISDIR or -EEXIST or similar. Can you also provide an (abbreviated) strace from a non-lustre filesystem when it is working properly?

          I suspect this is new behavior from ln/glibc using symlinkat() instead of doing the stat() first, and somehow the lookup of test/ is not working the first time until it is in the client cache.
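
For comparison, the older call order seen in the el8 trace (check the destination first, then create the link at its full path) can be sketched in shell. This only illustrates the ordering; it is not how ln is implemented, and whether it avoids the failure on an affected Lustre client is untested. The helper name link_into_dir and the scratch paths are hypothetical.

```shell
#!/bin/sh
# Sketch only: link_into_dir is a hypothetical helper illustrating the
# "check the destination first, then symlink to the full path" order
# from the el8 trace.
# Usage: link_into_dir TARGET DEST
link_into_dir() {
    target=$1
    dest=$2
    if [ -d "$dest" ]; then
        # Destination is a directory: name the link after the target's
        # basename inside it, so the path handed to ln is "test/a"
        # rather than "test/".
        dest=${dest%/}/$(basename "$target")
    fi
    # -f replaces an existing link of the same name; -n keeps ln from
    # following an existing link that points at a directory.
    ln -sfn "$target" "$dest"
}

scratch=$(mktemp -d "${TMPDIR:-/tmp}/ln-order.XXXXXX") || exit 1
touch "$scratch/a"
mkdir "$scratch/test"
link_into_dir "$scratch/a" "$scratch/test/"
ls -l "$scratch/test"
```

With this order ln never receives the bare directory path, so the symlinkat() call on "test/" that returns ENOENT in the failing trace is not issued by this sketch.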

          jbradley John Bradley added a comment - - edited

          We are still working on a test for 2.12.9; however, we have tested 2.15.4.

          [root@hercules-01-06 jbradley]# rm -rf a test; touch a; mkdir test; touch test; ln -svf $(pwd)/a test/; ln -svf $(pwd)/a test/
          ln: test/: cannot overwrite directory
          'test/a' -> '/work/hpc/users/jbradley/a'
          [root@hercules-01-06 jbradley]# cat /sys/fs/lustre/version 
          2.15.4
          [root@hercules-01-06 jbradley]# cat /etc/redhat-release 
          Rocky Linux release 9.1 (Blue Onyx)
          [root@hercules-01-06 jbradley]# uname -r
          5.14.0-162.6.1.el9_1.0.1.x86_64
          [root@hercules-01-06 jbradley]# strace ln -svf $(pwd)/a test/
          execve("/usr/bin/ln", ["ln", "-svf", "/work/hpc/users/jbradley/a", "test/"], 0x7ffe6395d378 /* 45 vars */) = 0
          :
          symlinkat("/work/hpc/users/jbradley/a", AT_FDCWD, "test/") = -1 ENOENT (No such file or directory)
          newfstatat(AT_FDCWD, "test/", {st_mode=S_IFDIR|0755, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
          openat(AT_FDCWD, "/usr/share/locale/C.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
          openat(AT_FDCWD, "/usr/share/locale/C.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
          openat(AT_FDCWD, "/usr/share/locale/C/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
          write(2, "ln: ", 4ln: )                     = 4
          write(2, "test/: cannot overwrite director"..., 33test/: cannot overwrite directory) = 33
          +++ exited with 1 +++
          
          jbradley John Bradley added a comment -

          Client Details that work Correctly:
          Client Type 1:
          OS: CentOS 7.6.1810
          Kernel: 3.10.0-957.27.2.el7.x86_64
          Lustre Version: 2.12.8_ddn9
          Mount Options: defaults, _netdev, user_xattr, flock

          Client Type 2:
          OS: CentOS 7.8.2003
          Kernel: 3.10.0-1127.8.2.el7.x86_64
          Lustre Version: 2.12.5
          Mount Options: defaults, _netdev, user_xattr, flock

          Client Details that do not work Correctly:
          Client Type 3:
          OS: Rocky 9.1
          Kernel: 5.14.0-162.12.1.el9_1.0.2.x86_64
          Lustre Version: 2.15.56 (HPE Version 2.15.B12, parity with 2.15.3)
          Mount Options: defaults, _netdev, user_xattr, flock

          Client Type 4:
          OS: Rocky 9.1
          Kernel: 5.14.0-162.12.1.el9_1.0.2.x86_64
          Lustre Version: 2.15.1
          Mount Options: defaults, _netdev, user_xattr, flock

          jbradley John Bradley added a comment -

          I realize that the 2.15.56 release may be confusing - let me track down where this release came from exactly.

          jbradley John Bradley added a comment -

          Sorry, I mistyped that 2.15.6 release - it is supposed to be the 2.15.56 release, and the other is 2.15.1, correct.

          We will be testing the latest releases soon to see if this issue has been fixed.


          adilger Andreas Dilger added a comment -

          Can you please confirm the versions at the end? There is no 2.15.6 release; maybe 2.12.6, but I'm not sure. Also, 12.15.1 doesn't exist; you probably mean 2.15.1?
          More importantly, does this work properly with the latest 2.12.9 and 2.15.4 clients? If yes, then it has already been fixed.

          People

            Assignee: flei Feng Lei
            Reporter: jbradley John Bradley
            Votes: 0
            Watchers: 6
