Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.2
    • Labels: None
    • Environment: RHEL7.7 as well as other minor versions of RHEL7 on x86_64.
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      We have been seeing getcwd() return ENOENT on directories that are, in
      fact, always there. We can reliably reproduce this problem with the
      attached test-getcwd.c code on Lustre Server 2.12.2 and Lustre Client
      2.12.3 on RHEL7.7 as well as many other Lustre version and RHEL7
      version combinations.

      We see reports in LU-9735 about RHEL7 clients getting an ENOENT return
      from getcwd(), but I don't understand if a solution is in the works or
      not. We are also not sure if this is a Lustre problem, an RHEL kernel
      problem, or both.

      The LD_PRELOAD workaround from LU-9735 is working for us, but I am
      wondering if there is a proper solution pending. Is there anything we
      can do to help?
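
The shim from LU-9735 is not reproduced here. Assuming the workaround works by retrying getcwd() when it fails with ENOENT, the core idea can be sketched as below. The helper name getcwd_retry and its max_tries parameter are hypothetical and only illustrate the retry logic; this is not the code from LU-9735.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from LU-9735): call getcwd() and
 * retry a bounded number of times when it fails with ENOENT, on the
 * assumption that the failure is transient.
 */
static char *getcwd_retry(char *buf, size_t size, int max_tries)
{
    char *p = NULL;
    int i;

    if (max_tries < 1)
        max_tries = 1;          /* always try at least once */

    for (i = 0; i < max_tries; i++) {
        errno = 0;
        p = getcwd(buf, size);
        if (p != NULL || errno != ENOENT)
            break;              /* success, or an error retrying won't fix */
    }
    return p;                   /* NULL with errno set if every try failed */
}
```

An LD_PRELOAD shim would presumably wrap the same loop around the real getcwd(), looked up with dlsym(RTLD_NEXT, "getcwd"), so that unmodified programs pick it up.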

          Activity

            [LU-12997] getcwd() returns ENOENT on RHEL7
            simmonsja James A Simmons added a comment - edited

            Sigh. Red Hat claimed this was fixed. It's going to take some push to get them to resolve this. I don't have the power to resolve this. Someone with greater influence with Red Hat will have to discuss a fix.

            krowe K. Scott Rowe added a comment -

            Perhaps I wasn't clear enough. This is still a problem. I can still reproduce this error with RHEL7.8 using the test-getcwd.c program above.

            simmonsja James A Simmons added a comment -

            RHEL7.8 contains a fix, so this can be closed. If people encounter this issue, please move to RHEL7.8.
            krowe K. Scott Rowe added a comment - edited

            We installed a machine with RHEL-7.8 using kernel 3.10.0-1127.el7.x86_64. I do see a bug fix for a getcwd problem in the kernel changelog:

            $ rpm -qi --changelog kernel|grep getcwd
            - [fs] vfs: close race between getcwd() and d_move() (Miklos Szeredi) [1631631]

            However, 1631631 is a different bug ID than the 1811124 that James A Simmons reported. And I can still reproduce the problem on our Lustre-2.10.8 filesystem using the 2.12.4 client on RHEL-7.8:

            $ ./test-getcwd /lustre/aoc/sciops/krowe/tmp
            test-getcwd: test-getcwd.c:44: main: Assertion `rc == 0' failed.
            Aborted

            simmonsja James A Simmons added a comment -

            I was told by Red Hat that a fix landed in RHEL7.8.

            krowe K. Scott Rowe added a comment -

            Thanks for continuing to work on this. I have some good news. We upgraded our servers from Lustre-2.5.5 to Lustre-2.10.8 on Feb. 20, 2020, and the problem has been greatly reduced.

            The test program (test-getcwd.c) still fails when run on our Lustre filesystem, but the number of times a user has run into this getcwd() problem with other code since we upgraded our Lustre servers has dropped to almost zero.

            simmonsja James A Simmons added a comment - I submitted a bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1811124
            neilb Neil Brown added a comment -

            This sounds like the bug fixed upstream by commit 61647823aa92 ("VFS: close race between getcwd() and d_move()"), fixed in v4.16.

            Probably the best approach for Lustre supporting older kernels is to copy d_drop() from a newer kernel into libcfs and use that instead of the exported d_drop().

            simmonsja James A Simmons added a comment -

            This is a race in ll_splice_alias() due to the use of d_move() when the inode is for a directory. There are fixes for this, but they change the way Lustre handles its dcache, which causes other types of breakage. I tried some ideas to fix this, but it's a work in progress. Basically, the bug is that $MOUNT/.lustre/fid/$fid_for_mount means that dentry $fid_for_mount == dentry $MOUNT, which causes a circular loop that crashes the node. I might be able to handle this special case using d_real(), which is in newer kernels.

            People

              wc-triage WC Triage
              krowe K. Scott Rowe
              Votes: 0
              Watchers: 9