[LU-12997] getcwd() returns ENOENT on RHEL7 Created: 21/Nov/19  Updated: 13/Jul/20  Resolved: 13/Jul/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: K. Scott Rowe Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

RHEL7.7 as well as other minor versions of RHEL7 on x86_64.


Attachments: File test-getcwd.c    
Issue Links:
Related
is related to LU-9868 dcache/namei fixes for lustre Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have been seeing getcwd() return ENOENT on directories that are, in
fact, always there. We can reliably reproduce this problem with the
attached test-getcwd.c code on Lustre Server 2.12.2 and Lustre Client
2.12.3 on RHEL7.7 as well as many other Lustre version and RHEL7
version combinations.

We see reports in LU-9735 about RHEL7 clients getting an ENOENT return
from getcwd(), but I don't understand if a solution is in the works or
not. We are also not sure if this is a Lustre problem, an RHEL kernel
problem, or both.

The LD_PRELOAD workaround from LU-9735 is working for us, but I am
wondering if there is a proper solution pending. Is there anything we
can do to help?
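
The LU-9735 workaround itself is not shown here. A minimal sketch of how such an LD_PRELOAD shim could work, assuming it simply retries getcwd() when the call fails with ENOENT, follows. The retry count, the BUILD_PRELOAD_SHIM macro, and the helper name getcwd_retry are assumptions for illustration; the actual workaround may differ.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

typedef char *(*getcwd_fn)(char *, size_t);

/*
 * Call fn up to max_tries times, retrying only when it fails with ENOENT.
 * Any other result (success or a different errno) is returned immediately.
 */
char *getcwd_retry(getcwd_fn fn, char *buf, size_t size, int max_tries)
{
    char *ret = NULL;

    assert(fn != NULL && max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        ret = fn(buf, size);
        if (ret != NULL || errno != ENOENT)
            break;
    }
    return ret;
}

#ifdef BUILD_PRELOAD_SHIM
/*
 * Interposed getcwd(): look up the next getcwd in link order (normally
 * the libc one) and wrap it with the retry loop. Ten tries is an
 * arbitrary, assumed limit.
 */
char *getcwd(char *buf, size_t size)
{
    static getcwd_fn real;

    if (real == NULL)
        real = (getcwd_fn)dlsym(RTLD_NEXT, "getcwd");
    return getcwd_retry(real, buf, size, 10);
}
#endif
```

Built as a shared object (for example `gcc -shared -fPIC -DBUILD_PRELOAD_SHIM -o libgetcwd_retry.so shim.c -ldl`, file names hypothetical) and loaded with `LD_PRELOAD=./libgetcwd_retry.so`, a shim of this shape hides transient ENOENT results from applications without fixing the underlying race.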



 Comments   
Comment by James A Simmons [ 22/Nov/19 ]

This is a race in ll_splice_alias() due to the use of d_move() when the inode is for a directory. There are fixes for this, but they change the way Lustre handles its dcache, which causes other types of breakage. I tried some ideas to fix this, but it's a work in progress. Basically the bug is that $MOUNT/.lustre/fid/$fid_for_mount means that dentry $fid_for_mount == dentry $MOUNT, which causes a circular loop that crashes the node. I might be able to handle this special case using d_real(), which is in newer kernels.

Comment by Neil Brown [ 25/Nov/19 ]

This sounds like the bug fixed upstream by

Commit 61647823aa92 ("VFS: close race between getcwd() and d_move()")

Fixed in v4.16

Probably the best approach for Lustre to support older kernels is to copy d_drop() from a newer kernel into libcfs, and use that instead of the exported d_drop().

 

Comment by James A Simmons [ 06/Mar/20 ]

I submitted a bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=1811124

Comment by K. Scott Rowe [ 06/Mar/20 ]

Thanks for continuing to work on this.  I have some good news.  We upgraded our servers from Lustre-2.5.5 to Lustre-2.10.8 on Feb. 20, 2020 and the problem has been greatly reduced.

The test program (test-getcwd.c) still fails when run on our Lustre filesystem but the number of times a user has run into this getcwd() problem with other code since we upgraded our Lustre servers has dropped to almost zero.

 

Comment by James A Simmons [ 11/Mar/20 ]

I was told by Red Hat that a fix has landed in RHEL7.8.

Comment by K. Scott Rowe [ 03/Apr/20 ]

We installed a machine with RHEL-7.8 using kernel 3.10.0-1127.el7.x86_64, and I do see a bug fix for a getcwd problem in the kernel changelog:

$ rpm -qi --changelog kernel|grep getcwd
- [fs] vfs: close race between getcwd() and d_move() (Miklos Szeredi) [1631631]

However, 1631631 is a different bug ID than the 1811124 that James A Simmons reported, and I can still reproduce the problem on our Lustre-2.10.8 filesystem using the 2.12.4 client on RHEL-7.8:

$ ./test-getcwd /lustre/aoc/sciops/krowe/tmp
test-getcwd: test-getcwd.c:44: main: Assertion `rc == 0' failed.
Aborted

Comment by James A Simmons [ 14/Apr/20 ]

RHEL7.8 contains a fix, so this can be closed. If people encounter this issue, please move to RHEL7.8.

Comment by K. Scott Rowe [ 14/Apr/20 ]

Perhaps I wasn't clear enough.

This is still a problem.

I can still reproduce this error with RHEL7.8 using the test-getcwd.c program above.

 

Comment by James A Simmons [ 14/Apr/20 ]

Sigh. Red Hat claimed this was fixed. It's going to take some pushing to get them to resolve this. I don't have the power to resolve this. Someone with greater influence with Red Hat will have to discuss a fix.

Comment by James A Simmons [ 14/Apr/20 ]

Peter, can you take over this issue, since you seem to have better relations with Red Hat to resolve this?

Comment by K. Scott Rowe [ 14/Apr/20 ]

Do you have the ability to test this on an RHEL7.8 host?  It would be good to have a second data point.  I suppose it is possible I am seeing this issue with our RHEL7.8 host for some other reason that I can't think of.

Comment by K. Scott Rowe [ 19/May/20 ]

The kernel was just upgraded on my test RHEL-7.8 machine.  It is now running 3.10.0-1127.8.2.el7.x86_64 and I no longer get getcwd() failures:

$ ./test-getcwd /lustre/aoc/sciops/krowe/tmp
getcwd succeeded

I don't understand why this failed with kernel 3.10.0-1127.el7.x86_64 and works now, but assuming it continues to work after more kernel updates, I would say this problem may be fixed.  Again, if you have the ability to check this yourself, please do.  My environment may be customized in strange ways.

Comment by K. Scott Rowe [ 13/Jul/20 ]

I have since tested a later kernel, 3.10.0-1127.13.1.el7.x86_64, and it also works.  So I think the solution is to upgrade to at least kernel 3.10.0-1127.8.2.el7.x86_64.
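
For sites that want to check clients automatically, the comparison against the first kernel reported working in this ticket (3.10.0-1127.8.2.el7.x86_64) could be sketched as below. This is only an illustration: the function names are hypothetical, and the parser assumes RHEL7-style release strings of the form 3.10.0-N[.N...].el7.arch, returning -1 for anything else.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <sys/utsname.h>

#define EL7_PREFIX "3.10.0-"
#define MAX_FIELDS 4

/*
 * Parse the numeric build fields that follow "3.10.0-", e.g.
 * "3.10.0-1127.8.2.el7.x86_64" -> {1127, 8, 2, 0}.
 * Returns 0 on success, -1 if the string is not a 3.10.0 release.
 */
static int parse_el7_build(const char *release, long fields[MAX_FIELDS])
{
    size_t plen = strlen(EL7_PREFIX);
    const char *p;
    int n = 0;

    assert(release != NULL);
    for (int i = 0; i < MAX_FIELDS; i++)
        fields[i] = 0;
    if (strncmp(release, EL7_PREFIX, plen) != 0)
        return -1;

    p = release + plen;
    while (n < MAX_FIELDS && isdigit((unsigned char)*p)) {
        char *end;

        fields[n++] = strtol(p, &end, 10);
        if (*end != '.')
            break;
        p = end + 1;    /* next field, or a non-numeric part like "el7" */
    }
    return n > 0 ? 0 : -1;
}

/* 1 if release >= minimum, 0 if older, -1 if either cannot be parsed. */
int el7_kernel_at_least(const char *release, const char *minimum)
{
    long a[MAX_FIELDS], b[MAX_FIELDS];

    if (parse_el7_build(release, a) != 0 || parse_el7_build(minimum, b) != 0)
        return -1;
    for (int i = 0; i < MAX_FIELDS; i++) {
        if (a[i] != b[i])
            return a[i] > b[i];
    }
    return 1;
}

/* Same check against the running kernel, as reported by uname(2). */
int running_kernel_at_least(const char *minimum)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return el7_kernel_at_least(u.release, minimum);
}
```

Calling running_kernel_at_least("3.10.0-1127.8.2.el7.x86_64") on a client then distinguishes the kernel that still failed (3.10.0-1127.el7) from the ones reported working above.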

 

This ticket can be closed.  Thanks for your help.

 
