[LU-12997] getcwd() returns ENOENT on RHEL7 - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Not a Bug
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.2
Labels:
None
Environment:
RHEL7.7 as well as other minor versions of RHEL7 on x86_64.

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

We have been seeing getcwd() return ENOENT on directories that are, in
fact, always there. We can reliably reproduce this problem with the
attached test-getcwd.c code on Lustre Server 2.12.2 and Lustre Client
2.12.3 on RHEL7.7 as well as many other Lustre version and RHEL7
version combinations.

We see reports in LU-9735 about RHEL7 clients getting an ENOENT return
from getcwd(), but I don't understand whether a solution is in the works
or not. We are also not sure if this is a Lustre problem, a RHEL kernel
problem, or both.

The LD_PRELOAD workaround from LU-9735 is working for us, but I am
wondering if there is a proper solution pending. Is there anything we
can do to help?
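
The general idea behind such a workaround can be sketched as a retry wrapper. This is not the code from LU-9735; it assumes the ENOENT is transient, so that calling getcwd() again succeeds. The name `getcwd_retry` and the `max_tries` parameter are hypothetical. An actual LD_PRELOAD shim would export this logic under the name getcwd and look up the real libc function itself (for example via dlsym() with RTLD_NEXT); that wiring is left out here.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical wrapper (an assumption, not the LU-9735 shim): call
 * getcwd() and retry while it fails with ENOENT, up to max_tries calls
 * in total. Any other error, or success, is returned immediately.
 * At least one attempt is always made.
 */
char *getcwd_retry(char *buf, size_t size, int max_tries)
{
    char *cwd;
    int attempt = 0;

    do {
        cwd = getcwd(buf, size);
    } while (cwd == NULL && errno == ENOENT && ++attempt < max_tries);

    return cwd;
}
```

Errors other than ENOENT (for example ERANGE for a buffer that is too small) are passed through unchanged, so callers see the usual getcwd() semantics apart from the retries.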

Attachments

test-getcwd.c
1 kB
21/Nov/19 10:53 PM

Issue Links

is related to

LU-9868 dcache/namei fixes for lustre

Open

Activity

James A Simmons added a comment - 14/Apr/20 2:32 PM - edited

Sigh. RedHat claimed this was fixed. It's going to take some push to get them to resolve this. I don't have the power to resolve this. Someone with greater influence with RedHat will have to discuss a fix.

K. Scott Rowe added a comment - 14/Apr/20 2:18 PM

Perhaps I wasn't clear enough.

This is still a problem.

I can still reproduce this error with RHEL7.8 using the test-getcwd.c program above.

James A Simmons added a comment - 14/Apr/20 1:13 PM

RHEL7.8 contains a fix, so this can be closed. If people encounter this issue, please move to RHEL7.8.

K. Scott Rowe added a comment - 03/Apr/20 6:53 PM - edited

We installed a machine with RHEL-7.8 using kernel 3.10.0-1127.el7.x86_64. I do see a bug fix for a getcwd problem in the kernel changelog:

$ rpm -qi --changelog kernel|grep getcwd
- [fs] vfs: close race between getcwd() and d_move() (Miklos Szeredi) [1631631]

However, 1631631 is a different bug ID from 1811124, which James A Simmons reported, and I can still reproduce the problem on our Lustre-2.10.8 filesystem using the 2.12.4 client on RHEL-7.8.

$ ./test-getcwd /lustre/aoc/sciops/krowe/tmp
test-getcwd: test-getcwd.c:44: main: Assertion `rc == 0' failed.
Aborted

James A Simmons added a comment - 11/Mar/20 12:48 PM

I was told by RedHat that a fix has landed in RHEL7.8.

K. Scott Rowe added a comment - 06/Mar/20 10:21 PM

Thanks for continuing to work on this. I have some good news. We upgraded our servers from Lustre-2.5.5 to Lustre-2.10.8 on Feb. 20, 2020 and the problem has been greatly reduced.

The test program (test-getcwd.c) still fails when run on our Lustre filesystem but the number of times a user has run into this getcwd() problem with other code since we upgraded our Lustre servers has dropped to almost zero.

James A Simmons added a comment - 06/Mar/20 4:20 PM

I submitted a bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=1811124

Neil Brown added a comment - 25/Nov/19 12:34 AM

This sounds like the bug fixed upstream by

Commit 61647823aa92 ("VFS: close race between getcwd() and d_move()")

Fixed in v4.16

Probably the best approach for lustre supporting older kernels is to copy d_drop() from a newer kernel into libcfs, and use that instead of the exported d_drop.

James A Simmons added a comment - 22/Nov/19 12:32 AM

This is a race in ll_splice_alias() due to the use of d_move() when the inode is for a directory. There are fixes for this, but they correct the way Lustre handles its dcache, which causes other types of breakage. I tried some ideas to fix this, but it's a work in progress. Basically, the bug is that $MOUNT/.lustre/fid/$fid_for_mount means that dentry $fid_for_mount == dentry $MOUNT, which causes a circular loop that crashes the node. I might be able to handle this special case using d_real(), which is in newer kernels.

People

Assignee: WC Triage

Reporter: K. Scott Rowe

Votes: 0

Watchers: 9

Dates

Created: 21/Nov/19 10:55 PM

Updated: 13/Jul/20 8:47 PM

Resolved: 13/Jul/20 8:47 PM