[LU-12997] getcwd() returns ENOENT on RHEL7 Created: 21/Nov/19 Updated: 13/Jul/20 Resolved: 13/Jul/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | K. Scott Rowe | Assignee: | WC Triage |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL7.7 as well as other minor versions of RHEL7 on x86_64. |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We have been seeing getcwd() return ENOENT on directories that are, in We see reports in The LD_PRELOAD workaround from |
| Comments |
| Comment by James A Simmons [ 22/Nov/19 ] |
|
This is a race in ll_splice_alias() due to the use of d_move() when the inode is for a directory. There are fixes for this, but they correct the way Lustre handles its dcache, which causes other types of breakage. I tried some ideas to fix this, but it's a work in progress. Basically the bug is that $MOUNT/.lustre/fid/$fid_for_mount means that dentry $fid_for_mount == dentry $MOUNT, which causes a circular loop that crashes the node. I might be able to handle this special case using d_real(), which is in newer kernels. |
| Comment by Neil Brown [ 25/Nov/19 ] |
|
This sounds like the bug fixed upstream by commit 61647823aa92 ("VFS: close race between getcwd() and d_move()"), which was fixed in v4.16. Probably the best approach for Lustre supporting older kernels is to copy d_drop() from a newer kernel into libcfs, and use that instead of the exported d_drop().
|
| Comment by James A Simmons [ 06/Mar/20 ] |
|
I submitted a bugzilla: |
| Comment by K. Scott Rowe [ 06/Mar/20 ] |
|
Thanks for continuing to work on this. I have some good news. We upgraded our servers from Lustre-2.5.5 to Lustre-2.10.8 on Feb. 20, 2020 and the problem has been greatly reduced. The test program (test-getcwd.c) still fails when run on our Lustre filesystem but the number of times a user has run into this getcwd() problem with other code since we upgraded our Lustre servers has dropped to almost zero.
|
| Comment by James A Simmons [ 11/Mar/20 ] |
|
I was told by RedHat that a fix has landed in RHEL7.8. |
| Comment by K. Scott Rowe [ 03/Apr/20 ] |
|
We installed a machine with RHEL-7.8 using kernel 3.10.0-1127.el7.x86_64, and while I see a bug fix for a getcwd problem:
$ rpm -qi --changelog kernel|grep getcwd
- [fs] vfs: close race between getcwd() and d_move() (Miklos Szeredi) [1631631]
1631631 is a different bug id than the 1811124 that James A Simmons reported. And I can still reproduce the problem on our Lustre-2.10.8 filesystem using the 2.12.4 client on RHEL-7.8.
$ ./test-getcwd /lustre/aoc/sciops/krowe/tmp
test-getcwd: test-getcwd.c:44: main: Assertion `rc == 0' failed.
Aborted
|
| Comment by James A Simmons [ 14/Apr/20 ] |
|
RHEL7.8 contains a fix, so this can be closed. If people encounter this issue, please move to RHEL7.8. |
| Comment by K. Scott Rowe [ 14/Apr/20 ] |
|
Perhaps I wasn't clear enough. This is still a problem. I can still reproduce this error with RHEL7.8 using the test-getcwd.c program above.
|
| Comment by James A Simmons [ 14/Apr/20 ] |
|
Sigh. RedHat claimed this was fixed. It's going to take some push to get them to resolve this. I don't have the power to resolve this. Someone with greater influence with RedHat will have to discuss a fix. |
| Comment by James A Simmons [ 14/Apr/20 ] |
|
Peter, can you take over this issue, since you seem to have better relations with RedHat to resolve this? |
| Comment by K. Scott Rowe [ 14/Apr/20 ] |
|
Do you have the ability to test this on an RHEL7.8 host? It would be good to have a second data point. I suppose it is possible I am seeing this issue with our RHEL7.8 host for some other reason that I can't think of. |
| Comment by K. Scott Rowe [ 19/May/20 ] |
|
The kernel was just upgraded on my test RHEL-7.8 machine. It is now running 3.10.0-1127.8.2.el7.x86_64 and I no longer get getcwd() failures:
$ ./test-getcwd /lustre/aoc/sciops/krowe/tmp
getcwd succeeded
I don't understand why this failed with kernel 3.10.0-1127.el7.x86_64 and works now, but assuming it continues to work after more kernel updates, I would say this problem may be fixed. Again, if you have the ability to check this yourself, please do. My environment may be customized in strange ways. |
| Comment by K. Scott Rowe [ 13/Jul/20 ] |
|
I have since tested a later kernel, 3.10.0-1127.13.1.el7.x86_64, and it also works. So I think the solution is to upgrade to at least kernel 3.10.0-1127.8.2.el7.x86_64, the earliest kernel I tested that did not fail.
This ticket can be closed. Thanks for your help.
|