Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
None
-
None
-
http://github.com/chaos/lustre, version 2.1.1-11chaos
-
3
-
6399
Description
I've been getting widespread reports that with 2.1 clients users are seeing random ENOENT errors on opens (and maybe stats?).
Sometimes the file is written, closed, and reopened on the same client node. But the open will report that the file does not exist. A few minutes later the file is definitely there, so the problem is transitory.
We have also had instances of this where the ENOENT occurs on a node other than where the file was created. One node will create, write, and close the file, and then another will attempt to open it only to get ENOENT.
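The same-node case described above (create, write, close, then reopen) can be exercised with a small probe. This is only a sketch, not the simul test itself: the function names, file names, and payload size are hypothetical, and the directory argument is assumed to point at a directory on the affected Lustre mount.

```python
import errno
import os


def create_then_reopen(directory, name="enoent_probe.dat", payload=b"x" * 4096):
    """Create, write, and close a file, then immediately reopen it.

    Returns 0 if the reopen succeeds, otherwise the errno of the failed open.
    The file is left in place on failure so it can be inspected afterwards.
    """
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
    finally:
        os.close(fd)

    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return exc.errno
    os.close(fd)
    os.unlink(path)
    return 0


def count_transient_enoent(directory, iterations=1000):
    """Repeat the probe and count how many reopens failed with ENOENT."""
    failures = 0
    for i in range(iterations):
        result = create_then_reopen(directory, name=f"enoent_probe.{i}")
        if result == errno.ENOENT:
            failures += 1
    return failures
```

On a healthy filesystem `count_transient_enoent` should return 0. Because the failures reported here are random and transitory, a clean run of the probe does not rule the problem out.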
Here is an example failure from a simul test:
09:04:12: Set iteration 4
09:04:12: Running test #0(iter 0): open, shared mode.
09:04:12: Beginning setup
09:04:12: Finished setup (0.001 sec)
09:04:12: Beginning test
09:04:12: Process 177(hype338): FAILED in simul_open, open failed: No such file or directory
There tend not to be any obvious messages in the console logs associated with these events.
Lai, there are some clues in the logs from Parkash:
=============
00000080:00200000:2.0:1337981358.181813:0:54766:0:(file.c:2465:ll_inode_permission()) VFS Op:inode=144237863481960082/33582994(ffff8803b184d838), inode mode 41c0 mask 1
00000080:00002000:2.0:1337981358.181814:0:54766:0:(dcache.c:103:ll_dcompare()) found name test_dir(ffff8804245c3800) - flags 0/0 - refc 1
=============
Thread "54766" is the thread that looks up "/p/lcrater2/swltest/SWL/Hype2/IO/37988/test_dir/miranda_io.out.10018". When it tried to find the parent "test_dir", it got an invalid dentry "ffff8804245c3800", even though that dentry was not marked "DCACHE_LUSTRE_INVALID"; the valid "test_dir" dentry should be "ffff8804302af900". So the "d_inode" of the invalid dentry "ffff8804245c3800" should be NULL, and the VFS path parse then failed at:
That may be why it was the VFS, and not Lustre, that reported "ENOENT". We can follow this clue for further investigation.
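To follow a single thread such as "54766" through a larger debug log, the colon-separated prefix of each line can be parsed and filtered by PID. The sketch below is inferred only from the two lines quoted above. The field names (`subsys`, `mask`, `cpu`, `stack`, `extpid`) are assumptions, not an authoritative description of the Lustre debug log format, and `parse_line` and `records_for_pid` are hypothetical helpers.

```python
import re

# Pattern inferred from the quoted log lines; the field names are assumptions.
LOG_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[\d.]+):"
    r"(?P<time>\d+\.\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one debug-log line, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def records_for_pid(lines, pid):
    """Yield the parsed records that belong to the given thread (PID)."""
    wanted = str(pid)
    for line in lines:
        record = parse_line(line)
        if record is not None and record["pid"] == wanted:
            yield record


sample = [
    "00000080:00200000:2.0:1337981358.181813:0:54766:0:"
    "(file.c:2465:ll_inode_permission()) VFS Op:inode=144237863481960082/"
    "33582994(ffff8803b184d838), inode mode 41c0 mask 1",
    "00000080:00002000:2.0:1337981358.181814:0:54766:0:"
    "(dcache.c:103:ll_dcompare()) found name test_dir(ffff8804245c3800)"
    " - flags 0/0 - refc 1",
]

for rec in records_for_pid(sample, 54766):
    print(rec["time"], rec["func"], rec["msg"])
```

Filtering this way makes it easier to see which dentry address each `ll_dcompare()` call returned for "test_dir" and to compare it against the expected one.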