[LU-1397] ENOENT on open() Created: 10/May/12  Updated: 07/Feb/14  Resolved: 04/Jun/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: None
Environment:

http://github.com/chaos/lustre, version 2.1.1-11chaos


Attachments: File hype336-lu1397-1337981358181.llog.gz     File ior-lustre_debug.diff     File open-v2.stp     File open.stp    
Issue Links:
Duplicate
duplicates LU-506 FC15 patchless client support. Resolved
Related
is related to LU-3579 Performance regression after applying... Resolved
Severity: 3
Rank (Obsolete): 6399

 Description   

I've been getting widespread reports that with 2.1 clients users are seeing random ENOENT errors on opens (and maybe stats?).

Sometimes the file is written, closed, and reopened on the same client node. But the open will report that the file does not exist. A few minutes later the file is definitely there, so the problem is transitory.

We have also had instances of this where the ENOENT occurs on a node other than where the file was created. One node will create, write, and close the file, and then another will attempt to open it only to get ENOENT.

Here is an example failure from a simul test:

09:04:12: Set iteration 4
09:04:12: Running test #0(iter 0): open, shared mode.
09:04:12:       Beginning setup
09:04:12:       Finished setup          (0.001 sec)
09:04:12:       Beginning test
09:04:12: Process 177(hype338): FAILED in simul_open, open failed: No such file or directory

There tend not to be any obvious messages in the console logs associated with these events.



 Comments   
Comment by Andreas Dilger [ 10/May/12 ]

It would be useful to determine if this error is being generated internal to the client (e.g. due to dcache revalidation failing for some reason), or if it is being sent from the MDS (e.g. due to a race condition with the directory or OI lookup).

The major dcache changes for 2.6.38 were not landed in 2.1, so that is fortunately not a candidate here, but it is still possible that kernel dcache changes have introduced problems (e.g. if RHEL has back-ported the dcache changes from 2.6.38 to 2.6.32).

Is it possible to reproduce this problem with at least +vfstrace and +rpctrace debugging enabled, and capture debug logs from the failing client and MDS ASAP after test failure?

It appears that we are running simul as part of the hyperion testing for ~4 hours at a time without seeing these problems, but these test runs are on 2.2 and not 2.1:
https://maloo.whamcloud.com/sub_tests/39e3e4be-8eaa-11e1-8a98-525400d2bfa6
https://maloo.whamcloud.com/sub_tests/5bd67e30-99b6-11e1-9853-52540035b04c
https://maloo.whamcloud.com/sub_tests/ee20c0de-9a51-11e1-9853-52540035b04c

A similar problem may have been reported in LU-1328.

Comment by Christopher Morrone [ 10/May/12 ]

I am not sure how to reproduce it reliably yet. We're shaking out the LU-1299 patch, but after that I want to use our hype cluster to work on reproducing this bug.

Comment by Christopher Morrone [ 11/May/12 ]

We probably hit this a couple of times in overnight testing. It does not happen often enough to just turn on logs and hope someone sees it soon enough to gather them. We'll need some kind of automated log dump trigger.

Comment by Peter Jones [ 12/May/12 ]

Chris

In light of Andreas's comments, I am wondering if you are carrying the LU-1234 patch, which backports some of the 2.6.38 dcache work.

Peter

Comment by Christopher Morrone [ 14/May/12 ]

Yes, we do have that. It appeared in 2.1.1-5chaos, but didn't get installed anywhere until fairly recently with 2.1.1-11chaos. I have been getting reports of transient failed opens from before that patch was installed. Not that it couldn't be adding to the problem...

Comment by Christopher Morrone [ 14/May/12 ]

We haven't been able to reproduce this on demand yet. So far it only hits when we aren't looking for it. I think that a debug patch is required here to make progress.

Comment by Lai Siyao [ 15/May/12 ]

Chris, the debug patch for b2_1: http://review.whamcloud.com/#change,2793.

Comment by Christopher Morrone [ 15/May/12 ]

Is that patch going to be safe to run on a production machine? Genuine failed lookups are also a normal event. I fear that the console load, especially on the MDS, is going to be a problem.

Comment by Christopher Morrone [ 15/May/12 ]

It looks like we may have hit this a few times last night. I'll try the patch on our test system and have the same tests rerun.

Comment by Christopher Morrone [ 15/May/12 ]

It looks like we got a hit with your patch.

Note that this was a testing build of lustre that I tagged as "2.1.1-11chaos6morrone". See the code here:

https://github.com/chaos/lustre/tree/2.1.1-11chaos6morrone

I just ran an IOR, and at the end of the write phase it tried to stat() one of the files and failed:

srun -N 120 -n $((120*8)) src/ior -F -e -g -C -o /p/lcrater2/morrone/foo -t 1m -b 256m
access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
ior ERROR: stat() failed, errno 2, No such file or directory (aiori-POSIX.c:323)
[515] [MPI Abort by user] Aborting Program!
[515:hype292] Abort: MPI_Abort() code: -1, rank 515, MPI Abort by user Aborting program !: at line 97 in file mpid_init.c
srun: mvapich: 2012-05-15T18:12:57: ABORT from MPI rank 515 [on hype292]

and the only error I see is on the MDS from your debug patch:

May 15 18:11:02 zwicky-mds2 kernel: LustreError: 19865:0:(mdt_open.c:1485:mdt_reint_open()) open [0x13680001:0x53129225:0x0]/(morrone->[0x2000013a0:0x7d:0x0]) cr_flag=0104200200001 mode=0040700 msg_flag=0x0 failed: -66
May 15 18:14:38 zwicky-mds2 kernel: LustreError: 20279:0:(mdt_open.c:1485:mdt_reint_open()) open [0x13680001:0x53129225:0x0]/(morrone->[0x2000013a0:0x7d:0x0]) cr_flag=0104200200001 mode=0040700 msg_flag=0x0 failed: -66

After the IOR failure, I logged into the same node that reported the failure and was able to "ls -l" the file in question.

Comment by Lai Siyao [ 15/May/12 ]

I'm a bit confused: IOR failed on stat(), but why did the MDS report an open failure? Could you set up a trigger to dump the MDS log upon such an error? I'm reviewing the related code now.

Comment by Lai Siyao [ 16/May/12 ]

The opened file 'morrone' is a directory, so 'stat' could trigger an opendir, and cr_flag 0104200200001 shows it requests an OPEN lock, but the request failed with -EREMOTE.

There is one strange line of code in the MDS OPEN lock handling (which sets rc to -EREMOTE upon success); I'll update the patch. Could you run the test again with the new patch?

Comment by Christopher Morrone [ 16/May/12 ]

Yes, I can add that to our testing tomorrow. Although that test does not reliably produce that failure, so I'm not sure how long we'll need to run before we know anything.

Comment by Christopher Morrone [ 21/May/12 ]

It looks like -EREMOTE is no longer showing up, but the patch is not fixing the ENOENT problem.

I am also not seeing much in the logs from the patch at the times the problems are hitting.

The only errors that I see are this on the client:

2012-05-20 05:22:00 LustreError: 87891:0:(file.c:685:ll_file_open()) open [0x200497bc4:0xd842:0x0] failed: -21.

and this on the server:

May 20 03:31:58 zwicky-mds2 kernel: LustreError: 21103:0:(mdt_open.c:1484:mdt_reint_open()) open [0x2004978f5:0x8d7a:0x0]/(test_dir->[0x2004978f5:0x8d7c:0x0]) cr_flag=0100000102 mode=0040700 msg_flag=0x0 failed: -21

But those do not occur near the time of the ENOENT in the application, and are likely normal errors.

Comment by Christopher Morrone [ 21/May/12 ]

FYI, this is one of our top bugs now.

Comment by Lai Siyao [ 21/May/12 ]

ll_file_open() returning -21 (-EISDIR) looks suspicious, and the previous test result shows this is an opendir failure. Could you verify that [0x200497bc4:0xd842:0x0] is a directory? I'll review the related code.

Comment by Christopher Morrone [ 21/May/12 ]

Yes, [0x200497bc4:0xd842:0x0] is in fact a directory, which is why I suspect that it is just normal operation and not part of this bug.

Comment by Lai Siyao [ 22/May/12 ]

Does that mean the debug patch may not catch the -ENOENT error? If so, I need to update the debug patch, but first I want to know whether it is possible to set a trigger to dump the client and MDS debug logs upon this failure. BTW, it's best to turn on the 'info' and 'vfstrace' trace flags if the debug log can be dumped.

Comment by Christopher Morrone [ 22/May/12 ]

That is correct. I have not yet seen -ENOENT in the logs on a client where the problem has occurred, or on the server.

I don't have the time to set things up myself. I am going to assign Prakash to help on this ticket. If you can design the trigger yourself and make it easy for us to run, there are more people on our side that can help with that.

Comment by Prakash Surya (Inactive) [ 22/May/12 ]

Lai, is there an easy way to trigger a dump from within Lustre? If we can narrow down a debug patch that will only trigger when this bug happens (i.e. not on a normal error), perhaps we can add an LBUG and dump the logs that way?

Comment by Ned Bass [ 22/May/12 ]

You could just do something like this to dump a stacktrace and the debug log.

libcfs_debug_dumpstack(NULL);
libcfs_debug_dumplog();
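
As a rough illustration (not the actual debug patch), such a dump could be wired into a client-side error path; the placement and the guard condition here are assumptions:

/* Hypothetical placement, e.g. near the error return of ll_file_open()
 * in lustre/llite/file.c.  A real debug patch would need a guard that
 * filters out the many legitimate -ENOENT cases. */
if (rc == -ENOENT) {
        CERROR("LU-1397 debug: unexpected -ENOENT on open, dumping log\n");
        libcfs_debug_dumpstack(NULL);   /* stack trace of the current task */
        libcfs_debug_dumplog();         /* write the Lustre debug log to a file */
}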

Comment by Lai Siyao [ 22/May/12 ]

Are you using IOR to reproduce this? If so, IMHO it's easier to modify the IOR code to trigger a debug-log dump by executing a command line (`lctl dk ...`) upon an error in an IOR operation. I'll try to do it later, but I don't know which IOR version you're using; could you give a link to the IOR you're using?
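
For illustration, a minimal sketch of what such an IOR hook could look like (the helper name, the dump paths, and the "mds-host" placeholder are assumptions, not the contents of the attached ior-lustre_debug.diff):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper to be called from IOR's error path (e.g. where a
 * failed POSIX call is reported).  It dumps the local client's Lustre
 * debug log and asks a placeholder MDS host to do the same. */
static void dump_lustre_logs(void)
{
        char cmd[256];

        /* dump the client-side debug log on the node that hit the error */
        snprintf(cmd, sizeof(cmd),
                 "lctl dk /tmp/lu1397-client-%d.llog", (int)getpid());
        if (system(cmd) != 0)
                fprintf(stderr, "warning: client lctl dk failed\n");

        /* "mds-host" is a placeholder; replace with the real MDS name */
        if (system("ssh mds-host lctl dk /tmp/lu1397-mds.llog") != 0)
                fprintf(stderr, "warning: MDS lctl dk failed\n");
}

Note that dumping the kernel debug log typically requires appropriate privileges on both the client and the MDS, so some sites may need a wrapper or a different trigger mechanism.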

Comment by Ned Bass [ 22/May/12 ]

https://github.com/chaos/ior

Comment by Lai Siyao [ 22/May/12 ]

Patch to trigger lustre debuglog dump upon IOR failure.

Comment by Lai Siyao [ 22/May/12 ]

I've uploaded the patch to trigger a Lustre debug-log dump upon IOR failure, but you need to update the code with the actual MDS name or address.

Comment by Prakash Surya (Inactive) [ 23/May/12 ]

Thanks Lai. I'll try this approach and give an update if it triggers with the patch applied.

Comment by Christopher Morrone [ 23/May/12 ]

Are you using IOR to reproduce this?

No. We are using a suite of applications, any one of which might hit the ENOENT. IOR might be one of them, but if we only patch that we may have to wait a week between hits.

Comment by Prakash Surya (Inactive) [ 23/May/12 ]

Can somebody help me get my bearings in the Lustre client code and determine the code path where ENOENT would be returned from ll_file_open (or any other related functions)?

I'm curious if this error is actually coming from the kernel without ever getting to the lustre client (perhaps through an issue with the dcache?), or if it is indeed being returned by the lustre client.

Lai, I'm running a patched IOR (https://github.com/Prakash-Surya/ior/commits/LU-1397), although as Chris mentioned, it will probably take a long time before it triggers on this specific application. A system-level trigger would be preferable, although I'm not sure how to filter out the normal ENOENT cases from our specific buggy instance.

Comment by Lai Siyao [ 24/May/12 ]

Prakash, I'm also suspicious that the ENOENT may happen during directory pathname lookup, or due to some issue in CWD handling in Lustre.

Can SystemTap run on your Lustre clients? If yes, I'll try to make a script to check the __link_path_walk() return value; if it's -ENOENT, then it's time to trigger a Lustre debug-log dump.

Comment by Prakash Surya (Inactive) [ 24/May/12 ]

Yes, SystemTap is definitely an option. I wrote a simple SystemTap script and installed it on the clients late yesterday to probe ll_file_open and do_sys_open, but haven't gotten a hit yet. I'll upload the script for completeness.

Comment by Prakash Surya (Inactive) [ 24/May/12 ]

I installed this SystemTap script on some of the clients.

Comment by Lai Siyao [ 25/May/12 ]

Please enable more debug trace: `lctl set_param debug=+"dlmtrace dentry inode"` on client if possible.

Comment by Prakash Surya (Inactive) [ 25/May/12 ]

I fixed some issues with the previous script, and now have this running on the clients.

It probes sys_open and ll_file_open, and will dump the Lustre client logs when those return errors.

It also adds rpctrace, vfstrace, dlmtrace, dentry, and inode debugging levels for lustre. Hopefully when we hit the issue with this installed, we'll be able to get some useful information out of the system.

Comment by Prakash Surya (Inactive) [ 25/May/12 ]

In order to probe the path-walk function, I need to find a way to limit it to only the Lustre filesystem. The "name" variable is undefined when I probe it, so I can't simply strncmp its prefix as I do in the sys_open case. ENOENT is much too common a case to handle without filtering it to only Lustre.

Comment by Prakash Surya (Inactive) [ 25/May/12 ]

Lai, I think I got a hit with the v2 SystemTap script in place (I'm still waiting to verify with the user):

May 25 14:29:18 hype336 Lustre: LU-1397 (1337981358181): sys_open: rc = -2, filename = /p/lcrater2/swltest/SWL/Hype2/IO/37988/test_dir/miranda_io.out.10018
May 25 14:29:18 hype336 Lustre: LU-1397 (1337981358181): Dumping llog to /tmp/lu1397-1337981358181.llog

So, sys_open returned -ENOENT, but ll_file_open did not report an error. I assume the kernel returned the error without ever getting to the lustre code, but I can't say for sure as I don't understand the code path well enough.

Comment by Prakash Surya (Inactive) [ 25/May/12 ]

Regarding my previous comment, here is the client lustre log file dumped by the stap script.

Comment by Prakash Surya (Inactive) [ 25/May/12 ]

I talked to the user, and the job that triggered the error was on its first iteration, in which it was opening and creating new files to write to. 'ls'-ing the directory shows that the file is indeed missing. My current guess is that an error in the pathname lookup is the root cause.

Comment by nasf (Inactive) [ 25/May/12 ]

Lai, there are some clues in the logs from Prakash:

=============
00000080:00200000:2.0:1337981358.181813:0:54766:0:(file.c:2465:ll_inode_permission()) VFS Op:inode=144237863481960082/33582994(ffff8803b184d838), inode mode 41c0 mask 1
00000080:00002000:2.0:1337981358.181814:0:54766:0:(dcache.c:103:ll_dcompare()) found name test_dir(ffff8804245c3800) - flags 0/0 - refc 1
=============

The thread "54766" is the just the thread to lookup "/p/lcrater2/swltest/SWL/Hype2/IO/37988/test_dir/miranda_io.out.10018". But when it tried to find the parent "test_dir", it got an invalid dentry "ffff8804245c3800", although it was not marked as "DCACHE_LUSTRE_INVALID". Because the valid "test_dir" should be "ffff8804302af900". So the the "d_inode" for such invalid dentry "ffff8804245c3800" should be NULL, then VFS path parse failed at:

static int __link_path_walk(const char *name, struct nameidata *nd)
{
...
                /* This does the actual lookups.. */
                err = do_lookup(nd, &this, &next);
                if (err)
                        break;

                err = -ENOENT;
                inode = next.dentry->d_inode;
                if (!inode)
                        goto out_dput;
...
}

That may be why Lustre did "NOT" report "ENOENT" but the VFS did. We can follow this clue for further investigation.

Comment by nasf (Inactive) [ 27/May/12 ]

After more investigation, I think the sequence of operations on the client when the failure occurred was as follows:

1) The thread "54770" was an IOR thread that was trying to parse "/p/lcrater2/swltest/SWL/Hype2/IO/37988/test_dir/miranda_io.out.10022". Before processing the last component "miranda_io.out.10022", it needed to look up "test_dir" first.

2) The dentry "ffff8804302af900" for "test_dir" existed in memory at that time, but was marked as "DCACHE_LUSTRE_INVALID". So "ll_dcompare()" against that dentry "ffff8804302af900" returned unmatched for the thread "54770". The thread "54770" therefore created a new negative dentry "ffff8804245c3800" for "test_dir" (inferred from the result) and triggered a real lookup:

00000080:00002000:6.0:1337981358.181442:0:54770:0:(dcache.c:103:ll_dcompare()) found name test_dir(ffff8804302af900) - flags 0/4000000 - refc 25
00000080:00200000:6.0:1337981358.181444:0:54770:0:(namei.c:454:ll_lookup_it()) VFS Op:name=test_dir,dir=144237863481960082/33582994(ffff8803b184d838),intent=0
...
00800000:00000002:6.0:1337981358.181449:0:54770:0:(lmv_intent.c:405:lmv_intent_lookup()) LOOKUP_INTENT with fid1=[0x2006f9298:0xe692:0x0], fid2=[0x0:0x0:0x0], name='test_dir' -> mds #0
00000002:00010000:6.0:1337981358.181450:0:54770:0:(mdc_locks.c:917:mdc_intent_lock()) (name: test_dir,[0x0:0x0:0x0]) in obj [0x2006f9298:0xe692:0x0], intent: lookup flags 00

3) The thread "54770" succeeded in re-obtaining the related ibits lock from the MDS, and tried to update the inode and dentry for "test_dir" on the client:

00000002:00002000:6.0:1337981358.181799:0:54770:0:(mdc_locks.c:844:mdc_finish_intent_lock()) D_IT dentry test_dir intent: lookup status 0 disp b rc 0
...
00000080:00200000:6.0:1337981358.181809:0:54770:0:(namei.c:151:ll_iget()) got inode: ffff8803b184d3f8 for [0x2006f9298:0xe693:0x0]

4) While the thread "54770" was in the function "ll_splice_alias()" processing "test_dir", another thread "54766" (the failing thread described above) began to look up "test_dir" in order to parse "/p/lcrater2/swltest/SWL/Hype2/IO/37988/test_dir/miranda_io.out.10018".

5) There was a race condition between the thread "54770" and the thread "54766" over the "test_dir" dentry:

struct dentry *ll_splice_alias(struct inode *inode, struct dentry *de)
{
        struct dentry *new;

        if (inode) {
               new = ll_find_alias(inode, de);
               if (new) {
                      ll_dops_init(new, 1, 1);
(5.1)===>                      d_rehash(de);
(5.3)===>                      d_move(new, de);
                      iput(inode);
                      CDEBUG(D_DENTRY,
                             "Reuse dentry %p inode %p refc %d flags %#x\n",
                             new, new->d_inode, d_refcount(new), new->d_flags);
                      return new;
               }
        }
...
}


int ll_dcompare(struct dentry *parent, struct qstr *d_name, struct qstr *name)
{
...
(5.2)===>        CDEBUG(D_DENTRY,"found name %.*s(%p) - flags %d/%x - refc %d\n",
               name->len, name->name, dchild,
               d_mountpoint(dchild), dchild->d_flags & DCACHE_LUSTRE_INVALID,
               atomic_read(&dchild->d_count));
...
}

5.1) The thread "54770" found the former dentry "ffff8804302af900" and tried to reuse it. But first, it rehashed the negative dentry "ffff8804245c3800" before establishing the inode alias.

5.2) The thread "54766" found the negative dentry "ffff8804245c3800" in "__d_lookup()", which had just been rehashed by the thread "54770". Since the negative dentry "ffff8804245c3800" was not marked as "DCACHE_LUSTRE_INVALID", the thread "54766" regarded it as a valid entry for "test_dir".

5.3) The thread "54770" called "d_move()" to establish the related inode alias. After that, the exported dentry for "test_dir" became the valid "ffff8804302af900".

00000080:00010000:6.0:1337981358.181811:0:54770:0:(namei.c:410:ll_lookup_it_finish()) setting l_data to inode ffff8803b184d3f8 (144237863481960083/33582994)
...
00000080:00002000:2.0:1337981358.181812:0:54766:0:(dcache.c:103:ll_dcompare()) found name 37988(ffff8804302af540) - flags 0/0 - refc 6
00000080:00200000:2.0:1337981358.181812:0:54766:0:(dcache.c:359:ll_revalidate_it()) VFS Op:name=37988,intent=0
00000080:00200000:2.0:1337981358.181813:0:54766:0:(file.c:2465:ll_inode_permission()) VFS Op:inode=144237863481960082/33582994(ffff8803b184d838), inode mode 41c0 mask 1
00000080:00002000:2.0:1337981358.181814:0:54766:0:(dcache.c:103:ll_dcompare()) found name test_dir(ffff8804245c3800) - flags 0/0 - refc 1
00000080:00002000:6.0:1337981358.181815:0:54770:0:(namei.c:378:ll_splice_alias()) Reuse dentry ffff8804302af900 inode ffff8803b184d3f8 refc 26 flags 0x4000000
00000080:00010000:6.0:1337981358.181817:0:54770:0:(dcache.c:324:ll_lookup_finish_locks()) setting l_data to inode ffff8803b184d3f8 (144237863481960083/33582994)
00000080:00010000:6.0:1337981358.181818:0:54770:0:(dcache.c:233:ll_intent_drop_lock()) releasing lock with cookie 0x829fa7d835b2db86 from it ffff880431f49b98

So here, we should not make the negative dentry "ffff8804245c3800" visible to others in "ll_splice_alias()": either do not rehash it, or mark it as "DCACHE_LUSTRE_INVALID". More suitable processing is needed there.
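
In terms of the ll_splice_alias() code quoted above, the first option would look roughly like this (a sketch only, assuming d_move() behaves like the mainline version, which hashes the surviving alias under the looked-up name and unhashes the target):

if (inode) {
        new = ll_find_alias(inode, de);
        if (new) {
                ll_dops_init(new, 1, 1);
                /* d_rehash(de) removed: 'de' stays unhashed, so a concurrent
                 * __d_lookup() (step 5.2) can never see the transient negative
                 * dentry; d_move() below makes 'new' visible instead. */
                d_move(new, de);
                iput(inode);
                CDEBUG(D_DENTRY,
                       "Reuse dentry %p inode %p refc %d flags %#x\n",
                       new, new->d_inode, d_refcount(new), new->d_flags);
                return new;
        }
}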

Comment by Lai Siyao [ 27/May/12 ]

Yong, you're right. I think it's better to remove d_rehash(); this looks to be a bug inherited from the old kernel's d_splice_alias() (that line has been removed in the latest code).

I'll commit a patch later.

Comment by Lai Siyao [ 27/May/12 ]

Patch is updated: http://review.whamcloud.com/#change,2400.

Please revert the previous patch in your code base and apply this one instead.

Comment by Prakash Surya (Inactive) [ 29/May/12 ]

Lai, was this issue introduced by one of the earlier versions of that patch? I ask because we have a vague recollection that we saw ENOENT issues prior to applying it, albeit less frequently than we currently do. So, if this specific case was introduced (and now fixed?) by http://review.whamcloud.com/2400, there still may be other issues lurking.

Either way, I am going to try and get the new revision of the patch applied and tested today.

And thanks for the detailed explanation Fan Yong! It's very helpful.

Comment by Lai Siyao [ 29/May/12 ]

Prakash, yes, the ENOENT failure found this time was introduced by the earlier patch for LU-1234. It's best to enable the debug-log trigger as well in your verification test, thanks.

Comment by Prakash Surya (Inactive) [ 30/May/12 ]

Ok, Thanks Lai. We should start testing with the new patch set in place today.

Comment by Prakash Surya (Inactive) [ 04/Jun/12 ]

Thanks for the fix, Lai. We haven't seen any ENOENT failures in the past few days of testing, with patch set 4 applied. This can be marked resolved.

Comment by Peter Jones [ 04/Jun/12 ]

Thanks Prakash. We will track landing this code under LU-506, so I am closing this ticket as a duplicate of that. As to whether this fix will also address the instances you may have observed prior to applying the initially flawed LU-1234 patch, it may well do, because the cache mechanism has been altered by this change.
