[LU-5765] sanity test_123a test_123b: rm: no such file or directory Created: 17/Oct/14  Updated: 14/Jan/15  Resolved: 14/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Isaac Huang (Inactive)
Resolution: Fixed Votes: 0
Labels: HB, zfs

Issue Links:
Duplicate
duplicates LU-3573 lustre-rsync-test test_8: @@@@@@ FAIL... Resolved
Related
is related to LU-6101 sanity test_24A: Can not delete direc... Resolved
Severity: 3
Rank (Obsolete): 16184

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

In sanity.sh test_123a and test_123b a large number of errors are being reported when "rm -r" is running:

rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity2': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity5': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity8': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity11': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity12': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity16': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity18': No such file or directory

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/186592e2-5577-11e4-8542-5254006e85c2.

Info required for matching: sanity 123a
Info required for matching: sanity 123b



 Comments   
Comment by Nathaniel Clark [ 21/Oct/14 ]

Happened on lustre-rsync-test test_2b on review-dne-part-1 on master:
https://testing.hpdd.intel.com/test_sets/d99c68e6-5653-11e4-b972-5254006e85c2

Comment by nasf (Inactive) [ 24/Oct/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/e28d061c-5b13-11e4-9c62-5254006e85c2

Comment by Andreas Dilger [ 24/Oct/14 ]

Another failure https://testing.hpdd.intel.com/test_sets/729ab070-5b5e-11e4-95e9-5254006e85c2

Di, Fan Yong,
It looks like readdir is failing for some reason (e.g. looping and returning the same entry hundreds of times?), which is why "rm" returns an error for all of the later entries. Is this statahead or DNE going wrong? Unfortunately, we don't have any debug logs to tell us what is going on.
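One cheap check for the duplicate-entry theory (just a sketch, reusing the directory path from the failure above) is to take a raw, unsorted listing and look for repeated names; any output from uniq -d would point at the readdir stream rather than at "rm" itself:

  # list the directory in readdir order (unsorted) and report any name
  # that shows up more than once
  ls -1af /mnt/lustre/d123a.sanity | sort | uniq -d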

Comment by Andreas Dilger [ 24/Oct/14 ]

It looks like these failures are all happening on ZFS. The lustre-rsync-test test_2b failure doesn't look like the same symptom AFAICS.

Isn't there another bug open with ZFS directory iteration broken? Maybe they are related?

Comment by Isaac Huang (Inactive) [ 30/Oct/14 ]

I repeated "sanity --only 123a,123b" 100 times and couldn't reproduce it. I was using build lustre-b2_5/96/ with 1 OSS (2 OSTs), 1 MDS, and 1 client. Is there any other build/configuration I should try?
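Such a repetition can be driven with a loop roughly like the following (a sketch only; it assumes the ONLY= variable honoured by sanity.sh, and the exact invocation may differ from the one actually used here):

  # repeat just the two subtests; stop on the first failure so the
  # failed state is left behind for inspection
  for i in $(seq 1 100); do
      ONLY="123a 123b" bash sanity.sh || break
  done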

Comment by Isaac Huang (Inactive) [ 30/Oct/14 ]

Another 50 repetitions, still can't reproduce, although the eagle VMs I used all had just 1 CPU.

Comment by Andreas Dilger [ 30/Oct/14 ]

Isaac, I think you should be testing with master and ZFS 0.6.3, and not b2_5, since all of the failures I've seen have been on master so far.

Maybe a patch should be landed to sanity.sh test_123[ab] to help diagnose the problem if it happens again under review testing? For example, do an "ls" of the directory after the test is done, and always run "rm" under strace, logging it to a file that is attached to Maloo, so that we can see whether the problem is in the readdir() data returned to userspace, etc.

There have been 8 failures in the last 165 review-zfs tests in the past week, so if a patch is landed to sanity.sh on master it should be possible to get more information within a day or two. I don't think just running review testing on the patch itself is likely to see problems, unless we run sanity.sh 20x in a loop with Test-Parameters:.
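A rough shape for such a diagnostic change (hypothetical, not an actual patch; $DIR/$tdir, $TMP and error() are the usual test-framework names) could be:

  # run the removal under strace so the directory entries returned to
  # userspace end up in a log that can be attached to Maloo
  if ! strace -f -o $TMP/rm-123a.strace rm -r $DIR/$tdir; then
      ls -la $DIR/$tdir   # show what is left after the failed rm
      error "rm -r failed, strace log in $TMP/rm-123a.strace"
  fi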

Comment by nasf (Inactive) [ 30/Oct/14 ]

Isaac, another possible factor is that our Maloo test clusters run a lot of VMs on a single node, so the system load on the Maloo clusters should be much higher than in our personal test environments. If we can simulate a similar test environment to the Maloo clusters, it may help reproduce the issue locally.
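One crude way to approximate that locally (an assumption about what might matter, not a description of how Maloo itself generates load) is to keep a few CPU-hog loops running in the background while the test repeats:

  # start four busy loops, run the test under that load, then clean up
  pids=""
  for i in 1 2 3 4; do
      ( while :; do :; done ) &
      pids="$pids $!"
  done
  ONLY="123a 123b" bash sanity.sh
  kill $pids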

Comment by Isaac Huang (Inactive) [ 01/Dec/14 ]

With a build on master, I'm now able to reproduce it almost 100%. I'll look into it.

Comment by Isaac Huang (Inactive) [ 02/Dec/14 ]

In one test today, test_123a took 18127 seconds to complete (more than 100x the usual time on the same hardware), and there were 11575754 "no such file" errors in the test log.

Comment by Andreas Dilger [ 16/Dec/14 ]

Is it possible that this problem is the same as LU-3573 and was fixed by http://review.whamcloud.com/12904 ?

The most recent failure https://testing.hpdd.intel.com/test_sets/97ea0fd0-84f6-11e4-a60f-5254006e85c2 was on a patch based on a tree that doesn't contain the LU-3573 fix.

The most recent test failure before that was https://testing.hpdd.intel.com/test_sets/863b9434-fcb2-11e2-9222-52540035b04c on 2014-08-03.

Comment by Andreas Dilger [ 21/Dec/14 ]

Saw this again on a recent patch run:
https://testing.hpdd.intel.com/test_sets/e55f7412-881b-11e4-aa28-5254006e85c2
https://testing.hpdd.intel.com/test_sets/c69c83ca-87f9-11e4-a70f-5254006e85c2

Comment by Andreas Dilger [ 14/Jan/15 ]

This might also be related to LU-6101, which also has a patch.

Comment by Andreas Dilger [ 14/Jan/15 ]

Haven't seen this again for the past 4 weeks.

Closing it again. Maybe LU-6101 will fix the final trigger.
