[LU-5765] sanity test_123a test_123b: rm: no such file or directory Created: 17/Oct/14 Updated: 14/Jan/15 Resolved: 14/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16184 |
| Description |
|
This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

In sanity.sh test_123a and test_123b a large number of errors are being reported when "rm -r" is running:

rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity2': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity5': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity8': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity11': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity12': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity16': No such file or directory
rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity18': No such file or directory

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/186592e2-5577-11e4-8542-5254006e85c2.

Info required for matching: sanity 123a |
| Comments |
| Comment by Nathaniel Clark [ 21/Oct/14 ] |
|
Happened on lustre-rsync-test test_2b on review-dne-part-1 on master: |
| Comment by nasf (Inactive) [ 24/Oct/14 ] |
|
Another failure instance: |
| Comment by Andreas Dilger [ 24/Oct/14 ] |
|
Another failure https://testing.hpdd.intel.com/test_sets/729ab070-5b5e-11e4-95e9-5254006e85c2 Di, Fan Yong, |
| Comment by Andreas Dilger [ 24/Oct/14 ] |
|
It looks like these failures are all happening on ZFS. The lustre-rsync-test test_2b failure doesn't look like the same symptom AFAICS. Isn't there another bug open about broken ZFS directory iteration? Maybe they are related? |
| Comment by Isaac Huang (Inactive) [ 30/Oct/14 ] |
|
I repeated "sanity --only 123a,123b" 100 times and couldn't reproduce it. I was using build lustre-b2_5/96/ with 1 OSS (2 OSTs), 1 MDS, and 1 client. Is there any other build/configuration I should try? |
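For reference, a sketch of how such a repetition run might look; the test directory path, the ONLY= convention, and the auster invocation are assumptions based on the "sanity --only 123a,123b" wording above, not a record of the exact commands used:

#!/bin/bash
# Illustrative only: repeat the two subtests until one iteration fails.
# Assumes the lustre-tests layout installed under /usr/lib64/lustre/tests.
cd /usr/lib64/lustre/tests || exit 1

for i in $(seq 1 100); do
        echo "=== iteration $i ==="
        ONLY="123a 123b" sh sanity.sh || { echo "failed on iteration $i"; break; }
done

# Roughly the same thing through the auster wrapper, as in the comment above:
#   ./auster -v sanity --only 123a,123b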
| Comment by Isaac Huang (Inactive) [ 30/Oct/14 ] |
|
Another 50 repetitions and still no reproduction, although the eagle VMs I used each had just 1 CPU. |
| Comment by Andreas Dilger [ 30/Oct/14 ] |
|
Isaac, I think you should be testing with master and ZFS 0.6.3, not b2_5, since all of the failures I've seen so far have been on master. Maybe a patch should be landed to sanity.sh test_123[ab] to help diagnose the problem if it happens again under review testing? For example, doing an "ls" of the directory after the test is done, or always running "rm" under strace and logging it to a file that is attached to Maloo, so that we can see whether the problem is in the readdir() data returned to userspace, etc. There have been 8 failures in the last 165 review-zfs tests in the past week, so if a patch is landed to sanity.sh on master it should be possible to get more information within a day or two. I don't think just running review testing on the patch itself is likely to hit the problem, unless we run sanity.sh 20x in a loop with Test-Parameters:. |
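A minimal sketch of the kind of diagnostic Andreas describes, not an actual sanity.sh patch; it assumes the usual sanity.sh conventions ($DIR, $tdir, $TMP, and the error() helper), and the function name and log file names are illustrative:

# Hypothetical cleanup wrapper for test_123[ab]: record a directory listing
# and an strace of "rm -r" so a future failure shows whether bad entries
# came back from readdir()/getdents() or rm simply raced with something.
rm_123_diag() {
        local dir=$DIR/$tdir
        local ls_log=$TMP/ls_${tdir}.log
        local rm_log=$TMP/rm_${tdir}.strace

        ls -la $dir > $ls_log 2>&1

        strace -f -o $rm_log -e trace=getdents,getdents64,unlink,unlinkat \
                rm -r $dir ||
                error "rm -r $dir failed; see $ls_log and $rm_log"
}

The strace output would show the exact getdents() records handed to rm, which is what is needed to tell a client readdir problem apart from a plain rm race.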
| Comment by nasf (Inactive) [ 30/Oct/14 ] |
|
Isaac, another possible factor is that our Maloo test clusters run a lot of VMs on a single node, so the system load on the Maloo clusters should be much higher than in our personal test environments. If we can simulate a test environment similar to the Maloo clusters, that may help reproduce the issue locally. |
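One way to approximate that kind of load on a local client, purely as an illustration (the busy loops, the dd target, and the ONLY= invocation are assumptions, not the actual autotest configuration):

#!/bin/bash
# Illustrative: generate background CPU and I/O load while looping the
# subtests, to mimic a heavily shared autotest VM host.
for c in $(seq "$(nproc)"); do
        ( while :; do :; done ) &                 # one busy loop per core
done
( while :; do
        dd if=/dev/zero of=/tmp/load.$$ bs=1M count=1024 conv=fsync 2>/dev/null
  done ) &                                        # steady background writes

for i in $(seq 1 20); do
        ONLY="123a 123b" sh sanity.sh || break
done

kill $(jobs -p) 2>/dev/null                       # tear down the load generators
rm -f /tmp/load.$$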
| Comment by Isaac Huang (Inactive) [ 01/Dec/14 ] |
|
With a build on master, I'm now able to reproduce it almost 100%. I'll look into it. |
| Comment by Isaac Huang (Inactive) [ 02/Dec/14 ] |
|
In one test today, test_123a took 18127 seconds to complete (more than 100x the usual time on the same hw), and there were 11575754 "no such file" errors in the test log. |
| Comment by Andreas Dilger [ 16/Dec/14 ] |
|
Is it possible that this problem is the same as The most recent failure https://testing.hpdd.intel.com/test_sets/97ea0fd0-84f6-11e4-a60f-5254006e85c2 was on a patch based on a tree that doesn't contain the Before that, the most recent test failure was https://testing.hpdd.intel.com/test_sets/863b9434-fcb2-11e2-9222-52540035b04c on 2014-08-03. |
| Comment by Andreas Dilger [ 21/Dec/14 ] |
|
Saw this again on a recent patch run: |
| Comment by Andreas Dilger [ 14/Jan/15 ] |
|
This might also be related to |
| Comment by Andreas Dilger [ 14/Jan/15 ] |
|
Haven't seen this again for the past 4 weeks. Closing it again. Maybe |