[LU-2602] test: recovery-small 24b: test IGNORED and test log badly corrupted Created: 10/Jan/13 Updated: 12/Mar/14 Resolved: 12/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.3 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jay Lan (Inactive) | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Server: 2.1.3-1nasS, centos 6.3, 2.6.32_279.2.1.el6 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6069 |
| Description |
|
This test is IGNORED. Is this result expected? Test logs tarball is attached. |
| Comments |
| Comment by Peter Jones [ 10/Jan/13 ] |
|
Bob, could you please look into this one? Thanks, Peter |
| Comment by Bob Glossman (Inactive) [ 10/Jan/13 ] |
|
Jay, I'm a little confused. You say your client is 2.1.3, but as far as I can see the 2.1.3 version of recovery-small doesn't even have test 24b. It only has test_24. |
| Comment by Bob Glossman (Inactive) [ 10/Jan/13 ] |
|
I'm guessing that the bad log comes from the error_ignore function being called with the wrong number of arguments. It's supposed to be 'error_ignore bug-number error-string', but instead it seems to be just 'error_ignore string' in test 24b. This could also account for weird stuff in the log. I suspect this isn't seen a lot as the multiop command usually fails as it's supposed to in this test. To check this out could you just edit your recovery-small.sh script, add any number (12345 will do) as the first argument to the error_ignore line near the end of test_24b, run the test again and see what happens? |
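A minimal sketch of how a missing bug number could garble the output, assuming error_ignore treats its first argument as the bug number and everything after it as the message. The function below is a hypothetical stand-in (error_ignore_sketch), not the real test-framework code, and the message text is made up for illustration:

```shell
# Hypothetical stand-in for error_ignore; NOT the real test-framework code.
# Assumed calling convention (from the comment above):
#   error_ignore bug-number error-string
error_ignore_sketch() {
	local bug="$1"   # first argument is taken as the bug number
	shift            # whatever remains is taken as the message
	echo "IGNORE (bug ${bug}): $*"
}

# Intended form: bug number first, then the message.
good=$(error_ignore_sketch 5494 "made-up failure message")

# Form suspected in test 24b: message only. The message lands in the
# bug-number slot and the message part ends up empty.
bad=$(error_ignore_sketch "made-up failure message")

echo "$good"
echo "$bad"
```

With the one-argument call, the text that was meant to be the message is consumed as the bug number, which is one plausible way the log could end up looking wrong.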
| Comment by Jay Lan (Inactive) [ 10/Jan/13 ] |
|
Bob, I hit this problem with lustre client 2.3.0-2nasC, sles11sp2, 3.0.42_0.7.3 also. Please ignore the 2.1.3-SP2_1nasC_ofed154 client for sles11sp2. It is not a supported version. |
| Comment by Bob Glossman (Inactive) [ 10/Jan/13 ] |
|
That makes more sense. There is definitely a test 24b in 2.3 and newer versions of recovery-small. It looks like the bogus error_ignore call is there too. |
| Comment by Bob Glossman (Inactive) [ 10/Jan/13 ] |
|
Sorry, spoke too soon. Just checked the tree. The 2.3 release version of the script still only has test 24. Only newer builds like current master will have 24b. The relevant change went in around 10/30/2012. |
| Comment by Jay Lan (Inactive) [ 10/Jan/13 ] |
|
Bob, adding '12345' as the first argument did not help. However, when I changed '12345' to '5494', the number used in 24a, I got a clean log. I guess 12345 is probably not a valid bug number? |
| Comment by Jay Lan (Inactive) [ 10/Jan/13 ] |
|
My git repo is at https://github.com/jlan/lustre-nas, |
| Comment by Bob Glossman (Inactive) [ 10/Jan/13 ] |
|
Thanks for going the extra mile. I think you have verified that the form of the error_ignore call is causing the problem. Knowing that, I can do a one-line fix. Not sure why it didn't like 12345. At first blush it seems like any number would be OK. There are examples of error_ignore calls with other five-digit numbers in various scripts. |
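One way to look for other calls with the same problem is to search for error_ignore lines whose first argument does not start with a digit. This is only a sketch: the sample lines are made up so the example is self-contained, and it assumes bug numbers are written as plain digits. In practice the grep would be pointed at the real test scripts instead of the temporary file.

```shell
# Made-up sample lines standing in for a test script; real scripts will differ.
sample=$(mktemp)
cat > "$sample" <<'EOF'
error_ignore 5494 "first hypothetical message"
error_ignore "second hypothetical message"
error_ignore 12345 "third hypothetical message"
EOF

# Print, with line numbers, every call whose first argument
# does not begin with a digit (i.e. the bug number looks missing).
suspect=$(grep -nE 'error_ignore[[:space:]]+[^0-9[:space:]]' "$sample")
echo "$suspect"

rm -f "$sample"
```

Only the second sample line is reported, since it is the only call without a leading number.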
| Comment by Jay Lan (Inactive) [ 10/Jan/13 ] |
|
Blame on commit cb102ce, which I picked up while picking up fix for |
| Comment by Bob Glossman (Inactive) [ 10/Jan/13 ] |
| Comment by Sarah Liu [ 21/Jan/13 ] |
|
Hit this issue in lustre-master tag-2.3.59 ofd build testing. Server/client: lustre-master build #1176, ofd build. |
| Comment by Jay Lan (Inactive) [ 25/Mar/13 ] |
|
Was the failure reported by Sarah Liu reproducible? If not, it is not caused by the commit checked in at http://review.whamcloud.com/4995. If that commit did not cause any side effects, can we close this ticket? |
| Comment by Bob Glossman (Inactive) [ 25/Mar/13 ] |
|
Jay, looking back at the failure history of this subtest, I don't see any similar failures since 1/26. I'm willing to close the ticket if it's OK with you. |
| Comment by Jay Lan (Inactive) [ 25/Mar/13 ] |
|
Yes, let's close it. Thanks! |
| Comment by Jinshan Xiong (Inactive) [ 25/Oct/13 ] |
|
Test case 24b should be excluded for ldiskfs as well. It has many race and timing issues that can make it fail. For example, the eviction can happen before the processes write the data; the "dirty page discard" message can occasionally be missing from dmesg; and the write can succeed before eviction if there is no grant on the client, because it will then be completed as a SYNC write. |
| Comment by John Fuchs-Chesney (Inactive) [ 12/Mar/14 ] |
|
Customer indicated that we can 'close' this issue. |