[LU-2620] Failure on test suite replay-ost-single test_6: test_6 failed with 1 Created: 15/Jan/13 Updated: 05/Sep/13 Resolved: 07/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.4, Lustre 1.8.9 |
| Fix Version/s: | Lustre 2.4.0, Lustre 2.1.5 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6130 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/478c299a-5ef8-11e2-b507-52540035b04c. The sub-test test_6 failed with the following error:
== replay-ost-single test 6: Fail OST before obd_destroy == 17:27:58 (1358126878)
Waiting for orphan cleanup...
CMD: client-32vm3 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
Waiting for local destroys to complete
1280+0 records in
1280+0 records out
5242880 bytes (5.2 MB) copied, 0.970226 s, 5.4 MB/s
/mnt/lustre/d0.replay-ost-single/f.replay-ost-single.6
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  0
        obdidx           objid          objid            group
             0             193          0xc1                 0
CMD: client-32vm3 lctl set_param fail_loc=0x80000119
fail_loc=0x80000119
before: 12650184 after_dd: 13693644
 replay-ost-single test_6: @@@@@@ FAIL: test_6 failed with 1 |
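For context, the "before:" / "after_dd:" values in the log above come from a free-space check taken around the dd. Below is a minimal sketch of that kind of check, not the actual replay-ost-single.sh code; the parameter path, the awk summing, and the failure message are assumptions for illustration only.

    # Sketch only: sample free space, write 5 MB (1280 x 4k, matching the dd in
    # the log), then require that free space dropped. The failure above means
    # after_dd was NOT smaller than before, consistent with leftover destroys
    # from the previous test completing between the two samples.
    before=$(lctl get_param -n osc.*.kbytesfree | awk '{sum += $1} END {print sum}')
    dd if=/dev/zero of=/mnt/lustre/d0.replay-ost-single/f.replay-ost-single.6 bs=4k count=1280 oflag=sync
    after_dd=$(lctl get_param -n osc.*.kbytesfree | awk '{sum += $1} END {print sum}')
    [ "$after_dd" -lt "$before" ] || echo "FAIL: free space did not shrink (before=$before after_dd=$after_dd)"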
| Comments |
| Comment by Andreas Dilger [ 15/Jan/13 ] |
|
I think this problem may have been induced by the |
| Comment by Bob Glossman (Inactive) [ 15/Jan/13 ] |
|
So far I haven't been able to reproduce this failure locally. If it is indeed due to some problem with http://review.whamcloud.com/4885, for example returning from the function wait_mds_ost_sync() too quickly, shouldn't that be causing failures in all tests that use this function? This seems to be the only failure reported. |
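For reference, the "Waiting for orphan cleanup" step in the log polls osp.*osc*.old_sync_processed on the MDS. A hedged sketch of that kind of polling loop is below; it is not the real wait_mds_ost_sync() from test-framework.sh (which differs in detail), and the 90-second limit is arbitrary.

    # Poll until the MDS reports the old sync requests processed, or time out.
    # do_facet and $SINGLEMDS are existing test-framework helpers.
    wait_mds_ost_sync_sketch() {
        local elapsed=0
        while [ $elapsed -lt 90 ]; do
            local synced=$(do_facet $SINGLEMDS \
                "lctl get_param -n osp.*osc*.old_sync_processed" 2>/dev/null | grep -c 1)
            [ "$synced" -gt 0 ] && return 0
            sleep 1
            elapsed=$((elapsed + 1))
        done
        return 1   # returning too early here is the behavior suspected above
    }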
| Comment by Bob Glossman (Inactive) [ 16/Jan/13 ] |
|
It seems the preceding test_5 must run in order to trigger the problem; I note that it did run in the failing test set. With the SLOW=no default it is skipped, which I think is why I was having difficulty reproducing the problem. |
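For anyone trying to reproduce this, a hedged example of forcing test_5 to run before test_6; exact invocation may vary slightly by branch and test setup.

    # Run from the lustre/tests directory of the build. SLOW=yes keeps test_5
    # from being skipped; ONLY restricts the run to the two relevant subtests.
    cd lustre/tests
    SLOW=yes ONLY="5 6" sh replay-ost-single.sh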
| Comment by Bob Glossman (Inactive) [ 16/Jan/13 ] |
|
bzzz, in comments in |
| Comment by Alex Zhuravlev [ 16/Jan/13 ] |
|
yes, this represents only old requests (left from the previous boot). mds_sync (representing u.filter.fo_mds_ost_sync in obdfilter) can't go to 0 either, as MDS-OST recovery happens only once, at startup. |
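For reference, a hedged example of inspecting the two counters discussed here; the obdfilter parameter path is inferred from the fo_mds_ost_sync reference above and may differ between versions.

    # On the MDS (2.4 OSP layer): counts only old requests left from the previous boot.
    lctl get_param osp.*osc*.old_sync_processed
    # On the OST (obdfilter, u.filter.fo_mds_ost_sync): cannot be relied on to reach 0,
    # since MDS-OST recovery happens only once at startup (per the comment above).
    lctl get_param obdfilter.*.mds_sync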
| Comment by Bob Glossman (Inactive) [ 16/Jan/13 ] |
|
In discussion, bzzz suggested using the existing function wait_delete_completed_mds() to force the previous rm to finish before moving on. Apparently this kind of failure isn't new and isn't due to the fix for |
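For illustration, a hedged sketch of the suggested ordering: call the existing wait_delete_completed_mds() helper before the free-space sampling, so leftover destroys from test_5 cannot inflate the numbers. The real change went through Gerrit (the patch referenced in later comments) and may differ; the kbytesfree sampling below is an assumption.

    # Sketch only: drain pending MDS->OST destroys from the previous test before
    # taking the "before" free-space sample that test_6 compares against.
    wait_delete_completed_mds
    before=$(lctl get_param -n osc.*.kbytesfree | awk '{sum += $1} END {print sum}')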
| Comment by Bob Glossman (Inactive) [ 16/Jan/13 ] |
| Comment by Sarah Liu [ 18/Feb/13 ] |
|
another failure seen in zfs: https://maloo.whamcloud.com/test_sets/c34fa2a2-7788-11e2-987d-52540035b04c |
| Comment by Jian Yu [ 25/Feb/13 ] |
|
Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/176 The same failure occurred: https://maloo.whamcloud.com/test_sets/93369ad0-7d78-11e2-85d0-52540035b04c |
| Comment by Andreas Dilger [ 05/Mar/13 ] |
|
This is still failing in master several times every day: |
| Comment by Bob Glossman (Inactive) [ 05/Mar/13 ] |
|
I think the failures being seen now look different. The originally reported bug showed only test_6 failing, and only after running test_5 due to SLOW=yes. The new failures show test_6 and all following tests failing. Possibly an entirely new underlying cause. |
| Comment by Andreas Dilger [ 05/Mar/13 ] |
|
Looks like all of the new failures are on review-zfs test runs, so it may be due to a different cause? |
| Comment by Bob Glossman (Inactive) [ 05/Mar/13 ] |
|
Also, the original failure was seen regardless of fstype. The new ones appear to be zfs-only, if I'm not mistaken. |
| Comment by Andreas Dilger [ 05/Mar/13 ] |
|
Let's task |
| Comment by Bob Glossman (Inactive) [ 05/Mar/13 ] |
|
Andreas, it looks like we are on the same page in suspecting a different cause. |
| Comment by Jian Yu [ 07/Mar/13 ] |
|
Hello Oleg, could you please cherry-pick the patch from http://review.whamcloud.com/5042 to the Lustre b2_1 branch, since the failure occurs in 2.1.4<->2.4.0 interop testing? Thanks. |
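For reference, a hedged example of how such a cherry-pick from Gerrit is typically done. The refs/changes path uses the last two digits of the change number; the trailing patchset number ("/1" here) is a placeholder and should be taken from the review page.

    # Fetch the change from Gerrit and cherry-pick it onto b2_1 (patchset guessed).
    git checkout b2_1
    git fetch http://review.whamcloud.com/lustre refs/changes/42/5042/1
    git cherry-pick FETCH_HEAD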
| Comment by Peter Jones [ 07/Mar/13 ] |
|
closing again as the new issue is being tracked under |
| Comment by Sarah Liu [ 26/Mar/13 ] |
|
Hit this bug in interop between a 1.8.9 client and a 2.4 server. The server build is #1338, which should include the fix. https://maloo.whamcloud.com/test_sets/b683542a-948d-11e2-93c6-52540035b04c |
| Comment by Bob Glossman (Inactive) [ 26/Mar/13 ] |
|
I think this failure is expected. The patch from http://review.whamcloud.com/5042 was cherry-picked to b2_1, which fixed the problem for 2.1/2.4 interop. As far as I can see this was never done for b1_8, so the problem was never fixed for 1.8.9/2.4 interop. |
| Comment by Jian Yu [ 05/Sep/13 ] |
|
Lustre client: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1). replay-ost-single test 6 hit the same failure: |