[LU-2620] Failure on test suite replay-ost-single test_6: test_6 failed with 1 Created: 15/Jan/13  Updated: 05/Sep/13  Resolved: 07/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.4, Lustre 1.8.9
Fix Version/s: Lustre 2.4.0, Lustre 2.1.5

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: HB

Issue Links:
Duplicate
is duplicated by LU-1966 Test failure on test suite replay-ost... Resolved
Related
is related to LU-2494 error: get_param: /proc/{fs,sys}/{lne... Resolved
is related to LU-2874 Test timeout failure on test suite re... Resolved
is related to LU-2011 Test failure on test suite replay-ost... Resolved
Severity: 3
Rank (Obsolete): 6130

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/478c299a-5ef8-11e2-b507-52540035b04c.

The sub-test test_6 failed with the following error:

test_6 failed with 1

== replay-ost-single test 6: Fail OST before obd_destroy == 17:27:58 (1358126878)
Waiting for orphan cleanup...
CMD: client-32vm3 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
Waiting for local destroys to complete
1280+0 records in
1280+0 records out
5242880 bytes (5.2 MB) copied, 0.970226 s, 5.4 MB/s
/mnt/lustre/d0.replay-ost-single/f.replay-ost-single.6
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		 objid		objid		 group
	     0	           193	         0xc1	             0

CMD: client-32vm3 lctl set_param fail_loc=0x80000119
fail_loc=0x80000119
before: 12650184 after_dd: 13693644
 replay-ost-single test_6: @@@@@@ FAIL: test_6 failed with 1 


 Comments   
Comment by Andreas Dilger [ 15/Jan/13 ]

I think this problem may have been induced by the LU-2494 landing of http://review.whamcloud.com/4885.

Comment by Bob Glossman (Inactive) [ 15/Jan/13 ]

So far I haven't been able to reproduce this failure locally.

If it is indeed due to some problem with http://review.whamcloud.com/4885, for example returning from the function wait_mds_ost_sync() too quickly, shouldn't that be causing failures in all tests that use this function? This seems to be the only failure reported.
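
For reference, the kind of polling loop in question looks roughly like the sketch below. This is only an illustrative approximation, not the actual wait_mds_ost_sync() from test-framework.sh; the parameter path is the one shown in the log above, and the timeout handling is assumed.

# Illustrative sketch only -- not the real wait_mds_ost_sync().
# Polls the osp parameter from the log above until it reports 1
# (old sync requests processed) or the assumed timeout expires.
wait_mds_ost_sync_sketch() {
    local timeout=${1:-60} elapsed=0 flag
    while [ $elapsed -lt $timeout ]; do
        flag=$(lctl get_param -n "osp.*osc*.old_sync_processed" 2>/dev/null)
        [ "$flag" = "1" ] && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

If old_sync_processed is set to 1 once at startup and never changes, a loop like this returns immediately, which is exactly the "returning too quickly" concern described above.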

Comment by Bob Glossman (Inactive) [ 16/Jan/13 ]

It seems the preceding test_5 must run in order to trigger the problem; I note that it did run in the failing test set. With the SLOW=no default it is skipped, which I think is why I was having difficulty reproducing the problem.
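
To reproduce under that assumption, running the two tests back to back with slow tests enabled should suffice. A hedged example, assuming the usual lustre/tests invocation conventions:

# Run test_5 followed by test_6; the SLOW=no default skips test_5,
# which is why test_6 alone did not reproduce the failure locally.
cd lustre/tests
SLOW=yes ONLY="5 6" sh replay-ost-single.sh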

Comment by Bob Glossman (Inactive) [ 16/Jan/13 ]

bzzz, in comments on LU-2494 you said that old_sync_processed was the correct new flag to use. However, I can't see it ever going to 0 when I do a sync command on the client; it seems that once it gets set to 1 shortly after startup it stays 1 forever. If so, that makes it a poor analog for the mds_sync variable in older versions, and Andreas is correct that my fix for LU-2494 probably led to this problem.

Comment by Alex Zhuravlev [ 16/Jan/13 ]

Yes, this represents only old requests (left from the previous boot). mds_sync (representing u.filter.fo_mds_ost_sync in obdfilter) can't go to 0 either, as MDS-OST recovery happens only once at startup.
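
For context, the two flags being compared can be read as shown below. This is a hedged illustration: the osp path matches the log above, while the obdfilter path for older servers is assumed from the names used in this thread.

# Newer MDS (osp devices), as polled in the log above:
lctl get_param -n "osp.*osc*.old_sync_processed"
# Older obdfilter-based servers expose the analogous mds_sync state
# (assumed path for pre-2.4 servers):
lctl get_param -n "obdfilter.*.mds_sync"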

Comment by Bob Glossman (Inactive) [ 16/Jan/13 ]

In discussion, bzzz suggested using the existing function wait_delete_completed_mds() to force the previous rm to finish before moving on. Apparently this kind of failure isn't new and isn't due to the fix for LU-2494: the existing calls in test_6 to wait_mds_ost_sync, wait_destroy_complete, and sync aren't sufficient to guarantee that kbytesavail is stable before starting the test. Investigating this solution; if it works in local testing I will work up a patch.
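
As a rough illustration of that approach (an assumption-based sketch, not the actual wait_delete_completed_mds() from test-framework.sh), one could poll the free-space counters until consecutive samples stop changing, so that the "before" reading taken by test_6 is stable:

# Hypothetical sketch approximating the idea behind
# wait_delete_completed_mds(): wait until pending object destroys
# have landed by polling summed kbytesavail until it stops changing.
wait_destroys_settled_sketch() {
    local timeout=${1:-120} elapsed=0 prev=-1 cur
    while [ $elapsed -lt $timeout ]; do
        cur=$(lctl get_param -n "osc.*.kbytesavail" 2>/dev/null |
              awk '{sum += $1} END {print sum}')
        [ -n "$cur" ] && [ "$cur" = "$prev" ] && return 0
        prev=$cur
        sleep 2
        elapsed=$((elapsed + 2))
    done
    return 1
}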

Comment by Bob Glossman (Inactive) [ 16/Jan/13 ]

http://review.whamcloud.com/5042

Comment by Sarah Liu [ 18/Feb/13 ]

Another failure seen with zfs: https://maloo.whamcloud.com/test_sets/c34fa2a2-7788-11e2-987d-52540035b04c

Comment by Jian Yu [ 25/Feb/13 ]

Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/176
Lustre master server build: http://build.whamcloud.com/job/lustre-master/1269
Distro/Arch: RHEL6.3/x86_64

The same failure occurred: https://maloo.whamcloud.com/test_sets/93369ad0-7d78-11e2-85d0-52540035b04c

Comment by Andreas Dilger [ 05/Mar/13 ]

This is still failing in master several times every day:
https://maloo.whamcloud.com/sub_tests/7b9df70e-8531-11e2-bfd3-52540035b04c
https://maloo.whamcloud.com/sub_tests/3d973ca8-8532-11e2-9ab1-52540035b04c
https://maloo.whamcloud.com/sub_tests/35cfc004-8513-11e2-9ab1-52540035b04c
https://maloo.whamcloud.com/sub_tests/7e5e9684-8512-11e2-bfd3-52540035b04c

Comment by Bob Glossman (Inactive) [ 05/Mar/13 ]

I think the failures being seen now look different. The originally reported bug showed only test_6 failing, and only after running test_5 due to SLOW=yes. The new failures show test_6 and all following tests failing, possibly from an entirely new underlying cause.

Comment by Andreas Dilger [ 05/Mar/13 ]

Looks like all of the new failures are on review-zfs test runs, so they may be due to a different cause?

Comment by Bob Glossman (Inactive) [ 05/Mar/13 ]

Also, the original failure was seen regardless of fstype. The new ones appear to be only on zfs, if I'm not mistaken.

Comment by Andreas Dilger [ 05/Mar/13 ]

Let's track this new failure under LU-2903, since it does appear that this one was only hit when test_5 was being run.

Comment by Bob Glossman (Inactive) [ 05/Mar/13 ]

Andreas, looks like we are on the same page in suspecting a different cause.

Comment by Jian Yu [ 07/Mar/13 ]

Hello Oleg,

Could you please cherry-pick the patch from http://review.whamcloud.com/5042 to the Lustre b2_1 branch, since the failure occurs in 2.1.4<->2.4.0 interop testing? Thanks.

Comment by Peter Jones [ 07/Mar/13 ]

Closing again as the new issue is being tracked under LU-2903.

Comment by Sarah Liu [ 26/Mar/13 ]

Hit this bug in interop testing between a 1.8.9 client and a 2.4 server; the server build is #1338, which should include the fix for LU-2903.

https://maloo.whamcloud.com/test_sets/b683542a-948d-11e2-93c6-52540035b04c

Comment by Bob Glossman (Inactive) [ 26/Mar/13 ]

I think this failure is expected. The patch from http://review.whamcloud.com/5042 was cherry-picked to b2_1, which fixed the problem for 2.1/2.4 interop. As far as I can see this was never done for b1_8, so the problem was never fixed for 1.8.9/2.4 interop.

Comment by Jian Yu [ 05/Sep/13 ]

Lustre client: http://build.whamcloud.com/job/lustre-b1_8/258/ (1.8.9-wc1)
Lustre server: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)

replay-ost-single test 6 hit the same failure:
https://maloo.whamcloud.com/test_sets/6c0c8652-15c3-11e3-87cb-52540035b04c
