ZFS OSD missing OI iterator for scrub/LFSCK (LU-3489)

[LU-4202] Test failure sanity-lfsck test_8: Expect 'scanning-phase2', but got 'scanning-phase1' Created: 04/Nov/13  Updated: 22/Sep/14  Resolved: 08/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Technical task Priority: Major
Reporter: Maloo Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: prz, zfs

Issue Links:
Related
is related to LU-4556 speed up sanity-lfsck and sanity-scru... Resolved
Rank (Obsolete): 11421

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/b6717d88-4462-11e3-8472-52540035b04c.

The sub-test test_8 failed with the following error:

(22) Expect 'scanning-phase2', but got 'scanning-phase1'

Info required for matching: sanity-lfsck 8



 Comments   
Comment by Andreas Dilger [ 05/Nov/13 ]

Is this only failing when the ZFS scrub code is enabled via http://review.whamcloud.com/7149, so it isn't actually a failure on the current master branch?

We still want to land that patch, but only once this problem is fixed. Is it possible that the ZFS scrub is just slower than ldiskfs?

Comment by nasf (Inactive) [ 06/Nov/13 ]

https://maloo.whamcloud.com/test_sets/b6717d88-4462-11e3-8472-52540035b04c

sanity-lfsck test_9a also failed:
Error: '(7) Expect 'completed', but got 'scanning-phase1''

In my local VM environment the ldiskfs scrub speed is no slower than 5,000 objects/sec; even if the ZFS scrub speed were only 1/10 of that, it should finish in time.
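(Rough arithmetic for scale, not a measurement from these runs: 5,000 / 10 = 500 objects/sec, so even the 20,000 objects used in the run below would scan in roughly 20,000 / 500 = 40 seconds.)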

Comment by Alex Zhuravlev [ 06/Nov/13 ]

Hmm, it passes consistently on my local setup...

Comment by Alex Zhuravlev [ 06/Nov/13 ]

I changed the number of objects in test_8 to 20K and ran the test again:

average_speed_phase1: 5254 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 10445 items/sec
real-time_speed_phase2: N/A
current_position: 10287, [0x200000400:0x277b:0x0], 6445470245971689472
sanity-lfsck test_8: @@@@@@ FAIL: (22) Expect 'scanning-phase2', but got 'scanning-phase1'

Comment by Alex Zhuravlev [ 06/Nov/13 ]

The updated patch has been submitted for testing; it should dump the stats upon a failure.

Comment by Jian Yu [ 26/Mar/14 ]

One more instance on the master branch with FSTYPE=zfs:
https://maloo.whamcloud.com/test_sets/24843856-b48e-11e3-8272-52540035b04c

Comment by nasf (Inactive) [ 27/Mar/14 ]

More failure instances:

https://maloo.whamcloud.com/test_sessions/a7e15cfa-b525-11e3-8598-52540035b04c
https://maloo.whamcloud.com/test_sessions/05cc2dc6-b51d-11e3-b2ed-52540035b04c
https://maloo.whamcloud.com/test_sets/5f9bd3a0-b52d-11e3-b716-52540035b04c
https://maloo.whamcloud.com/test_sessions/f69ba95c-b53b-11e3-b2ed-52540035b04c

Comment by Andreas Dilger [ 18/Apr/14 ]

I'm working on a patch to make sanity-lfsck use wait_update_facet() for most of the checks, so that the test will be more tolerant of slowdowns.
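
For illustration, a minimal sketch of the poll-instead-of-sleep idea (the helper name lfsck_status and the mdd.${MDT_DEV}.lfsck_namespace parameter path are placeholders for this sketch; the actual patch uses the test framework's wait_update_facet() helper rather than an open-coded loop):

    # Placeholder helper: report the current LFSCK namespace status on the MDS.
    lfsck_status() {
            do_facet $SINGLEMDS $LCTL get_param -n \
                    mdd.${MDT_DEV}.lfsck_namespace | awk '/^status/ { print $2 }'
    }

    # Poll for up to 32 seconds instead of checking once after a fixed sleep,
    # so a slower backend (e.g. ZFS) still has time to reach the expected phase.
    for n in $(seq 32); do
            [ "$(lfsck_status)" = "scanning-phase2" ] && break
            sleep 1
    done
    [ "$(lfsck_status)" = "scanning-phase2" ] ||
            error "(22) Expect 'scanning-phase2', but got '$(lfsck_status)'"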

Comment by nasf (Inactive) [ 18/Apr/14 ]

Andreas, replacing the sleeps with wait_update_facet() has already been done in this patch:
http://review.whamcloud.com/9704

Comment by Jodi Levi (Inactive) [ 08/May/14 ]

Patch landed on master. Please reopen the ticket if more work is needed.

Comment by Jessica A. Popp (Inactive) [ 11/Jun/14 ]

Is this fix relevant for landing on b2_5?

Comment by nasf (Inactive) [ 12/Jun/14 ]

The patch http://review.whamcloud.com/9704 is valuable for b2_5 as well, but it cannot be applied to b2_5 directly.

Comment by Jian Yu [ 20/Sep/14 ]

The back-ported patches are tracked under LU-5241.
