[LU-5075] Test failure on test suite sanity-lfsck, subtest test_8 Fail to start LFSCK for namespace! Created: 17/May/14  Updated: 25/Aug/14  Resolved: 25/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-5143 Test failure on test suite sanity-lfs... Resolved
Severity: 3
Rank (Obsolete): 14009

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ffe8ec08-dd4b-11e3-9f1e-52540035b04c.

The sub-test test_8 failed with the following error:

(21) Fail to start LFSCK for namespace!

Info required for matching: sanity-lfsck 8



 Comments   
Comment by Andreas Dilger [ 20/May/14 ]

Looks like MDT0001 was evicted by MDT0000 just before the LFSCK started, which caused it to fail.

Comment by nasf (Inactive) [ 31/Jul/14 ]

The eviction is not the root cause of the test_8 failure: test_8 only starts the namespace LFSCK on MDT0, so as long as MDT0 works properly, the test should pass.

        echo "stop $SINGLEMDS"
        stop $SINGLEMDS > /dev/null || error "(18) Fail to stop MDS!"

        #define OBD_FAIL_LFSCK_NO_AUTO          0x160b
(1) ===>        do_facet $SINGLEMDS $LCTL set_param fail_loc=0x160b

        echo "start $SINGLEMDS"
(2) ===>        start $SINGLEMDS $MDT_DEVNAME $MOUNT_OPTS_SCRUB > /dev/null ||
                error "(19) Fail to start MDS!"

(3) ===>        STATUS=$($SHOW_NAMESPACE | awk '/^status/ { print $2 }')
        [ "$STATUS" == "paused" ] ||
                error "(20) Expect 'paused', but got '$STATUS'"

        #define OBD_FAIL_LFSCK_DELAY3           0x1602
(4) ===>        do_facet $SINGLEMDS $LCTL set_param fail_val=2 fail_loc=0x1602

(5) ===>        $START_NAMESPACE || error "(21) Fail to start LFSCK for namespace!"

As shown above, before restarting the MDS at point (2), the test sets fail_loc=0x160b at point (1) to prevent the LFSCK from being restarted automatically. However, the automatic LFSCK restart is only called after recovery. This creates a race condition: when the test reaches point (3), recovery may not have completed yet, so the check at point (3) passes. The test then resets fail_loc to 0x1602 at point (4). If recovery completes after that reset, nothing blocks the automatic restart any longer, so the LFSCK is restarted automatically. When the test reaches point (5), it finds that the LFSCK is already running.

I will make a patch to fix the race in the test scripts.
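
One way to close a race of this kind is to keep fail_loc=0x160b in place until recovery has finished, and only then continue to points (3) and (4). The sketch below illustrates that ordering and is not the patch from this ticket. `wait_until` and `recovery_done` are hypothetical names introduced here; `recovery_done` is a stand-in that simply succeeds on its third call, where a real test would query the MDS recovery status.

```shell
#!/bin/bash

# Hypothetical helper (not taken from the Lustre test framework):
# run a command once per second until it succeeds or the timeout
# (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_until() {
	local timeout=$1
	shift
	local elapsed=0

	while ! "$@"; do
		if [ "$elapsed" -ge "$timeout" ]; then
			return 1
		fi
		sleep 1
		elapsed=$((elapsed + 1))
	done
	return 0
}

# Stand-in for a real recovery check (assumption for illustration only):
# reports "recovery complete" on the third poll.
RECOVERY_POLLS=0
recovery_done() {
	RECOVERY_POLLS=$((RECOVERY_POLLS + 1))
	[ "$RECOVERY_POLLS" -ge 3 ]
}

# Intended ordering: do not touch fail_loc again until this wait succeeds.
if wait_until 10 recovery_done; then
	echo "recovery complete after $RECOVERY_POLLS polls"
else
	echo "timed out waiting for recovery"
fi
```

With the wait placed between points (2) and (3), the status check and the fail_loc reset can only happen after the point where an automatic restart would have been attempted.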

Comment by nasf (Inactive) [ 31/Jul/14 ]

Here is the patch:

http://review.whamcloud.com/11288

Comment by Jian Yu [ 19/Aug/14 ]

More instances on master branch:
https://testing.hpdd.intel.com/test_sets/bbef6d96-244b-11e4-abce-5254006e85c2
https://testing.hpdd.intel.com/test_sets/5d21bc2c-2754-11e4-84f2-5254006e85c2

This is blocking patches from passing review testing.

Comment by nasf (Inactive) [ 25/Aug/14 ]

The patch has landed on master.
