[LU-4613] Failure on test suite sanity-hsm test_12o: request on 0x200000bd1:0xf:0x0 is not SUCCEED on mds1 Created: 11/Feb/14  Updated: 21/Feb/14  Resolved: 21/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM
Environment:

client and server: lustre-master build # 1876
client is SLES11 SP3


Severity: 3
Rank (Obsolete): 12623

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/6ac389ea-90ce-11e3-91ee-52540035b04c.

The sub-test test_12o failed with the following error:

request on 0x200000bd1:0xf:0x0 is not SUCCEED on mds1

Info required for matching: sanity-hsm 12o



 Comments   
Comment by Bruno Faccini (Inactive) [ 12/Feb/14 ]

+1 for https://maloo.whamcloud.com/test_sets/e17437d2-934f-11e3-9f1b-52540035b04c, hit during an auto-test session for patch http://review.whamcloud.com/9212 (the b2_5 version for LU-3834).

sanity-hsm/test_12o (which I introduced for LU-3834!…) reports a failure while waiting for the 2nd diff command to succeed (i.e., after the fault injection of the layout swap has been disabled) on the same file/FID, and we can see the following traces in the test log:

………

CMD: client-12vm3 lctl set_param fail_loc=0x152
fail_loc=0x152
CMD: client-12vm3 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm.policy=+NRA
mdt.lustre-MDT0000.hsm.policy=+NRA
diff: /mnt/lustre/d12o.sanity-hsm/f12o.sanity-hsm: No data available
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm.policy=-NRA
mdt.lustre-MDT0000.hsm.policy=-NRA
CMD: client-12vm3 lctl set_param fail_loc=0
fail_loc=0
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 100 secs for update
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Changed after 17s: from 'FAILED
SUCCEED' to ''
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=

……….

So it seems that, since the test searches the HSM action log for the result of an operation on the same FID as a previous failed operation that may not yet have been purged from the log, it gets confused when it receives both the old and the new results!
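For context, the repeated CMD lines above come from the wait_request_state() polling helper. Here is a minimal sketch of it, reconstructed from the commands visible in the log (the real body in sanity-hsm.sh may differ in detail):

wait_request_state() {
        local fid=$1
        local request=$2        # e.g. RESTORE
        local state=$3          # e.g. SUCCEED

        # extract the status of every matching request from the MDT HSM
        # action log; awk prints one status per matching record
        local cmd="$LCTL get_param -n mdt.$FSNAME-MDT0000.hsm.actions"
        cmd+=" | awk '/'$fid'.*action='$request'/ {print \$13}' | cut -f2 -d="

        # poll (up to 100s) until the command output equals the expected state
        wait_result $SINGLEMDS "$cmd" "$state" 100 ||
                error "request on $fid is not $state on mds1"
}

This makes the failure mode visible: while the action log still holds both the old FAILED record and the new SUCCEED record for the same FID, awk prints both statuses (hence the 'FAILED'/'SUCCEED' pair in the trace above), and that two-line value never compares equal to the single expected state, so the wait times out.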

Here is the current sanity-hsm/test_12o code:

test_12o() {
        # test needs a running copytool
        copytool_setup

        mkdir -p $DIR/$tdir
        local f=$DIR/$tdir/$tfile
        local fid=$(copy_file /etc/hosts $f)

        $LFS hsm_archive --archive $HSM_ARCHIVE_NUMBER $f
        wait_request_state $fid ARCHIVE SUCCEED
        $LFS hsm_release $f || error "release of $f failed"

#define OBD_FAIL_MDS_HSM_SWAP_LAYOUTS           0x152
        do_facet $SINGLEMDS lctl set_param fail_loc=0x152

        # set no retry action mode
        cdt_set_no_retry

        diff -q /etc/hosts $f
        local st=$?

        # we check we had a restore failure
        wait_request_state $fid RESTORE FAILED

        [[ $st -eq 0 ]] && error "Restore must fail"

        # remove no retry action mode
        cdt_clear_no_retry

        # check file is still released
        check_hsm_flags $f "0x0000000d"

        # retry w/o failure injection
        do_facet $SINGLEMDS lctl set_param fail_loc=0

        diff -q /etc/hosts $f
        st=$?

        # we check we had a restore done
        wait_request_state $fid RESTORE SUCCEED

        [[ $st -eq 0 ]] || error "Restored file differs"

        copytool_cleanup
}
run_test 12o "Layout-swap failure during Restore leaves file released"

I will push a patch to introduce a "cdt_purge + wait_for_grace_delay" step between the failing (due to fault injection) and the succeeding (normal conditions) RESTORE attempts, to let the old/1st result disappear from the log. The grace_delay currently set by sanity-hsm is 10s, which is longer than the interval between the two RESTORE requests here, so the old record is still present when the second result is checked.
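For reference, a sketch of how such a purge could slot into test_12o between the two RESTORE attempts, using the existing cdt_purge and wait_for_grace_delay helpers named above (the exact placement in the actual patch may differ):

        # remove no retry action mode
        cdt_clear_no_retry

        # proposed addition: flush the old FAILED record for this FID from
        # the coordinator's action log, then let the purge grace delay
        # elapse, so the later wait_request_state() call only sees the
        # result of the new RESTORE request
        cdt_purge
        wait_for_grace_delay

        # check file is still released
        check_hsm_flags $f "0x0000000d"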

Comment by Bruno Faccini (Inactive) [ 12/Feb/14 ]

Patch is at http://review.whamcloud.com/9235.

Comment by Jian Yu [ 18/Feb/14 ]

While testing patch http://review.whamcloud.com/9288 on the Lustre b2_5 branch, the same failure occurred:
https://maloo.whamcloud.com/test_sets/a20caf5e-984d-11e3-8a38-52540035b04c

As per Bruno, this regression was introduced by the patch for LU-3834 that landed in Lustre b2_5 build #25.

So the failure also needs to be fixed on the Lustre b2_5 branch.

Comment by Jian Yu [ 18/Feb/14 ]

Patch for b2_5 branch: http://review.whamcloud.com/9295

Comment by Bob Glossman (Inactive) [ 18/Feb/14 ]

hit again:
https://maloo.whamcloud.com/test_sets/d436c73c-9855-11e3-8828-52540035b04c

Comment by Peter Jones [ 21/Feb/14 ]

Landed for 2.5.1 and 2.6
