[LU-4613] Failure on test suite sanity-hsm test_12o: request on 0x200000bd1:0xf:0x0 is not SUCCEED on mds1 Created: 11/Feb/14 Updated: 21/Feb/14 Resolved: 21/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM | | |
| Environment: | client and server: lustre-master build # 1876 |
| Severity: | 3 |
| Rank (Obsolete): | 12623 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/6ac389ea-90ce-11e3-91ee-52540035b04c. The sub-test test_12o failed with the following error:
request on 0x200000bd1:0xf:0x0 is not SUCCEED on mds1
Info required for matching: sanity-hsm 12o |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 12/Feb/14 ] |
|
+1 for https://maloo.whamcloud.com/test_sets/e17437d2-934f-11e3-9f1b-52540035b04c, hit during the auto-test session for patch http://review.whamcloud.com/9212 (the b2_5 version of the sanity-hsm/test_12o patch I introduced):
………
CMD: client-12vm3 lctl set_param fail_loc=0x152
fail_loc=0x152
CMD: client-12vm3 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm.policy=+NRA
mdt.lustre-MDT0000.hsm.policy=+NRA
diff: /mnt/lustre/d12o.sanity-hsm/f12o.sanity-hsm: No data available
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm.policy=-NRA
mdt.lustre-MDT0000.hsm.policy=-NRA
CMD: client-12vm3 lctl set_param fail_loc=0
fail_loc=0
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 100 secs for update
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 90 secs for update
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Changed after 17s: from 'FAILED
SUCCEED' to ''
CMD: client-12vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000bd1:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
……….
So it seems that, because the test searches for the HSM operation result for the same FID as a previous failed operation, which may not have been purged from the log yet, it gets puzzled when receiving both the old and the new results (see the sketch after the test code below). Here is the current sanity-hsm/test_12o code:
test_12o() {
# test needs a running copytool
copytool_setup
mkdir -p $DIR/$tdir
local f=$DIR/$tdir/$tfile
local fid=$(copy_file /etc/hosts $f)
$LFS hsm_archive --archive $HSM_ARCHIVE_NUMBER $f
wait_request_state $fid ARCHIVE SUCCEED
$LFS hsm_release $f || error "release of $f failed"
#define OBD_FAIL_MDS_HSM_SWAP_LAYOUTS 0x152
do_facet $SINGLEMDS lctl set_param fail_loc=0x152
# set no retry action mode
cdt_set_no_retry
diff -q /etc/hosts $f
local st=$?
# we check we had a restore failure
wait_request_state $fid RESTORE FAILED
[[ $st -eq 0 ]] && error "Restore must fail"
# remove no retry action mode
cdt_clear_no_retry
# check file is still released
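# (0x0000000d = HS_EXISTS | HS_RELEASED | HS_ARCHIVED)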
check_hsm_flags $f "0x0000000d"
# retry w/o failure injection
do_facet $SINGLEMDS lctl set_param fail_loc=0
diff -q /etc/hosts $f
st=$?
# we check we had a restore done
wait_request_state $fid RESTORE SUCCEED
[[ $st -eq 0 ]] || error "Restored file differs"
copytool_cleanup
}
run_test 12o "Layout-swap failure during Restore leaves file released"
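To see how two log entries for one FID confuse the check, here is a minimal standalone sketch of the status extraction the test performs while polling (the sample hsm.actions lines below are fabricated and abbreviated; they only assume that the status= token lands in awk field $13, as in the commands logged above):
cat <<'EOF' | awk '/0x200000bd1:0xf:0x0.*action=RESTORE/ {print $13}' | cut -f2 -d=
lrh=[type=10680000 len=136 idx=2] fid=[0x200000bd1:0xf:0x0] dfid=[0x200000bd1:0xf:0x0] compound/cookie=0x1/0x1 action=RESTORE archive#=2 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datas= status=FAILED
lrh=[type=10680000 len=136 idx=3] fid=[0x200000bd1:0xf:0x0] dfid=[0x200000bd1:0xf:0x0] compound/cookie=0x2/0x2 action=RESTORE archive#=2 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datas= status=SUCCEED
EOF
This prints two lines, FAILED then SUCCEED, which matches the 'FAILED SUCCEED' value seen in the log above; a comparison against the single expected value SUCCEED can then never match.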
I will push a patch to introduce a "cdt_purge + wait_for_grace_delay" step between the failing (due to fault injection) and the succeeding (normal conditions) RESTORE attempts, to allow the old/1st result to disappear from the log. The grace_delay is currently set to 10s by sanity-hsm, which is longer than the interval between the two RESTORE requests here, so the first result is still in the log when the second is checked.
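A minimal sketch of where that purge and wait might land in test_12o (placement is inferred from the description above; the actual change is in the patch linked in the next comment):
# ... after the failed RESTORE has been verified ...
cdt_clear_no_retry
# sketch: drop the FAILED entry from the action log and wait out the
# grace delay, so the retried RESTORE is the only result for this FID
cdt_purge
wait_for_grace_delay
# check file is still released
check_hsm_flags $f "0x0000000d"
|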
| Comment by Bruno Faccini (Inactive) [ 12/Feb/14 ] |
|
Patch is at http://review.whamcloud.com/9235. |
| Comment by Jian Yu [ 18/Feb/14 ] |
|
While testing patch http://review.whamcloud.com/9288 on the Lustre b2_5 branch, the same failure occurred. As per Bruno, this regression failure was introduced by the patch for test_12o, so it also needs to be fixed on the Lustre b2_5 branch. |
| Comment by Jian Yu [ 18/Feb/14 ] |
|
Patch for b2_5 branch: http://review.whamcloud.com/9295 |
| Comment by Bob Glossman (Inactive) [ 18/Feb/14 ] |
|
hit again: |
| Comment by Peter Jones [ 21/Feb/14 ] |
|
Landed for 2.5.1 and 2.6 |