LU-8136: sanity-hsm test_9 fails with 'request on 0x200000405:0x4:0x0 is not SUCCEED on mds1'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.9.0
    • Fix Version/s: Lustre 2.9.0
    • Component/s: None
    • Labels: autotest review-dne
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      sanity-hsm test 9 fails with

      'request on 0x200000405:0x4:0x0 is not SUCCEED on mds1'  
      

      The last thing seen in the test log before the failure is

      CMD: trevis-5vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: trevis-5vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: trevis-5vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: trevis-5vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
       sanity-hsm test_9: @@@@@@ FAIL: request on 0x200000405:0x4:0x0 is not SUCCEED on mds1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4769:error()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:766:wait_request_state()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1010:test_9()
        = /usr/lib64/lustre/tests/test-framework.sh:5033:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5072:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4919:run_test()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:1016:main()
      Dumping lctl log to /logdir/test_logs/2016-05-11/lustre-reviews-el7-x86_64--review-dne-part-2--1_7_1__38816__-70227460739120-004004/sanity-hsm.test_9.*.1462941864.log
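
      The request state being polled can also be checked by hand on the MDS, along the lines of the sketch below (the FID and MDT name are taken from the log above; the awk simply prints the whole matching record rather than relying on a fixed field position):

      fid=0x200000405:0x4:0x0
      # each matching hsm.actions record carries a status=... field
      # (e.g. WAITING/STARTED/SUCCEED/FAILED) for that request
      lctl get_param -n mdt.lustre-MDT0000.hsm.actions |
              awk -v fid="$fid" '$0 ~ fid && /action=ARCHIVE/'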
       

      Is this the same or similar issue as in LU-8111?

      So far, this test is only failing in review-dne-* test groups. Test 9 started failing with this error in the past two days, with seven failures:
      2016-05-10 - https://testing.hpdd.intel.com/test_sets/dd4e43d0-168a-11e6-855a-5254006e85c2
      2016-05-10 - https://testing.hpdd.intel.com/test_sets/ab71c010-16d2-11e6-855a-5254006e85c2
      2016-05-10 - https://testing.hpdd.intel.com/test_sets/6f746a34-1710-11e6-855a-5254006e85c2
      2016-05-10 - https://testing.hpdd.intel.com/test_sets/ae6d43d4-1713-11e6-9b34-5254006e85c2
      2016-05-11 - https://testing.hpdd.intel.com/test_sets/541a3da0-1732-11e6-b5f1-5254006e85c2
      2016-05-11 - https://testing.hpdd.intel.com/test_sets/0923a158-173f-11e6-855a-5254006e85c2
      2016-05-11 - https://testing.hpdd.intel.com/test_sets/6f95b094-1757-11e6-9b34-5254006e85c2

        Activity

          gerrit Gerrit Updater added a comment -

          Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/24542
          Subject: LU-8136 tests: ensure CT is registered with all MDTs
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 8117260d68c2fd5b6107d30f163fec89ac4ba317

          bfaccini Bruno Faccini (Inactive) added a comment -

          Last, concerning the generalized need for a delay to allow the CT to register with all MDTs: after reviewing the recent auto-test failures, I still think it is not required in general and was only needed, as part of this ticket, for sanity-hsm/test_9, because that is the only sub-test that runs copytool_setup() and "lfs hsm_archive" back to back, with no other command in between to give the CT enough time to finish registering with all MDTs.

          But anyway, I will push a patch to implement this in copytool_setup().
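
          As a rough sketch of the kind of check this could add (not the actual patch; it assumes the agent's UUID appears in each MDT's mdt.*.hsm.agents parameter once registration completes, and wait_ct_registered is just an illustrative name):

          # Rough sketch only, not the actual patch: poll every MDT's
          # hsm.agents list until the copytool's agent UUID shows up there,
          # or give up after a timeout. Assumes it runs where the
          # mdt.*.hsm.agents parameters are visible (i.e. on the MDS).
          wait_ct_registered() {
                  local uuid=$1                # agent UUID reported by the CT
                  local timeout=${2:-90}
                  local param elapsed

                  for param in $(lctl get_param -N "mdt.*.hsm.agents"); do
                          elapsed=0
                          until lctl get_param -n "$param" | grep -q "$uuid"; do
                                  sleep 1
                                  (( ++elapsed >= timeout )) && return 1
                          done
                  done
                  return 0
          }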

          bfaccini Bruno Faccini (Inactive) added a comment - Last, concerning the generalized need of delay to allow CT to register with all MDTs, after I have reviewed the recent auto-tests failures I still think it is not required and was only, as part of this ticket, for sanity-hsm/test_9 due to being the only sub-test using copytool_setup() and "lfs hsm_archive" in a raw, without any other cmd in-between to give enough time for CT's full registering. But anyway, I will push a patch to implement this in copytool_setup().

          bfaccini Bruno Faccini (Inactive) added a comment -

          Hello Dmitry, I am happy to have helped.
          But concerning my previous patch proposal, I don't think it can be considered a definitive fix, since it introduces a regression: it no longer reports the wrong/unexpected status in the case of a real failure (as opposed to the case of multiple failed requests plus one that succeeds, as in your situation).
          In your case/patch, can't you detect that the file needs to be restored, then launch the restore and wait for the file to be available/online before starting your split I/Os?

          bfaccini Bruno Faccini (Inactive) added a comment - Hello Dmitry, I am happy to have helped. But concerning my previous patch proposal, I don't think it can be considered as a definitive fix since it introduces a regression by no longer providing the wrong/unexpected status in case of a real failure (vs handling multiple failed requests and one that succeed, in your case). In your case/patch, can't you detect that the file needs to be restored and launch+wait for it to be available/online before starting your splitted I/Os ?

          dmiter Dmitry Eremin (Inactive) added a comment -

          Thanks Bruno,

          You are absolutely right! HSM reads the data in 4MB chunks, even for small files, so we can get many callbacks if the data is read in small portions. Right now the size of the test file is less than the stripe size and we always get a single restore request. But if the test file were larger than the stripe size, or the reads were done in small portions as happens with my patch, there would be several requests and the tests would fail. Your fix resolves this issue. Thanks again.

          dmiter Dmitry Eremin (Inactive) added a comment -

          Thanks Bruno,

          I don't understand how my patch can affect this functionality, but I will look into it. I only parallelize regular I/O; other I/O should use the old pipeline. Even with parallel I/O we should not get multiple requests.

          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Dmitry,
          I have checked the recent auto-test results in Maloo, and it seems that all recent failures of the sanity-hsm test_12*, test_33-36, test_57-58, test_110* and test_222* sub-tests have only occurred during the test sessions for your patches for LU-8964, plus a few with the patch for LU-8709, and always during implicit restore operations whose result interpretation leads to multiple statuses/lines (due to multiple requests??...).
          If I understand correctly, the intent of both tickets/patches is to optimize/parallelize the client's regular/read-ahead I/O, so could it be that your code/patch causes multiple hsm_restore requests to be generated for the same file/FID, of which only one will succeed??
          If so, we may only need to strengthen the sanity-hsm wait_request_state() function to filter for just the wanted status in its check command:

          [root@eagle-31 lustre-release]# diff -urpN /usr/lib64/lustre/tests/sanity-hsm.sh.bfi /usr/lib64/lustre/tests/sanity-hsm.sh.bfi+
          --- /usr/lib64/lustre/tests/sanity-hsm.sh.bfi   2016-12-28 16:41:09.000000000 +0000
          +++ /usr/lib64/lustre/tests/sanity-hsm.sh.bfi+  2016-12-29 10:40:08.000000000 +0000
          @@ -724,7 +724,8 @@ wait_request_state() {
                  local mds=mds$(($mdtidx + 1))
           
                  local cmd="$LCTL get_param -n ${MDT_PREFIX}${mdtidx}.hsm.actions"
          -       cmd+=" | awk '/'$fid'.*action='$request'/ {print \\\$13}' | cut -f2 -d="
          +       cmd+=" | awk '/'$fid'.*action='$request'/ {print \\\$13}' |\
          +            cut -f2 -d= | uniq | grep $state"
           
                  wait_result $mds "$cmd" $state 200 ||
                          error "request on $fid is not $state on $mds"
          [root@eagle-31 lustre-release]# 
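
          For illustration, with a made-up pair of status lines such as the awk/cut stage would extract when an old FAILED request and a new SUCCEED one exist for the same FID, the two variants behave like this (sketch only):

          # what the current pipeline hands to wait_result(): two lines, which
          # can never compare equal to the single expected 'SUCCEED'
          $ printf 'SUCCEED\nFAILED\n'
          SUCCEED
          FAILED
          # with the proposed '| uniq | grep $state' tail, only the wanted
          # status survives, so the comparison can succeed
          $ printf 'SUCCEED\nFAILED\n' | uniq | grep SUCCEED
          SUCCEED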
          

          What do you think ??

          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Well, having a look at the recently failed auto-test logs, it seems that their problem is not the one tracked in this ticket (the CT failing to register with all MDTs quickly enough, requiring a delay to be added).
          My feeling comes from the following set of lines:

          CMD: trevis-35vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0xf:0x0'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
          Changed after 16s: from 'SUCCEED
          FAILED' to ''
          

          which can be found in the main log of each of these tests, and which show that the SUCCEED request state may not have been detected due to some (new?) issue in the wait_request_state() function when interpreting the "hsm/actions" proc file output.

          bfaccini Bruno Faccini (Inactive) added a comment -

          Well, maybe the delay and the CT registration verification should be generalized (in copytool_setup()?), as John H had already suggested when commenting on my first patch.
          I will check the auto-test links you have provided and see whether they show the same behavior as the original ones.

          dmiter Dmitry Eremin (Inactive) added a comment - - edited

          The same failure now happens with test_12*, test_33-36, test_57-58, test_110* and test_222*:
          https://testing.hpdd.intel.com/test_sets/7c0d8752-cc8d-11e6-9816-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/85ff554c-ccd8-11e6-9296-5254006e85c2

          pjones Peter Jones added a comment -

          Landed for 2.9

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20258/
          Subject: LU-8136 tests: allow CT to registers with all MDTs
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: f56b701235674c8661691d73f55570c161fa13cb

          People

            Assignee: wc-triage WC Triage
            Reporter: jamesanunez James Nunez (Inactive)
            Votes: 0
            Watchers: 8