[LU-4178] Test failure on test suite sanity-hsm, subtest test_200 Created: 29/Oct/13  Updated: 31/Aug/17  Resolved: 25/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0, Lustre 2.5.4

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM

Issue Links:
Related
is related to LU-3852 sanity-hsm test_251: client26-vm "dd:... Resolved
Severity: 3
Rank (Obsolete): 11309

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4c3bcdec-4025-11e3-bfaf-52540035b04c.

test_200 was only recently enabled by commit 38695729d61958ab10e9e108175298f8a7d40536; before that it was always skipped due to being in ALWAYS_EXCEPT. I'm wondering if it was a mistake to turn this test on at all. Maloo reports:

Failure Rate: 66.00% of last 100 executions [all branches]

This failure does not appear to be related to the change under test at all, at least in this case.

The sub-test test_200 failed with the following error:

request on sanity-hsm is not @@@@@@

Info required for matching: sanity-hsm 200



 Comments   
Comment by Andreas Dilger [ 31/Oct/13 ]

Failed in https://maloo.whamcloud.com/test_sets/96e2dc3e-41d1-11e3-85f0-52540035b04c also.

Comment by Andreas Dilger [ 31/Oct/13 ]

This failed about 10 times in the past week. Should this subtest be disabled again, or is there some obvious fix for this problem?

Comment by jacques-charles lafoucriere [ 31/Oct/13 ]

Tests 200 and 221 use large files, so patch http://review.whamcloud.com/#/c/7915/ should help them pass (though I am not sure this is always the origin of the failure). I will add more messages to help understand the failure (maybe MOUNT2 is missing).

Comment by jacques-charles lafoucriere [ 31/Oct/13 ]

A debugging patch to help understand the failure is at http://review.whamcloud.com/#/c/8113/ (not to be reviewed).

Comment by Andreas Dilger [ 31/Oct/13 ]

For the short term I'd like to just skip these tests; sanity-hsm is causing 50% of test failures. If there is a fix for the problem, the tests can be re-enabled.

Comment by Bob Glossman (Inactive) [ 31/Oct/13 ]

Patch to turn off the offending subtests: http://review.whamcloud.com/8122

Please add comments or suggestions there.
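
For reference, sanity-hsm skips subtests through the ALWAYS_EXCEPT list near the top of the script; the change is roughly of this shape (the exact subtest list and bug annotation shown here are illustrative, not the literal patch):

# bug number for skipped tests: LU-4178
ALWAYS_EXCEPT="200 221 223b $SANITY_HSM_EXCEPT"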

Comment by Jian Yu [ 17/Jan/14 ]

While testing patch http://review.whamcloud.com/8834 on the Lustre b2_5 branch, sanity-hsm tests 200, 221, 250, and 251 failed with a "No space left on device" error:

dd: writing `/mnt/lustre2/d0.sanity-hsm/d200/f.sanity-hsm.200': No space left on device

Maloo report: https://maloo.whamcloud.com/test_sets/6379144a-7f24-11e3-925a-52540035b04c

Comment by Jian Yu [ 13/Feb/14 ]

Patch http://review.whamcloud.com/9134 landed on Lustre b2_5 branch for 2.5.1.

Comment by Peter Jones [ 21/Feb/14 ]

Tests are now skipped for both 2.5.1 and 2.6. We should track fixing the individual tests separately.

Comment by Bruno Faccini (Inactive) [ 11/Mar/14 ]

As for LU-3852, the root cause of this ticket is the very small Lustre filesystem on the Toro client-26vm* nodes, which even the recent use of cleanup_large_files() is not enough to fix. The next action will be to find a way to handle the "small filesystem size vs. file large enough for timing needs" requirement, and also to re-enable test_200/221/223b in the same patch.

Since I am in charge of LU-3852, which has the same root cause but is about test_251, I am also assigning this ticket to myself.
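
One possible way to handle that requirement, sketched below with a hypothetical helper (the name, sizing policy, and limits are illustrative, not the actual fix; skip() and error() are the test-framework.sh helpers), is to cap the file size by the free space available and skip the sub-test when even that is too small:

# hypothetical helper: create the large test file only as big as the
# filesystem can afford, or skip the sub-test when space is too tight
make_capped_test_file() {
	local file=$1
	local want_mb=$2
	local free_kb=$(df -kP $(dirname $file) | awk 'END { print $4 }')
	# never use more than half of the remaining free space
	local max_mb=$((free_kb / 2 / 1024))
	local size_mb=$((want_mb < max_mb ? want_mb : max_mb))
	(( size_mb >= 1 )) ||
		{ skip "not enough free space for a large file"; return 0; }
	dd if=/dev/zero of=$file bs=1M count=$size_mb conv=fsync ||
		error "cannot create $file of ${size_mb}MB"
}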

Comment by Bruno Faccini (Inactive) [ 11/Mar/14 ]

Oops, I did not notice that the ticket had been closed/fixed before I commented and re-assigned it to myself!
So I will continue to track the same underlying issue in LU-3852; test_[200,221,223b] will also be re-enabled in the same patch/fix.

Comment by Andreas Dilger [ 02/Oct/14 ]

Tests are still being skipped due to this bug. Reopening until the tests are passing again.

Comment by Bruno Faccini (Inactive) [ 03/Oct/14 ]

Oops, it seems that despite saying I would re-enable test_[200,221,223b] in the same patch as test_251 for LU-3852, I still have not done it ... Thanks for re-opening so I do not forget about these three sub-tests.

Comment by Bruno Faccini (Inactive) [ 03/Nov/14 ]

test_[200,221,223b] have finally been re-enabled, as expected, in patch http://review.whamcloud.com/12456 for LU-3852.

Comment by Gerrit Updater [ 17/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/8113/
Subject: LU-4178 tests: add messages to sanity-hsm
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 597214dc586ae9d329f2bb819600ac4b9ddfcc06

Comment by John Hammond [ 19/Dec/14 ]

Here is another test_200 failure on v2_6_91_0-49-ge0ece89: https://testing.hpdd.intel.com/test_sets/cc1dae4c-8703-11e4-87d3-5254006e85c2. I don't think this failure is about file size. I ran it locally 10 times and never saw the cancel action succeed.

== sanity-hsm test 200: Register/Cancel archive == 09:08:31 (1419001711)
Only wakeup running copytool agt1 on t
103+0 records in
103+0 records out
108003328 bytes (108 MB) copied, 18.2603 s, 5.9 MB/s
mdt.lustre-MDT0000.hsm_control=disabled
mdt.lustre-MDT0000.hsm_control=enabled
Waiting 100 secs for update
Waiting 90 secs for update
Waiting 80 secs for update
Waiting 70 secs for update
Waiting 60 secs for update
Waiting 50 secs for update
Waiting 40 secs for update
Waiting 30 secs for update
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 100s: wanted 'SUCCEED' got 'STARTED
STARTED'
 sanity-hsm test_200: @@@@@@ FAIL: request on 0x200000401:0x1:0x0 is not SUCCEED on mds1 
  Trace dump:
  = /root/lustre-release/lustre/tests/test-framework.sh:4667:error_noexit()
  = /root/lustre-release/lustre/tests/test-framework.sh:4698:error()
  = lustre/tests/sanity-hsm.sh:697:wait_request_state()
  = lustre/tests/sanity-hsm.sh:3482:test_200()
  = /root/lustre-release/lustre/tests/test-framework.sh:4945:run_one()
  = /root/lustre-release/lustre/tests/test-framework.sh:4982:run_one_logged()
  = /root/lustre-release/lustre/tests/test-framework.sh:4799:run_test()
  = lustre/tests/sanity-hsm.sh:3486:main()
Dumping lctl log to /tmp/test_logs/1419001709/sanity-hsm.test_200.*.1419001832.log
Dumping logs only on local client.
FAIL 200 (121s)
Comment by Bruno Faccini (Inactive) [ 19/Dec/14 ]

Hello John,
I agree with you that the new failures you reported do not look related to file size or free space as before (even if my patch for LU-3852, which recently landed, unfortunately does not correctly detect/report the lack of free space, as you brilliantly demonstrated in LU-6056!); they seem to be the consequence of a new, different problem.
Could this be related to LU-3873 and/or LU-4545, at least with regard to their comments about the status change of CANCEL requests and their handling by the copytool?

Comment by John Hammond [ 20/Dec/14 ]

I have not carefully verified this but here is my idea. From the CT logs https://testing.hpdd.intel.com/test_logs/268d8b72-8704-11e4-87d3-5254006e85c2 I see that the cancel is handled by the CT before the archive request:

1418932953.617511 lhsmtool_posix[19200]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre
1418932953.626199 lhsmtool_posix[19201]: waiting for message from kernel
1418932972.656373 lhsmtool_posix[19201]: copytool fs=lustre archive#=2 item_count=1
1418932972.656586 lhsmtool_posix[19201]: waiting for message from kernel
1418932972.656652 lhsmtool_posix[19203]: '[0x400000401:0x211:0x0]' action CANCEL reclen 72, cookie=0x549332b1
1418932972.657448 lhsmtool_posix[19201]: copytool fs=lustre archive#=2 item_count=1
1418932972.657528 lhsmtool_posix[19201]: waiting for message from kernel
1418932972.657618 lhsmtool_posix[19204]: '[0x400000401:0x211:0x0]' action ARCHIVE reclen 72, cookie=0x549332b1
1418932972.659924 lhsmtool_posix[19203]: processing file 'd200.sanity-hsm/f200.sanity-hsm'
1418932972.659961 lhsmtool_posix[19203]: cancel not implemented for file system '/mnt/lustre'
1418932972.659984 lhsmtool_posix[19204]: processing file 'd200.sanity-hsm/f200.sanity-hsm'
1418932972.731465 lhsmtool_posix[19204]: archiving '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
1418932987.009995 lhsmtool_posix[19204]: saving stripe info of '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' in /home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp.lov
1418932988.535890 lhsmtool_posix[19204]: start copy of 108003328 bytes from '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
1418933018.865594 lhsmtool_posix[19204]: %29 
1418933018.874360 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1418933048.043158 lhsmtool_posix[19204]: %58 
1418933048.050016 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1418933078.126741 lhsmtool_posix[19204]: %87 
1418933078.130387 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1418933091.161624 lhsmtool_posix[19204]: copied 108003328 bytes in 102.857832 seconds
1418933097.333515 lhsmtool_posix[19204]: data archiving for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' done
1418933098.588120 lhsmtool_posix[19204]: attr file for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' saved to archive '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
1418933098.590178 lhsmtool_posix[19204]: fsetxattr of 'trusted.hsm' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
1418933098.590223 lhsmtool_posix[19204]: fsetxattr of 'trusted.link' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
1418933098.590258 lhsmtool_posix[19204]: fsetxattr of 'trusted.lov' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
1418933098.590308 lhsmtool_posix[19204]: fsetxattr of 'trusted.lma' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
1418933098.692510 lhsmtool_posix[19204]: fsetxattr of 'lustre.lov' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
1418933098.692559 lhsmtool_posix[19204]: xattr file for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' saved to archive '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
1418933099.066853 lhsmtool_posix[19204]: symlink '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/shadow/d200.sanity-hsm/f200.sanity-hsm' to '../../0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0' done
1418933099.068225 lhsmtool_posix[19204]: Action completed, notifying coordinator cookie=0x549332b1, FID=[0x400000401:0x211:0x0], hp_flags=0 err=0
1418933099.070061 lhsmtool_posix[19204]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' ok (rc=0)
exiting: Interrupt

I wonder if the wrong ordering is due to thread scheduling on the CT or if it comes from somewhere else. Perhaps the CDT processes the requests in the wrong order. When does a cancel succeed? IIRC when the CT makes a progress report on the action. Is that correct?

Since the CT does not implement cancel, I also wonder what precisely we are testing here.

Comment by Blake Caldwell [ 23/Dec/14 ]

Here's a failure instance: https://testing.hpdd.intel.com/test_sets/b63defe6-8a7e-11e4-9b75-5254006e85c2. Appears to be a transient failure?

Comment by Gerrit Updater [ 23/Dec/14 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13173
Subject: LU-4178 tests: Wait requests to reach CDT before Cancel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 918ccc4184b91d61b53016f8e3375bf9d4698afc

Comment by Bruno Faccini (Inactive) [ 23/Dec/14 ]

John,
I think that sub-tests 200-202 verify that requests can be canceled after they have been submitted but while the CDT is disabled.
By the way, your finding that the Cancel request is sometimes handled by the CDT before the action it targets (very likely due to threading on the Agent and/or MDS side) needs to be addressed, and it also explains why these failures are intermittent.

A patch that adds a check in sanity-hsm test_[200-202] that the operation has already been registered at the CDT before the Cancel is sent is at http://review.whamcloud.com/13173.
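
A rough sketch of the ordering the patch enforces, written with the existing sanity-hsm helpers (the exact state waited for here is illustrative; the real patch and test flow may differ):

$LFS hsm_archive --archive $HSM_ARCHIVE_NUMBER $f
# wait until the ARCHIVE request is actually registered at the CDT ...
wait_request_state $fid ARCHIVE STARTED
# ... and only then issue the cancel, so it cannot overtake its target
$LFS hsm_cancel $f
wait_request_state $fid CANCEL SUCCEED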

Comment by jacques-charles lafoucriere [ 29/Dec/14 ]

The analysis and patch look good, but I am surprised, because CDT command registration should be synchronous: after hsm_archive returns, the CDT entry should already be recorded.
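
For what it is worth, the registration can be checked directly by dumping the coordinator's action list on the MDS (path shown for a single-MDT setup):

# run on the MDS: list the requests currently recorded by the CDT
lctl get_param -n mdt.$FSNAME-MDT0000.hsm.actions
# each entry carries a fid=[...], an action (e.g. ARCHIVE or CANCEL) and a
# status such as WAITING, STARTED, SUCCEED, CANCELED or FAILED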

Comment by Gerrit Updater [ 30/Dec/14 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/13206
Subject: LU-4178 tests: increase sanity-hsm wait_request_state tiemout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 278c1fb845c2cdd7905f717435176e94a4ad7057

Comment by John Hammond [ 30/Dec/14 ]

The cancel action does not succeed until the CT reports that the archive is complete. In test_200 we use make_large_for_cancel(), which gives us a 100MB file. Because of the 1MB/s bandwidth limit, the CT will take at least 100s to archive the file. Since wait_request_state() uses a 100s timeout, this makes for a very racy test. And since most of these tests still use NFS for the archive, there can be additional delays.

I suggest that we double the timeout in wait_request_state(). Please see http://review.whamcloud.com/13206.
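
For concreteness, the numbers from the copytool log above:

# file created by make_large_for_cancel():  108003328 bytes (~103 MiB)
# copytool bandwidth limit:                 1 MiB/s
# minimum archive time:                     ~103 s (plus NFS/archive latency)
# wait_request_state() timeout:             100 s  -> inherently racy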

Comment by Gerrit Updater [ 04/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13206/
Subject: LU-4178 tests: increase sanity-hsm wait_request_state tiemout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4cb51c76ed2afa168f19e999190a315803580258

Comment by Jodi Levi (Inactive) [ 11/Feb/15 ]

Patches landed to Master.

Comment by James Nunez (Inactive) [ 19/Feb/15 ]

Reopening ticket because there is one more patch for this ticket that has not landed. The patch is at: http://review.whamcloud.com/#/c/13173/

Comment by Gerrit Updater [ 20/Feb/15 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13825
Subject: LU-4178 tests: add messages to sanity-hsm
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 75d6d35bcc48eefe490e8b4efd673c58b3373507

Comment by Gerrit Updater [ 20/Feb/15 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13826
Subject: LU-4178 tests: increase sanity-hsm wait_request_state tiemout
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 9c56a8f64d2ab4b8db8b3f38dff2d019b8cd3e40

Comment by James Nunez (Inactive) [ 23/Feb/15 ]

Closing this ticket because sanity-hsm tests 200, 201, and 202 have been passing on master for the past month. If any more work needs to be done for this ticket, please open a new ticket and we'll track the work there.

Comment by Gerrit Updater [ 06/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13173/
Subject: LU-4178 tests: Wait requests to reach CDT before Cancel
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6a31cf92555182a23f14d3385c8c14266887070a
