[LU-4178] Test failure on test suite sanity-hsm, subtest test_200 Created: 29/Oct/13 Updated: 31/Aug/17 Resolved: 25/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HSM | ||
| Issue Links: |
|
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 11309 | ||||||||
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4c3bcdec-4025-11e3-bfaf-52540035b04c.

test_200 was only recently enabled by commit 38695729d61958ab10e9e108175298f8a7d40536; before that it was always skipped because it was in ALWAYS_EXCEPT. I'm wondering if it was a mistake to turn this test on at all.

maloo reports: Failure Rate: 66.00% of last 100 executions [all branches]

This failure does not appear to be related to the change under test, at least in this case. The sub-test test_200 failed with the following error:
Info required for matching: sanity-hsm 200 |
| Comments |
| Comment by Andreas Dilger [ 31/Oct/13 ] |
|
Failed in https://maloo.whamcloud.com/test_sets/96e2dc3e-41d1-11e3-85f0-52540035b04c also. |
| Comment by Andreas Dilger [ 31/Oct/13 ] |
|
This failed about 10 times in the past week. Should this subtest be disabled again, or is there some obvious fix for this problem? |
| Comment by jacques-charles lafoucriere [ 31/Oct/13 ] |
|
Tests 200 and 221 use large files, so patch http://review.whamcloud.com/#/c/7915/ will help them pass (though I am not sure it is always the origin of the failure). I will add more messages to understand the failure (maybe MOUNT2 is missing) |
| Comment by jacques-charles lafoucriere [ 31/Oct/13 ] |
|
Diagnostic patch to understand the failure is at http://review.whamcloud.com/#/c/8113/ (not to be reviewed) |
| Comment by Andreas Dilger [ 31/Oct/13 ] |
|
For the short term I'd like to just skip these tests. The sanity-hsm test suite is causing 50% test failures. Once there is a fix for the problem, the tests can be re-enabled. |
| Comment by Bob Glossman (Inactive) [ 31/Oct/13 ] |
|
Patch to turn off the offending subtests: http://review.whamcloud.com/8122. Please add comments or suggestions there. |
| Comment by Jian Yu [ 17/Jan/14 ] |
|
While testing patch http://review.whamcloud.com/8834 on the Lustre b2_5 branch, sanity-hsm tests 200, 221, 250 and 251 failed with a "No space left on device" error:

dd: writing `/mnt/lustre2/d0.sanity-hsm/d200/f.sanity-hsm.200': No space left on device

Maloo report: https://maloo.whamcloud.com/test_sets/6379144a-7f24-11e3-925a-52540035b04c |
| Comment by Jian Yu [ 13/Feb/14 ] |
|
Patch http://review.whamcloud.com/9134 landed on Lustre b2_5 branch for 2.5.1. |
| Comment by Peter Jones [ 21/Feb/14 ] |
|
Tests are now skipped for both 2.5.1 and 2.6. We should track fixing the individual tests separately. |
| Comment by Bruno Faccini (Inactive) [ 11/Mar/14 ] |
|
Like for Since I am in charge of |
| Comment by Bruno Faccini (Inactive) [ 11/Mar/14 ] |
|
Oops, I did not notice the ticket had been closed/fixed before I commented and re-assigned it to myself! |
| Comment by Andreas Dilger [ 02/Oct/14 ] |
|
Tests are still being skipped due to this bug. Reopening until the tests are passing again. |
| Comment by Bruno Faccini (Inactive) [ 03/Oct/14 ] |
|
Oops, it seems that despite saying I would re-enable test_[200,221,223b] in the same patch as test_251 for |
| Comment by Bruno Faccini (Inactive) [ 03/Nov/14 ] |
|
test_[200,221,223b] have finally, and as expected, been re-enabled in patch http://review.whamcloud.com/12456 for |
| Comment by Gerrit Updater [ 17/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/8113/ |
| Comment by John Hammond [ 19/Dec/14 ] |
|
Here is another test_200 failure on v2_6_91_0-49-ge0ece89: https://testing.hpdd.intel.com/test_sets/cc1dae4c-8703-11e4-87d3-5254006e85c2. I don't think that this failure is about file size. I ran this locally 10 times and never saw the cancel action succeed.

== sanity-hsm test 200: Register/Cancel archive == 09:08:31 (1419001711)
Only wakeup running copytool agt1 on t
103+0 records in
103+0 records out
108003328 bytes (108 MB) copied, 18.2603 s, 5.9 MB/s
mdt.lustre-MDT0000.hsm_control=disabled
mdt.lustre-MDT0000.hsm_control=enabled
Waiting 100 secs for update
Waiting 90 secs for update
Waiting 80 secs for update
Waiting 70 secs for update
Waiting 60 secs for update
Waiting 50 secs for update
Waiting 40 secs for update
Waiting 30 secs for update
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 100s: wanted 'SUCCEED' got 'STARTED STARTED'
sanity-hsm test_200: @@@@@@ FAIL: request on 0x200000401:0x1:0x0 is not SUCCEED on mds1
Trace dump:
= /root/lustre-release/lustre/tests/test-framework.sh:4667:error_noexit()
= /root/lustre-release/lustre/tests/test-framework.sh:4698:error()
= lustre/tests/sanity-hsm.sh:697:wait_request_state()
= lustre/tests/sanity-hsm.sh:3482:test_200()
= /root/lustre-release/lustre/tests/test-framework.sh:4945:run_one()
= /root/lustre-release/lustre/tests/test-framework.sh:4982:run_one_logged()
= /root/lustre-release/lustre/tests/test-framework.sh:4799:run_test()
= lustre/tests/sanity-hsm.sh:3486:main()
Dumping lctl log to /tmp/test_logs/1419001709/sanity-hsm.test_200.*.1419001832.log
Dumping logs only on local client.
FAIL 200 (121s) |
| Comment by Bruno Faccini (Inactive) [ 19/Dec/14 ] |
|
Hello John, |
| Comment by John Hammond [ 20/Dec/14 ] |
|
I have not carefully verified this but here is my idea. From the CT logs https://testing.hpdd.intel.com/test_logs/268d8b72-8704-11e4-87d3-5254006e85c2 I see that the cancel is handled by the CT before the archive request:

1418932953.617511 lhsmtool_posix[19200]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre
1418932953.626199 lhsmtool_posix[19201]: waiting for message from kernel
1418932972.656373 lhsmtool_posix[19201]: copytool fs=lustre archive#=2 item_count=1
1418932972.656586 lhsmtool_posix[19201]: waiting for message from kernel
1418932972.656652 lhsmtool_posix[19203]: '[0x400000401:0x211:0x0]' action CANCEL reclen 72, cookie=0x549332b1
1418932972.657448 lhsmtool_posix[19201]: copytool fs=lustre archive#=2 item_count=1
1418932972.657528 lhsmtool_posix[19201]: waiting for message from kernel
1418932972.657618 lhsmtool_posix[19204]: '[0x400000401:0x211:0x0]' action ARCHIVE reclen 72, cookie=0x549332b1
1418932972.659924 lhsmtool_posix[19203]: processing file 'd200.sanity-hsm/f200.sanity-hsm'
1418932972.659961 lhsmtool_posix[19203]: cancel not implemented for file system '/mnt/lustre'
1418932972.659984 lhsmtool_posix[19204]: processing file 'd200.sanity-hsm/f200.sanity-hsm'
1418932972.731465 lhsmtool_posix[19204]: archiving '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
1418932987.009995 lhsmtool_posix[19204]: saving stripe info of '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' in /home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp.lov
1418932988.535890 lhsmtool_posix[19204]: start copy of 108003328 bytes from '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
1418933018.865594 lhsmtool_posix[19204]: %29
1418933018.874360 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1418933048.043158 lhsmtool_posix[19204]: %58
1418933048.050016 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1418933078.126741 lhsmtool_posix[19204]: %87
1418933078.130387 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1418933091.161624 lhsmtool_posix[19204]: copied 108003328 bytes in 102.857832 seconds
1418933097.333515 lhsmtool_posix[19204]: data archiving for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' done
1418933098.588120 lhsmtool_posix[19204]: attr file for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' saved to archive '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
1418933098.590178 lhsmtool_posix[19204]: fsetxattr of 'trusted.hsm' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
1418933098.590223 lhsmtool_posix[19204]: fsetxattr of 'trusted.link' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
1418933098.590258 lhsmtool_posix[19204]: fsetxattr of 'trusted.lov' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
1418933098.590308 lhsmtool_posix[19204]: fsetxattr of 'trusted.lma' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
1418933098.692510 lhsmtool_posix[19204]: fsetxattr of 'lustre.lov' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
1418933098.692559 lhsmtool_posix[19204]: xattr file for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' saved to archive '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
1418933099.066853 lhsmtool_posix[19204]: symlink '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/shadow/d200.sanity-hsm/f200.sanity-hsm' to '../../0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0' done
1418933099.068225 lhsmtool_posix[19204]: Action completed, notifying coordinator cookie=0x549332b1, FID=[0x400000401:0x211:0x0], hp_flags=0 err=0
1418933099.070061 lhsmtool_posix[19204]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' ok (rc=0)
exiting: Interrupt

I wonder if the wrong ordering is due to thread scheduling on the CT or if it comes from somewhere else. Perhaps the CDT processes the requests in the wrong order. When does a cancel succeed? IIRC when the CT makes a progress report on the action. Is that correct? Since the CT does not implement cancel, I also wonder what precisely we are testing here. |
| Comment by Blake Caldwell [ 23/Dec/14 ] |
|
Here's a failure instance: https://testing.hpdd.intel.com/test_sets/b63defe6-8a7e-11e4-9b75-5254006e85c2. Appears to be a transient failure? |
| Comment by Gerrit Updater [ 23/Dec/14 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13173 |
| Comment by Bruno Faccini (Inactive) [ 23/Dec/14 ] |
|
John, a patch that adds verification that the operation has already been registered at the CDT before sending the Cancel in sanity-hsm/test_[200-202] is at http://review.whamcloud.com/13173. |
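The idea of that patch can be illustrated with a self-contained shell sketch. Everything here is a mock for illustration: `cdt_request_state` and `archive_then_cancel` are hypothetical helpers standing in for the real `lctl get_param mdt.*.hsm.actions` query and `lfs hsm_archive`/`lfs hsm_cancel` calls in sanity-hsm.sh, which require a live Lustre setup.

```shell
#!/bin/sh
# Sketch: after requesting an archive, poll until the coordinator (CDT)
# has registered the request before sending the cancel, so the cancel
# cannot be handled ahead of the archive.

REGISTERED_AT=3   # mock: the CDT "records" the request on the 3rd poll

cdt_request_state() {
    # $1 = number of polls made so far; mock of the real lctl query
    if [ "$1" -lt "$REGISTERED_AT" ]; then echo WAITING; else echo STARTED; fi
}

archive_then_cancel() {
    # a real test would first run: $LFS hsm_archive --archive <n> <file>
    poll=0
    while [ "$poll" -lt 10 ]; do
        poll=$((poll + 1))
        if [ "$(cdt_request_state "$poll")" = "STARTED" ]; then
            # the request is registered at the CDT; only now send the cancel
            # (a real test would run: $LFS hsm_cancel <file>)
            echo "cancel sent after $poll polls"
            return 0
        fi
    done
    echo "request never registered" >&2
    return 1
}

archive_then_cancel   # prints "cancel sent after 3 polls"
```

In the real test this corresponds to waiting for the archive request to appear as STARTED in the CDT's action list before issuing the cancel.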
| Comment by jacques-charles lafoucriere [ 29/Dec/14 ] |
|
The analysis and patch look good, but I am surprised, because CDT command registration should be synchronous: after hsm_archive returns, the CDT entry should already be recorded. |
| Comment by Gerrit Updater [ 30/Dec/14 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/13206 |
| Comment by John Hammond [ 30/Dec/14 ] |
|
The cancel action does not succeed until the CT reports that the archive is complete. In test_200 we use make_large_for_cancel(), which gives us a 100MB file. Because of the 1MB/s bandwidth limit, the CT will take at least 100s to archive the file. Since wait_request_state() uses a 100s timeout, this makes for a very racy test. And since most of these tests still use NFS for the archive, there can be additional delays. I suggest that we double the timeout in wait_request_state(). Please see http://review.whamcloud.com/13206. |
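The race John describes can be sketched in a self-contained shell mock. The helpers below are illustrations only, not the actual wait_request_state() from sanity-hsm.sh: `get_request_state` fakes the `lctl get_param mdt.*.hsm.actions` query, and the 110s "archive duration" stands in for a 100MB copy at the 1MB/s bandwidth cap plus overhead.

```shell
#!/bin/sh
# Mock of the timeout race: an archive that needs ~110s vs. a 100s wait budget.

get_request_state() {
    # $1 = seconds elapsed; the mock archive "completes" at 110s
    if [ "$1" -lt 110 ]; then echo STARTED; else echo SUCCEED; fi
}

wait_request_state() {
    expected=$1
    timeout=$2      # total seconds to wait
    interval=10     # seconds between polls (the real test sleeps; this mock does not)
    elapsed=0
    state=UNKNOWN
    while [ "$elapsed" -lt "$timeout" ]; do
        state=$(get_request_state "$elapsed")
        [ "$state" = "$expected" ] && return 0
        elapsed=$((elapsed + interval))
    done
    echo "Update not seen after ${timeout}s: wanted '$expected' got '$state'" >&2
    return 1
}

wait_request_state SUCCEED 100 && echo "100s timeout: PASS" || echo "100s timeout: FAIL"
wait_request_state SUCCEED 200 && echo "200s timeout: PASS" || echo "200s timeout: FAIL"
```

With the original 100s budget the mock archive is still STARTED when the wait gives up, reproducing the "wanted 'SUCCEED' got 'STARTED'" failure mode; doubling the budget to 200s lets the archive finish first, which is the fix proposed in 13206.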
| Comment by Gerrit Updater [ 04/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13206/ |
| Comment by Jodi Levi (Inactive) [ 11/Feb/15 ] |
|
Patches landed to Master. |
| Comment by James Nunez (Inactive) [ 19/Feb/15 ] |
|
Reopening ticket because there is one more patch for this ticket that has not landed. The patch is at: http://review.whamcloud.com/#/c/13173/ |
| Comment by Gerrit Updater [ 20/Feb/15 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13825 |
| Comment by Gerrit Updater [ 20/Feb/15 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13826 |
| Comment by James Nunez (Inactive) [ 23/Feb/15 ] |
|
Closing ticket because sanity-hsm tests 200, 201 and 202 are passing on master for the past month. If any more work needs to be done for this ticket, please open a new ticket and we'll track the work there. |
| Comment by Gerrit Updater [ 06/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13173/ |