Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4178

Test failure on test suite sanity-hsm, subtest test_200

Details

    • 3
    • 11309

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4c3bcdec-4025-11e3-bfaf-52540035b04c.

      test_200 was only recently enabled by commit 38695729d61958ab10e9e108175298f8a7d40536. before that is was always skipped due to being in ALWAYS_EXCEPT. I'm wondering if it was a mistake to turn this test on at all. maloo reports:

      Failure Rate: 66.00% of last 100 executions [all branches]

      This failure looks not at all related to the change under test, at least in this case.

      The sub-test test_200 failed with the following error:

      request on sanity-hsm is not @@@@@@

      Info required for matching: sanity-hsm 200

      Attachments

        Issue Links

          Activity

            [LU-4178] Test failure on test suite sanity-hsm, subtest test_200

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13825
            Subject: LU-4178 tests: add messages to sanity-hsm
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 75d6d35bcc48eefe490e8b4efd673c58b3373507

            gerrit Gerrit Updater added a comment - James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13825 Subject: LU-4178 tests: add messages to sanity-hsm Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: 75d6d35bcc48eefe490e8b4efd673c58b3373507

            Reopening ticket because there is one more patch for this ticket that has not landed. The patch is at: http://review.whamcloud.com/#/c/13173/

            jamesanunez James Nunez (Inactive) added a comment - Reopening ticket because there is one more patch for this ticket that has not landed. The patch is at: http://review.whamcloud.com/#/c/13173/

            Patches landed to Master.

            jlevi Jodi Levi (Inactive) added a comment - Patches landed to Master.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13206/
            Subject: LU-4178 tests: increase sanity-hsm wait_request_state tiemout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4cb51c76ed2afa168f19e999190a315803580258

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13206/ Subject: LU-4178 tests: increase sanity-hsm wait_request_state tiemout Project: fs/lustre-release Branch: master Current Patch Set: Commit: 4cb51c76ed2afa168f19e999190a315803580258
            jhammond John Hammond added a comment -

            The cancel action does not succeed until the CT reports that the archive is complete. In test_200 we use make_large_for_cancel() which gives us a 100MB file. Because of the 1MB/s bandwidth limit the CT will take at least 100s to archive the file. Since wait_request_state() uses a 100s timeout this make for a very racy test. And since most of these tests still use NFS for the archive there can be additional delays.

            I suggest that we double the timeout in wait_request_state(). Please see http://review.whamcloud.com/13206.

            jhammond John Hammond added a comment - The cancel action does not succeed until the CT reports that the archive is complete. In test_200 we use make_large_for_cancel() which gives us a 100MB file. Because of the 1MB/s bandwidth limit the CT will take at least 100s to archive the file. Since wait_request_state() uses a 100s timeout this make for a very racy test. And since most of these tests still use NFS for the archive there can be additional delays. I suggest that we double the timeout in wait_request_state(). Please see http://review.whamcloud.com/13206 .

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/13206
            Subject: LU-4178 tests: increase sanity-hsm wait_request_state tiemout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 278c1fb845c2cdd7905f717435176e94a4ad7057

            gerrit Gerrit Updater added a comment - John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/13206 Subject: LU-4178 tests: increase sanity-hsm wait_request_state tiemout Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 278c1fb845c2cdd7905f717435176e94a4ad7057

            The analysis and patch look good but I am surprised because the CDT command registration should be synchronous. So after hsm_archive the CDT entry should be recorded.

            jcl jacques-charles lafoucriere added a comment - The analysis and patch look good but I am surprised because the CDT command registration should be synchronous. So after hsm_archive the CDT entry should be recorded.

            John,
            I think that sub-tests 200-202 verify that requests can be canceled when CDT operations have been started but disabled.
            BTW, your finding that sometime (very likely due to some threading cause at Agent and/or MDS side!) the Cancel requests is treated by CDT before the action it targets needs to be addressed and also explains why these failures are not solid.

            Patch that adds verification that the operation has already been registered at CDT before to send the Cancel in sanity-hsm/test_[200-202], is at http://review.whamcloud.com/13173.

            bfaccini Bruno Faccini (Inactive) added a comment - John, I think that sub-tests 200-202 verify that requests can be canceled when CDT operations have been started but disabled. BTW, your finding that sometime (very likely due to some threading cause at Agent and/or MDS side!) the Cancel requests is treated by CDT before the action it targets needs to be addressed and also explains why these failures are not solid. Patch that adds verification that the operation has already been registered at CDT before to send the Cancel in sanity-hsm/test_ [200-202] , is at http://review.whamcloud.com/13173 .

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13173
            Subject: LU-4178 tests: Wait requests to reach CDT before Cancel
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 918ccc4184b91d61b53016f8e3375bf9d4698afc

            gerrit Gerrit Updater added a comment - Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13173 Subject: LU-4178 tests: Wait requests to reach CDT before Cancel Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 918ccc4184b91d61b53016f8e3375bf9d4698afc

            Here's a failure instance: https://testing.hpdd.intel.com/test_sets/b63defe6-8a7e-11e4-9b75-5254006e85c2. Appears to be a transient failure?

            blakecaldwell Blake Caldwell added a comment - Here's a failure instance: https://testing.hpdd.intel.com/test_sets/b63defe6-8a7e-11e4-9b75-5254006e85c2 . Appears to be a transient failure?
            jhammond John Hammond added a comment -

            I have not carefully verified this but here is my idea. From the CT logs https://testing.hpdd.intel.com/test_logs/268d8b72-8704-11e4-87d3-5254006e85c2 I see that the cancel is handled by the CT before the archive request:

            418932953.617511 lhsmtool_posix[19200]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre
            1418932953.626199 lhsmtool_posix[19201]: waiting for message from kernel
            1418932972.656373 lhsmtool_posix[19201]: copytool fs=lustre archive#=2 item_count=1
            1418932972.656586 lhsmtool_posix[19201]: waiting for message from kernel
            1418932972.656652 lhsmtool_posix[19203]: '[0x400000401:0x211:0x0]' action CANCEL reclen 72, cookie=0x549332b1
            1418932972.657448 lhsmtool_posix[19201]: copytool fs=lustre archive#=2 item_count=1
            1418932972.657528 lhsmtool_posix[19201]: waiting for message from kernel
            1418932972.657618 lhsmtool_posix[19204]: '[0x400000401:0x211:0x0]' action ARCHIVE reclen 72, cookie=0x549332b1
            1418932972.659924 lhsmtool_posix[19203]: processing file 'd200.sanity-hsm/f200.sanity-hsm'
            1418932972.659961 lhsmtool_posix[19203]: cancel not implemented for file system '/mnt/lustre'
            1418932972.659984 lhsmtool_posix[19204]: processing file 'd200.sanity-hsm/f200.sanity-hsm'
            1418932972.731465 lhsmtool_posix[19204]: archiving '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
            1418932987.009995 lhsmtool_posix[19204]: saving stripe info of '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' in /home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp.lov
            1418932988.535890 lhsmtool_posix[19204]: start copy of 108003328 bytes from '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
            1418933018.865594 lhsmtool_posix[19204]: %29 
            1418933018.874360 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
            1418933048.043158 lhsmtool_posix[19204]: %58 
            1418933048.050016 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
            1418933078.126741 lhsmtool_posix[19204]: %87 
            1418933078.130387 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
            1418933091.161624 lhsmtool_posix[19204]: copied 108003328 bytes in 102.857832 seconds
            1418933097.333515 lhsmtool_posix[19204]: data archiving for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' done
            1418933098.588120 lhsmtool_posix[19204]: attr file for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' saved to archive '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
            1418933098.590178 lhsmtool_posix[19204]: fsetxattr of 'trusted.hsm' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
            1418933098.590223 lhsmtool_posix[19204]: fsetxattr of 'trusted.link' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
            1418933098.590258 lhsmtool_posix[19204]: fsetxattr of 'trusted.lov' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
            1418933098.590308 lhsmtool_posix[19204]: fsetxattr of 'trusted.lma' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
            1418933098.692510 lhsmtool_posix[19204]: fsetxattr of 'lustre.lov' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported)
            1418933098.692559 lhsmtool_posix[19204]: xattr file for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' saved to archive '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp'
            1418933099.066853 lhsmtool_posix[19204]: symlink '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/shadow/d200.sanity-hsm/f200.sanity-hsm' to '../../0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0' done
            1418933099.068225 lhsmtool_posix[19204]: Action completed, notifying coordinator cookie=0x549332b1, FID=[0x400000401:0x211:0x0], hp_flags=0 err=0
            1418933099.070061 lhsmtool_posix[19204]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' ok (rc=0)
            exiting: Interrupt
            

            I wonder if the wrong ordering is due to thread scheduling on the CT or if it comes from somewhere else. Perhaps the CDT processes the requests in the wrong order. When does a cancel succeed? IIRC when the CT makes a progress report on the action. Is that correct?

            Since the CT does not implement cancel, I also wonder what precisely we are testing here.

            jhammond John Hammond added a comment - I have not carefully verified this but here is my idea. From the CT logs https://testing.hpdd.intel.com/test_logs/268d8b72-8704-11e4-87d3-5254006e85c2 I see that the cancel is handled by the CT before the archive request: 418932953.617511 lhsmtool_posix[19200]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre 1418932953.626199 lhsmtool_posix[19201]: waiting for message from kernel 1418932972.656373 lhsmtool_posix[19201]: copytool fs=lustre archive#=2 item_count=1 1418932972.656586 lhsmtool_posix[19201]: waiting for message from kernel 1418932972.656652 lhsmtool_posix[19203]: '[0x400000401:0x211:0x0]' action CANCEL reclen 72, cookie=0x549332b1 1418932972.657448 lhsmtool_posix[19201]: copytool fs=lustre archive#=2 item_count=1 1418932972.657528 lhsmtool_posix[19201]: waiting for message from kernel 1418932972.657618 lhsmtool_posix[19204]: '[0x400000401:0x211:0x0]' action ARCHIVE reclen 72, cookie=0x549332b1 1418932972.659924 lhsmtool_posix[19203]: processing file 'd200.sanity-hsm/f200.sanity-hsm' 1418932972.659961 lhsmtool_posix[19203]: cancel not implemented for file system '/mnt/lustre' 1418932972.659984 lhsmtool_posix[19204]: processing file 'd200.sanity-hsm/f200.sanity-hsm' 1418932972.731465 lhsmtool_posix[19204]: archiving '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' 1418932987.009995 lhsmtool_posix[19204]: saving stripe info of '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' in /home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp.lov 1418932988.535890 lhsmtool_posix[19204]: start copy of 108003328 bytes from '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' 1418933018.865594 lhsmtool_posix[19204]: %29 1418933018.874360 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s 1418933048.043158 lhsmtool_posix[19204]: %58 1418933048.050016 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s 1418933078.126741 lhsmtool_posix[19204]: %87 1418933078.130387 lhsmtool_posix[19204]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s 1418933091.161624 lhsmtool_posix[19204]: copied 108003328 bytes in 102.857832 seconds 1418933097.333515 lhsmtool_posix[19204]: data archiving for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' to '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' done 1418933098.588120 lhsmtool_posix[19204]: attr file for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' saved to archive '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' 1418933098.590178 lhsmtool_posix[19204]: fsetxattr of 'trusted.hsm' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported) 1418933098.590223 lhsmtool_posix[19204]: fsetxattr of 'trusted.link' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported) 1418933098.590258 lhsmtool_posix[19204]: fsetxattr of 'trusted.lov' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported) 1418933098.590308 lhsmtool_posix[19204]: fsetxattr of 'trusted.lma' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported) 1418933098.692510 lhsmtool_posix[19204]: fsetxattr of 'lustre.lov' on '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' rc=-1 (Operation not supported) 1418933098.692559 lhsmtool_posix[19204]: xattr file for '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' saved to archive '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0_tmp' 1418933099.066853 lhsmtool_posix[19204]: symlink '/home/autotest/.autotest/shared_dir/2014-12-18/080743-70100366414020/arc1/shadow/d200.sanity-hsm/f200.sanity-hsm' to '../../0211/0000/0401/0000/0004/0000/0x400000401:0x211:0x0' done 1418933099.068225 lhsmtool_posix[19204]: Action completed, notifying coordinator cookie=0x549332b1, FID=[0x400000401:0x211:0x0], hp_flags=0 err=0 1418933099.070061 lhsmtool_posix[19204]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x400000401:0x211:0x0' ok (rc=0) exiting: Interrupt I wonder if the wrong ordering is due to thread scheduling on the CT or if it comes from somewhere else. Perhaps the CDT processes the requests in the wrong order. When does a cancel succeed? IIRC when the CT makes a progress report on the action. Is that correct? Since the CT does not implement cancel, I also wonder what precisely we are testing here.

            People

              bfaccini Bruno Faccini (Inactive)
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: