
Test failure sanity-hsm test_90: requests did not complete

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.7.0, Lustre 2.5.4
    • Lustre 2.7.0, Lustre 2.5.4
    • 3
    • 15251

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      https://testing.hpdd.intel.com/test_sets/90f1f8a2-1ff0-11e4-8610-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/29b71c62-020e-11e4-a47c-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/e775fca2-021b-11e4-9435-5254006e85c2

      The sub-test test_90 failed with the following error:

      requests did not complete

      Info required for matching: sanity-hsm 90

          Activity

            [LU-5474] Test failure sanity-hsm test_90: requests did not complete
            pjones Peter Jones added a comment -

            Landed for 2.7


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12069/
            Subject: LU-5474 tests: sanity-hsm test_90 use local HSM_ARCHIVE
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8641822bdb248a034bf4065c7e2e60cdb7a47041

            jamesanunez James Nunez (Inactive) added a comment -

            I've updated sanity-hsm test 90 to use a local disk for the archive, as was implemented in test 40. In the last two runs with this change, test 90 failed to archive any files from the file list: https://testing.hpdd.intel.com/test_sessions/5ac85a16-6ee0-11e4-81e7-5254006e85c2 and https://testing.hpdd.intel.com/test_sets/616b778e-6b77-11e4-b1b4-5254006e85c2

            For test 90, the archive number passed to the copytool does not match the archive number in the archive requests. The copytool is started with --archive 2, but the requests are waiting on archive #1. From the client test log:

            CMD: shadow-3vm9 lhsmtool_posix  --daemon --hsm-root /tmp/d90.sanity-hsm --archive 2 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-11-17/lustre-reviews-el6-x86_64--review-dne-part-2--2_9_1__28407__-70078147694660-145306/sanity-hsm.test_90.copytool2_log.shadow-3vm9.log 2>&1
            ...
            Changed after 8s: from 'lrh=[type=10680000 len=136 idx=1/1846] fid=[0x400000401:0x1c4:0x0] dfid=[0x400000401:0x1c4:0x0] compound/cookie=0x546ab2b6/0x546ab2b6 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
            lrh=[type=10680000 len=136 idx=1/1847] fid=[0x400000401:0x1c5:0x0] dfid=[0x400000401:0x1c5:0x0] compound/cookie=0x546ab2b6/0x546ab2b7 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]

            On the MDS, the log reports that it cannot find an agent for archive 1, which is accurate, since the copytool was started for archive #2:

            LustreError: 14430:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) lustre-MDT0000: Cannot find agent for archive 1: rc = -11

            Yet test 40 uses essentially the same calls to start the copytool, and its archive numbers are consistent. One difference is that test 90 archives from a file list and test 40 does not.
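
The mismatch described above can be checked mechanically from the two log excerpts. The following is only a sketch: the function name `unserved_archive_ids` is hypothetical, and it does a literal comparison of the `--archive N` argument against the `archive#=N` fields, without modelling any default-archive behaviour the coordinator or copytool may have.

```python
import re


def unserved_archive_ids(copytool_cmd: str, request_log: str) -> set:
    """Return archive IDs that appear in requests but not on the copytool command line.

    Hypothetical helper for illustration; it only inspects the text of the logs.
    """
    # IDs the copytool was started with, e.g. "--archive 2"
    served = {int(n) for n in re.findall(r"--archive\s+(\d+)", copytool_cmd)}
    # IDs the queued requests refer to, e.g. "archive#=1"
    requested = {int(n) for n in re.findall(r"archive#=(\d+)", request_log)}
    return requested - served


cmd = ("lhsmtool_posix --daemon --hsm-root /tmp/d90.sanity-hsm "
       "--archive 2 --bandwidth 1 /mnt/lustre")
log = ("action=ARCHIVE archive#=1 flags=0x0 status=WAITING\n"
       "action=ARCHIVE archive#=1 flags=0x0 status=WAITING")

print(unserved_archive_ids(cmd, log))  # {1}
```

A non-empty result corresponds to the situation in the MDS log, where requests for archive 1 have no registered agent.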

            gerrit Gerrit Updater added a comment -

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/12069
            Subject: LU-5474 tests: sanity-hsm test_90 use local HSM_ARCHIVE
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 9
            Commit: 7a0b4705f130f318acc0aaf6297cd4e0068060e5

            jamesanunez James Nunez (Inactive) added a comment -

            Andreas,
            I modified http://review.whamcloud.com/#/c/12069/ to use $TMP for the archive.

            adilger Andreas Dilger added a comment -

            Bruno, can you please make a patch to sanity-hsm test_90() to use $TMP for the HSM archive, as was done with test_40()? Is there any reason not to make all HSM tests use $TMP for the archive by default, and only have some tests use the NFS shared directory when needed (a very large archive, or HSM failover testing)? I think that would speed up all the HSM tests, unless it filled up $TMP and caused test failures.
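
The default-to-$TMP policy suggested above, with the shared directory as a fallback, could look roughly like the sketch below. Everything here is an assumption for illustration: `choose_archive_root` and its parameters are hypothetical names, not part of the Lustre test framework, and the free-space check is just one way to guard against filling up $TMP.

```python
import shutil


def choose_archive_root(tmp_dir: str, shared_dir: str, needed_bytes: int,
                        require_shared: bool = False) -> str:
    """Pick a directory for the HSM archive (hypothetical policy sketch).

    Prefer the local tmp_dir; use shared_dir when the test explicitly needs
    shared storage (e.g. failover testing) or tmp_dir lacks free space.
    """
    if require_shared:
        return shared_dir
    free_bytes = shutil.disk_usage(tmp_dir).free
    return tmp_dir if free_bytes >= needed_bytes else shared_dir
```

A test needing a very large archive would pass a large `needed_bytes` and fall through to the shared directory; a failover test would set `require_shared=True`.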
            yong.fan nasf (Inactive) added a comment - Another failure instance on b2_5: https://testing.hpdd.intel.com/test_sets/d03f6910-681e-11e4-acbe-5254006e85c2

            bfaccini Bruno Faccini (Inactive) added a comment -

            Andreas, yes, this is tracked in LU-3939. I recall that I have already made a similar, related comment about test_90 there.

            adilger Andreas Dilger added a comment -

            I recall seeing a patch to move the HSM archive into $TMP instead of the shared directory. That might speed up the testing, as long as the archive doesn't need to be too large.

            jamesanunez James Nunez (Inactive) added a comment -

            Test patch at http://review.whamcloud.com/#/c/12069/

            This patch is to collect data on sanity-hsm test 90 failures. The previous logs do not point to a problem in either the test or the Lustre/HSM code. It just looks like archiving 51 files sometimes takes longer than 100 seconds. The patch checks whether allowing more time lets all the files be archived.
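
To make the timing argument concrete: archiving 51 files within 100 seconds requires an average of about 1.96 seconds per file or better. The sketch below shows a generic poll-with-deadline loop and a simulated run. The names `wait_until_done`, `FakeClock` and `simulate`, and the 2.5 seconds-per-file rate, are hypothetical values chosen for illustration and are not taken from the test logs or from sanity-hsm itself.

```python
import time
from typing import Callable


def wait_until_done(pending: Callable[[], int], timeout_s: float,
                    interval_s: float = 1.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll pending() until it returns 0 or timeout_s seconds have elapsed."""
    deadline = clock() + timeout_s
    while pending() > 0:
        if clock() >= deadline:
            return False  # analogous to "requests did not complete"
        sleep(interval_s)
    return True


class FakeClock:
    """Simulated clock so the example runs instantly (illustration only)."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulate(timeout_s: float, files: int = 51,
             secs_per_file: float = 2.5) -> bool:
    """Pretend one file finishes every secs_per_file simulated seconds."""
    fake = FakeClock()

    def pending() -> int:
        return max(0, files - int(fake.now / secs_per_file))

    return wait_until_done(pending, timeout_s,
                           clock=fake.time, sleep=fake.sleep)


print(simulate(100))  # False: 51 files at 2.5 s each need 127.5 s
print(simulate(200))  # True
```

Under these assumed numbers the test and the code are both behaving correctly and only the deadline is too short, which is the hypothesis the test patch is meant to check.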

            adilger Andreas Dilger added a comment -

            It looks like the sanity-hsm test_90 exception was removed in the autotest framework, so it is possible to re-enable this test along with a patch to fix it.

            People

              jamesanunez James Nunez (Inactive)
              maloo Maloo