Test failure sanity-hsm test_90: requests did not complete

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.7.0, Lustre 2.5.4
    • Lustre 2.7.0, Lustre 2.5.4
    • 3
    • 15251

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      https://testing.hpdd.intel.com/test_sets/90f1f8a2-1ff0-11e4-8610-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/29b71c62-020e-11e4-a47c-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/e775fca2-021b-11e4-9435-5254006e85c2

      The sub-test test_90 failed with the following error:

      requests did not complete

      Info required for matching: sanity-hsm 90

          Activity

            [LU-5474] Test failure sanity-hsm test_90: requests did not complete

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12963/
            Subject: LU-5474 tests: sanity-hsm test_90 use local HSM_ARCHIVE
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: 54a78152b70b54ec574774eef97dcf4c27b06d5b

            yujian Jian Yu added a comment - edited

            The failure has recently occurred frequently on the Lustre b2_5 branch:
            https://testing.hpdd.intel.com/test_sets/b9a2d466-8327-11e4-aa2f-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/0e7e5682-8318-11e4-9195-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/3e1b4c7e-830e-11e4-a45f-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/28e422d4-8414-11e4-84b4-5254006e85c2

            gerrit Gerrit Updater added a comment -

            Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12963
            Subject: LU-5474 tests: sanity-hsm test_90 use local HSM_ARCHIVE
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: b8360cf33b18594c765e2fd55e98a591fdc50fdb

            pjones Peter Jones added a comment -

            Landed for 2.7


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12069/
            Subject: LU-5474 tests: sanity-hsm test_90 use local HSM_ARCHIVE
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8641822bdb248a034bf4065c7e2e60cdb7a47041

            jamesanunez James Nunez (Inactive) added a comment -

            I've updated sanity-hsm test 90 to use a local disk for the archive, as was implemented in test 40. In the last two runs with this change, test 90 failed to archive any files from the file list: https://testing.hpdd.intel.com/test_sessions/5ac85a16-6ee0-11e4-81e7-5254006e85c2 and https://testing.hpdd.intel.com/test_sets/616b778e-6b77-11e4-b1b4-5254006e85c2 .

            For test 90, it seems that the archive number passed to the copytool does not match the archive number in the archive requests. The copytool is started with archive #2, while the archive requests are looking for archive #1. From the client test log:

            CMD: shadow-3vm9 lhsmtool_posix  --daemon --hsm-root /tmp/d90.sanity-hsm --archive 2 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-11-17/lustre-reviews-el6-x86_64--review-dne-part-2--2_9_1__28407__-70078147694660-145306/sanity-hsm.test_90.copytool2_log.shadow-3vm9.log 2>&1
            ...
            Changed after 8s: from 'lrh=[type=10680000 len=136 idx=1/1846] fid=[0x400000401:0x1c4:0x0] dfid=[0x400000401:0x1c4:0x0] compound/cookie=0x546ab2b6/0x546ab2b6 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
            lrh=[type=10680000 len=136 idx=1/1847] fid=[0x400000401:0x1c5:0x0] dfid=[0x400000401:0x1c5:0x0] compound/cookie=0x546ab2b6/0x546ab2b7 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]

            The MDS complains that it cannot find an agent for archive #1, which is accurate, because the copytool was started for archive #2:

            LustreError: 14430:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) lustre-MDT0000: Cannot find agent for archive 1: rc = -11

            Yet test 40 uses essentially the same calls to start the copytool, and its archive numbers are consistent. Test 90 archives from a file list; test 40 does not.
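
The mismatch described in the comment above can be checked mechanically. The following is a minimal sketch, not code from sanity-hsm.sh or test-framework.sh: the function names are hypothetical, and it assumes the copytool command line and the HSM action records have the same textual form as the log excerpts quoted above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not part of the
# Lustre test framework.

# Print the unique archive IDs found in HSM action records read from stdin.
# Assumes each record carries a space-separated "archive#=<N>" field.
requested_archive_ids() {
    tr ' ' '\n' | sed -n 's/^archive#=\([0-9][0-9]*\)$/\1/p' | sort -un
}

# Print the archive ID given to a copytool command line read from stdin.
# Assumes the "--archive <N>" form seen in the client test log.
copytool_archive_id() {
    sed -n 's/.*--archive  *\([0-9][0-9]*\).*/\1/p'
}

# Return 0 if every requested archive ID ($2...) equals the copytool's
# archive ID ($1), and 1 otherwise.
archive_ids_match() {
    aim_copytool_id=$1
    shift
    for aim_id in "$@"; do
        [ "$aim_id" = "$aim_copytool_id" ] || return 1
    done
    return 0
}
```

Fed the CMD line and the two action records quoted above, `copytool_archive_id` yields 2 and `requested_archive_ids` yields 1, so `archive_ids_match 2 1` fails, which agrees with the "Cannot find agent for archive 1" error on the MDS.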

            gerrit Gerrit Updater added a comment -

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/12069
            Subject: LU-5474 tests: sanity-hsm test_90 use local HSM_ARCHIVE
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 9
            Commit: 7a0b4705f130f318acc0aaf6297cd4e0068060e5

            jamesanunez James Nunez (Inactive) added a comment -

            Andreas,
            I modified http://review.whamcloud.com/#/c/12069/ to use $TMP for the archive.

            adilger Andreas Dilger added a comment -

            Bruno, can you please make a patch to sanity-hsm test_90() to use $TMP for the HSM archive, as was done with test_40()? Is there any reason not to make all HSM tests use $TMP for the archive by default, and have only some tests use the NFS shared directory when needed (very large archive, or HSM failover testing)? I think that would speed up all the HSM tests, unless it filled up $TMP and caused test failures.

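
The default suggested in the comment above (local archive unless a test asks for the shared one, with a guard against filling the local disk) could look roughly like the sketch below. It is an illustration only, not the change made in the patches on this ticket: the function names, the "local"/"shared" mode argument, and the way the directories are passed in are all assumptions.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; names and arguments are
# assumptions and do not come from sanity-hsm.sh.

# Print the archive root a test should use.
#   $1: "local" or "shared"
#   $2: local directory (for example a directory under $TMP)
#   $3: shared directory (for example an NFS mount); may be empty
# Falls back to the local directory when no shared directory is given.
choose_hsm_archive_root() {
    car_mode=$1
    car_local=$2
    car_shared=$3
    if [ "$car_mode" = "shared" ] && [ -n "$car_shared" ]; then
        printf '%s\n' "$car_shared"
    else
        printf '%s\n' "$car_local"
    fi
}

# Return 0 if the filesystem holding directory $1 reports at least $2 KiB
# available, and non-zero otherwise (including when df cannot read $1).
has_free_kb() {
    hfk_avail=$(df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }')
    [ -n "$hfk_avail" ] && [ "$hfk_avail" -ge "$2" ]
}
```

A test needing a very large archive or failover coverage would pass "shared"; every other test would pass "local" and could call `has_free_kb` first to skip or fail early instead of filling the local disk.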
            yong.fan nasf (Inactive) added a comment - Another failure instance on b2_5: https://testing.hpdd.intel.com/test_sets/d03f6910-681e-11e4-acbe-5254006e85c2

            People

              jamesanunez James Nunez (Inactive)
              maloo Maloo