Lustre / LU-4839

Test failure sanity-hsm test_60: Timed out waiting for progress update

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.7.0, Lustre 2.5.3
    • Severity: 3
    • 13335

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      http://maloo.whamcloud.com/test_sets/a72e2f24-a25e-11e3-80d1-52540035b04c
      https://maloo.whamcloud.com/test_sets/e3e8358a-b803-11e3-8f4e-52540035b04c

      The sub-test test_60 failed with the following error:

      Timed out waiting for progress update!

      Info required for matching: sanity-hsm 60

      Attachments

        Activity

          [LU-4839] Test failure sanity-hsm test_60: Timed out waiting for progress update
          pjones Peter Jones added a comment -

          Landed for 2.8


          gerrit Gerrit Updater added a comment -

          Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13962
          Subject: LU-4839 tests: wait for copytool start sanity-hsm/60
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: 9a9e80b5518a36790a4472486b81a5162c2277a3

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13731/
          Subject: LU-4839 tests: wait for copytool start sanity-hsm/60
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: d0636eede1ab340421177c6a97ec27689099f953
          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Nasf, Nathaniel,
          While browsing LU-6203/test_251 issues, I found that both (and only) sanity-hsm test_60 and test_251 failed in this last autotest session. Looking at the debug traces for test_60, I found that we face the same issue I already strongly suspect in LU-6203: the lock flush/cancel between the client running the sub-test (and creating the file being archived) and the OSS/OST can take more than 20s, causing the test failure because the copytool is stuck during that time waiting to access/copy data for the archive operation.
          So I wonder whether the solution for this ticket could be the same one I proposed for LU-6203/test_251: add a cancel_lru_locks() call before the hsm_archive so that locks are flushed/cancelled early.

          BTW, I don't know what causes such a delay; it seems to occur only with ZFS and could also be related to some VM/disk issue...
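          Bruno's suggestion amounts to reordering the test so that client-side locks are flushed before the archive request is issued. The sketch below shows that ordering in isolation; `cancel_lru_locks` and the `hsm_archive` helper exist in Lustre's test-framework.sh, but here they are replaced by hypothetical stubs so the control flow can run anywhere, and the file path is the one the real test uses.

          ```shell
          # Sketch of the proposed fix ordering for sanity-hsm test_60.
          # cancel_lru_locks and hsm_archive are stubs standing in for the
          # real test-framework.sh helpers; they only record the call order.
          ORDER=""

          cancel_lru_locks() {   # stub: the real helper flushes client DLM locks
              ORDER="${ORDER}cancel_lru_locks($1);"
          }

          hsm_archive() {        # stub: the real helper runs "lfs hsm_archive"
              ORDER="${ORDER}hsm_archive($1);"
          }

          f=/mnt/lustre/d60.sanity-hsm/f60.sanity-hsm   # path used by the real test

          # Proposed ordering: flush/cancel OSC locks *before* archiving, so the
          # copytool is not stuck waiting >20s for the lock flush to the OSS/OST.
          cancel_lru_locks osc
          hsm_archive "$f"

          echo "$ORDER"
          ```

          With the real helpers, the only change to the test would be the extra `cancel_lru_locks osc` line immediately before the archive request.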

          yong.fan nasf (Inactive) added a comment - Another failure instance on b2_5: https://testing.hpdd.intel.com/test_sets/a234c7d0-bd24-11e4-a946-5254006e85c2

          gerrit Gerrit Updater added a comment -

          Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13731
          Subject: LU-4839 tests: wait for copytool start sanity-hsm/60
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: bd32b390a4077db4c763148f14855a96b9ec8545

          gerrit Gerrit Updater added a comment -

          Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13214
          Subject: LU-4839 utils: DEBUG ONLY add debugging to hsmtool
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: ae604857d0a7949db0c78e2a4df8d7309250b830
          utopiabound Nathaniel Clark added a comment - - edited

          Unfortunately the wait time can't be increased much more without compromising the test, since the test is trying to ensure that updates happen every 5 seconds instead of the default 30. I've already pushed the wait time up to 20 seconds. There seems to be a significant delay between "archiving" and "saving striping info":

          1417489810.840712 lhsmtool_posix[8894]: processing file 'd60.sanity-hsm/f60.sanity-hsm'
          1417489810.895598 lhsmtool_posix[8894]: archiving '/mnt/lustre/.lustre/fid/0x400000401:0x1c2:0x0' to '/home/autotest2/.autotest/shared_dir/2014-12-01/143223-70364285431720/arc1/01c2/0000/0401/0000/0004/0000/0x400000401:0x1c2:0x0_tmp'
          1417489841.241595 lhsmtool_posix[8894]: saving stripe info of '/mnt/lustre/.lustre/fid/0x400000401:0x1c2:0x0' in /home/autotest2/.autotest/shared_dir/2014-12-01/143223-70364285431720/arc1/01c2/0000/0401/0000/0004/0000/0x400000401:0x1c2:0x0_tmp.lov
          1417489845.934025 lhsmtool_posix[8894]: start copy of 39000000 bytes from '/mnt/lustre/.lustre/fid/0x400000401:0x1c2:0x0' to '/home/autotest2/.autotest/shared_dir/2014-12-01/143223-70364285431720/arc1/01c2/0000/0401/0000/0004/0000/0x400000401:0x1c2:0x0_tmp'
          1417489850.089915 lhsmtool_posix[8894]: %13 
          

          This step is all about creating the destination directories and opening the destination file for writing (except for an open_by_fid). My current theory is that the issue lies with NFS being the shared directory that the copytool wants to write to.
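          The landed patches (subject "tests: wait for copytool start") apply the usual poll-with-timeout pattern: don't issue the archive until the copytool is known to be up. A minimal self-contained sketch of that pattern follows; here readiness is signalled by a marker file created by a background job, whereas in the real test it would be the copytool registering with the coordinator.

          ```shell
          # Generic poll-with-timeout loop of the kind used to wait for the
          # copytool to start before issuing the archive request. A background
          # job stands in for copytool startup by creating a marker file.
          marker=$(mktemp -u)

          ( sleep 1; touch "$marker" ) &   # stand-in for copytool startup

          wait_started() {
              # Poll once per second until the marker exists or $1 seconds pass.
              local timeout=$1 elapsed=0
              while [ ! -e "$marker" ]; do
                  [ "$elapsed" -ge "$timeout" ] && return 1
                  sleep 1
                  elapsed=$((elapsed + 1))
              done
              return 0
          }

          if wait_started 10; then
              status=started
          else
              status=timeout
          fi
          echo "copytool $status"

          wait            # reap the background job
          rm -f "$marker"
          ```

          The same shape works for the progress-update wait itself; the only knobs are the poll interval and the overall timeout.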


          adilger Andreas Dilger added a comment -

          Nathaniel, is it possible the test still isn't giving enough time for this to pass on review-zfs? This seems like one of the more common failures in review-zfs, so if we increase the wait time only for ZFS-backed filesystems it will hopefully allow more passes (assuming there isn't some other real failure here; I haven't looked into the logs).
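          Andreas's suggestion of a ZFS-only longer wait could be sketched as below. `FSTYPE` is the variable the Lustre test suite uses to carry the backend filesystem type; the timeout values here are illustrative, not the ones in the test.

          ```shell
          # Sketch: scale the progress-update timeout when the servers are
          # ZFS-backed, since ZFS backends have shown >20s delays.
          FSTYPE=${FSTYPE:-zfs}        # normally set by the test environment

          if [ "$FSTYPE" = "zfs" ]; then
              PROGRESS_WAIT=40         # illustrative longer wait for ZFS
          else
              PROGRESS_WAIT=20         # current wait on ldiskfs
          fi
          echo "waiting up to ${PROGRESS_WAIT}s for progress update"
          ```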
          adilger Andreas Dilger added a comment -

          Still seeing this test fail on master, 7x in the past week:
          https://testing.hpdd.intel.com/test_sets/66760558-81f4-11e4-90bc-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/b432c4fe-80cb-11e4-9ec8-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/e5c6de08-801f-11e4-b486-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/8547fb7e-80c0-11e4-a434-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/3b87e076-79db-11e4-9e8a-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/5eb0c2c8-79e7-11e4-807e-5254006e85c2

          People

            Assignee: utopiabound Nathaniel Clark
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 15
