
Test failure sanity-hsm test_60: Timed out waiting for progress update

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0, Lustre 2.5.3
    • Severity: 3
    • 13335

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      http://maloo.whamcloud.com/test_sets/a72e2f24-a25e-11e3-80d1-52540035b04c
      https://maloo.whamcloud.com/test_sets/e3e8358a-b803-11e3-8f4e-52540035b04c

      The sub-test test_60 failed with the following error:

      Timed out waiting for progress update!

      Info required for matching: sanity-hsm 60

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.8


          gerrit Gerrit Updater added a comment -

          Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13962
          Subject: LU-4839 tests: wait for copytool start sanity-hsm/60
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: 9a9e80b5518a36790a4472486b81a5162c2277a3

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13731/
          Subject: LU-4839 tests: wait for copytool start sanity-hsm/60
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: d0636eede1ab340421177c6a97ec27689099f953

          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Nasf, Nathaniel,
          While browsing the LU-6203/test_251 issues, I found that the only failures in this last autotest session were sanity-hsm tests 60 and 251. Looking at the debug traces for test_60, I found we face the same issue I already strongly suspect in LU-6203: the lock flush/cancel between the client running the sub-test (and creating the file being archived) and the OSS/OST can take more than 20s, causing the test failure because the copytool is stuck during that time waiting to access/copy the data for the archive operation.
          So I wonder if the solution for this ticket could be the same one I have in mind for LU-6203/test_251: add a cancel_lru_locks() before the hsm_archive in order to flush/cancel locks early.

          BTW, I don't know what causes such a delay; it seems to occur only with ZFS and could also be related to some VM/disk issue...
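
          A minimal sketch of that suggestion, assuming standard sanity-hsm/test-framework helpers (cancel_lru_locks, wait_request_state, $HSM_ARCHIVE_NUMBER) and a test file $f with fid $fid; this is illustrative only, not the landed change:

          # Flush/cancel the client's OSC locks before the timed archive, so the
          # copytool is not stuck behind a slow lock cancel on the OSS/OST.
          cancel_lru_locks osc

          $LFS hsm_archive --archive $HSM_ARCHIVE_NUMBER $f ||
                  error "could not archive file"
          wait_request_state $fid ARCHIVE SUCCEED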

          yong.fan nasf (Inactive) added a comment - Another failure instance on b2_5: https://testing.hpdd.intel.com/test_sets/a234c7d0-bd24-11e4-a946-5254006e85c2

          gerrit Gerrit Updater added a comment -

          Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13731
          Subject: LU-4839 tests: wait for copytool start sanity-hsm/60
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: bd32b390a4077db4c763148f14855a96b9ec8545

          gerrit Gerrit Updater added a comment -

          Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/13214
          Subject: LU-4839 utils: DEBUG ONLY add debugging to hsmtool
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: ae604857d0a7949db0c78e2a4df8d7309250b830

          utopiabound Nathaniel Clark added a comment - - edited

          Unfortunately the wait time can't be increased much more without compromising the test, since the test is trying to ensure that updates happen every 5 seconds instead of the default 30, and I've already pushed the wait time up to 20 seconds. There seems to be a significant delay between "archiving" and "saving stripe info":

          1417489810.840712 lhsmtool_posix[8894]: processing file 'd60.sanity-hsm/f60.sanity-hsm'
          1417489810.895598 lhsmtool_posix[8894]: archiving '/mnt/lustre/.lustre/fid/0x400000401:0x1c2:0x0' to '/home/autotest2/.autotest/shared_dir/2014-12-01/143223-70364285431720/arc1/01c2/0000/0401/0000/0004/0000/0x400000401:0x1c2:0x0_tmp'
          1417489841.241595 lhsmtool_posix[8894]: saving stripe info of '/mnt/lustre/.lustre/fid/0x400000401:0x1c2:0x0' in /home/autotest2/.autotest/shared_dir/2014-12-01/143223-70364285431720/arc1/01c2/0000/0401/0000/0004/0000/0x400000401:0x1c2:0x0_tmp.lov
          1417489845.934025 lhsmtool_posix[8894]: start copy of 39000000 bytes from '/mnt/lustre/.lustre/fid/0x400000401:0x1c2:0x0' to '/home/autotest2/.autotest/shared_dir/2014-12-01/143223-70364285431720/arc1/01c2/0000/0401/0000/0004/0000/0x400000401:0x1c2:0x0_tmp'
          1417489850.089915 lhsmtool_posix[8894]: %13 
          

          This step consists entirely of creating the destination directories and opening the destination file for write (apart from an open_by_fid). My current theory is that the issue lies with the NFS-backed shared directory that the copytool wants to write to.
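
          For reference, the "wait for copytool start" approach in the patches above could look roughly as follows. The helper wait_for_copytool_start is hypothetical (the landed change may differ), while copytool_setup, do_facet, $SINGLEAGT, $HSM_ARCHIVE_NUMBER and error are standard sanity-hsm/test-framework names:

          # Hypothetical helper: wait until lhsmtool_posix is actually running on
          # the agent node, so a slow copytool start (e.g. against an NFS-backed
          # archive directory) does not eat into the 20s progress-update window.
          wait_for_copytool_start() {
                  local timeout=${1:-30}
                  local elapsed=0

                  while ! do_facet $SINGLEAGT "pgrep -x lhsmtool_posix" > /dev/null; do
                          sleep 1
                          elapsed=$((elapsed + 1))
                          [ $elapsed -lt $timeout ] ||
                                  error "copytool did not start within ${timeout}s"
                  done
          }

          copytool_setup
          wait_for_copytool_start 30
          # Only now issue the archive whose progress updates are being timed.
          $LFS hsm_archive --archive $HSM_ARCHIVE_NUMBER $f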


          People

            Assignee: utopiabound Nathaniel Clark
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 15

            Dates

              Created:
              Updated:
              Resolved: