sanity-hsm test_24d: wanted 'SUCCEED' got 'WAITING'

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/04c602fa-3258-11e3-9de6-52540035b04c.

      The sub-test test_24d failed with the following error:

      Cannot send HSM request (use of /mnt/lustre2/d0.sanity-hsm/d24/f.sanity-hsm.24d): Read-only file system
      :
      :
      CMD: wtm-27vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200008101:0x1f:0x0'.*action='ARCHIVE'/ { print $13 }' | cut -f2 -d=
      Update not seen after 100s: wanted 'SUCCEED' got 'WAITING'
      sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200008101:0x1f:0x0 is not SUCCEED

      Info required for matching: sanity-hsm 24d
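
The CMD line in the log above polls the coordinator's hsm.actions list until the request status reaches the wanted value, and the test gives up after 100s. A minimal shell sketch of that kind of polling loop is shown below. The names wait_for_value and fake_status are hypothetical and are not the test framework's actual helpers; in the real test the status command would be the lctl/awk/cut pipeline quoted in the log.

```shell
# Hypothetical helper (not the actual test-framework code).
# Usage: wait_for_value EXPECTED MAX_SECONDS CMD [ARGS...]
# Runs CMD once per second until it prints EXPECTED or MAX_SECONDS elapse.
wait_for_value() {
    expected=$1
    max=$2
    shift 2
    waited=0
    while :; do
        got=$("$@")
        if [ "$got" = "$expected" ]; then
            return 0
        fi
        if [ "$waited" -ge "$max" ]; then
            echo "Update not seen after ${max}s: wanted '$expected' got '$got'"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Stub standing in for the real status query (assumption for illustration).
fake_status() { echo SUCCEED; }

wait_for_value SUCCEED 5 fake_status && echo "request completed"
```

With a status command that keeps printing WAITING, the function returns non-zero and prints a message of the same shape as the failure reported in this ticket.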

          Activity

            [LU-4093] sanity-hsm test_24d: wanted 'SUCCEED' got 'WAITING'
            yujian Jian Yu added a comment -

            Also, I think we need to change the link of the failures you listed from LU-4093 to LU-5622. What do you think?

            Agreed. Just changed. Thank you, Bruno.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Just created LU-5622 to address this new problem.

            I just want to add that the fact that it seems to occur more frequently at the switch between the test_24c and test_24d sub-tests may come from specific/enhanced cleanup actions in test_24c ...

            Also, I think we need to change the link of the failures you listed from LU-4093 to LU-5622. What do you think?

            yujian Jian Yu added a comment -

            If you agree, I will create ticket, assign to myself, add you in watchers list, and push a fix soon.

            Thank you, Bruno. Please do.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Jian, humm, this looks different I think, at least because in all the cases you noticed the copytool_log is not present for test_24d!

            And if I look closely at the logs/msgs, "Wakeup copytool agt1 on ..." should not be printed after a successful copytool_cleanup (from the previous sub-test) and copytool_setup (within the new sub-test) in sequence, because the implied "pkill" command should have failed if no copytool thread from the previous sub-test can be found after its stop/kill in copytool_cleanup.
            So a possible scenario is that some copytool thread from the previous test_24c sub-test is somehow stuck and still not terminated at the time test_24d tries to restart the copytool ... but it finally dies, and in fact we may end up running test_24d with no copytool started!

            So this should be addressed in a new ticket, and the fix should be to ensure the full death of ALL copytool threads in copytool_cleanup().
            If you agree, I will create a ticket, assign it to myself, add you to the watchers list, and push a fix soon.
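
The idea of ensuring that every copytool thread is really gone before the next sub-test starts can be sketched in shell as "signal, then poll until nothing is left". The helper kill_and_confirm below is hypothetical and is not the actual copytool_cleanup() change; as a simplification it takes explicit PIDs instead of matching processes by name with pkill, and the sleep processes only stand in for copytool threads.

```shell
# Hypothetical helper (an illustration, not the fix that landed).
# Usage: kill_and_confirm TIMEOUT_SECONDS PID...
# Sends SIGTERM, then polls once per second until no PID is alive.
# Returns 0 once all are gone; on timeout sends SIGKILL and returns 1.
kill_and_confirm() {
    timeout=$1
    shift
    kill "$@" 2>/dev/null || true
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        alive=0
        for pid in "$@"; do
            # kill -0 only checks that the process still exists
            kill -0 "$pid" 2>/dev/null && alive=1
        done
        [ "$alive" -eq 0 ] && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    kill -9 "$@" 2>/dev/null || true
    return 1
}

# Stand-in processes (assumption: real code would target the copytool).
sleep 30 & pid1=$!
sleep 30 & pid2=$!
if kill_and_confirm 5 "$pid1" "$pid2"; then
    echo "all processes gone"
fi
```

The point of the polling step is that a caller can only proceed once the check has confirmed no survivor, which avoids the scenario described above where a stuck thread from test_24c is still present when test_24d tries to restart the copytool.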

            jamesanunez James Nunez (Inactive) added a comment -

            It looks like I just hit this issue with b2_5 in review-zfs at https://maloo.whamcloud.com/test_sets/e11b4944-c822-11e3-888b-52540035b04c

            bfaccini Bruno Faccini (Inactive) added a comment -

            No more related failures have been reported for master builds since Nov 18th. So, as expected, the 2 changes for LU-4093 made it.

            bfaccini Bruno Faccini (Inactive) added a comment -

            http://review.whamcloud.com/8329 has just landed, so we need to wait a week or two to verify its effects before closing.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Humm, thanks Andreas for chasing this. It seems that the original patch #8157 (patch-set #3) has a typo (MDT_HSMCTRL vs mdt_hsmctrl) and an inverted test ([echo $oldstate | grep stop || continue] vs [echo $oldstate | grep stop && continue]) that should prevent it from working as expected and make it run with reversed logic instead (only an already-stopped CDT will be stopped) …
            The fact that it passed auto-tests could be because the condition it is intended to fix did not show up during its own testing!
            Just pushed http://review.whamcloud.com/8329 to fix this.
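
To make the inverted test concrete, here is a small shell sketch comparing the two forms quoted in the comment above. The list of states is invented for illustration, and the loops are not the code from patch 8157 or 8329; only the "grep stop || continue" and "grep stop && continue" forms come from the comment.

```shell
# Illustrative coordinator states, one per MDT (assumed values, not real output).
states="enabled stopped enabled"

# Inverted form: "|| continue" skips every state that does NOT match "stop",
# so only an already-stopped coordinator reaches the rest of the loop body.
buggy=""
for oldstate in $states; do
    echo "$oldstate" | grep -q stop || continue
    buggy="$buggy $oldstate"
done

# Corrected form: "&& continue" skips states that already match "stop",
# so only coordinators that still need stopping are acted on.
fixed=""
for oldstate in $states; do
    echo "$oldstate" | grep -q stop && continue
    fixed="$fixed $oldstate"
done

echo "inverted test acts on:$buggy"
echo "corrected test acts on:$fixed"
```

With these sample states, the first loop only reaches the "stopped" entry while the second reaches the two "enabled" entries, which matches the "only stopped CDT will be stopped" behaviour described for the original patch.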

            adilger Andreas Dilger added a comment -

            Patch 8157 landed on 2013-11-13, but I still see tests failing with LU-4093/LU-4126 in the past few days. Is that just because the test queue is so long that the results we are seeing today are for patches that were based on code not including the fix?

            If that can be verified by checking that the parent commit of recent test failures does NOT include change 8157, then I guess this bug can be closed. It looks from the results that LU-4093 shows up as lustre-hsm passing only 73/91 tests. There are still cases with 90/91 tests passing, so that needs to be a separate bug.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Fix to prevent zombie requests during CT/copytool restart is at http://review.whamcloud.com/8157.

            People

              bfaccini Bruno Faccini (Inactive)
              maloo Maloo