sanity-hsm test_24d: wanted 'SUCCEED' got 'WAITING'

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/04c602fa-3258-11e3-9de6-52540035b04c.

      The sub-test test_24d failed with the following error:

      Cannot send HSM request (use of /mnt/lustre2/d0.sanity-hsm/d24/f.sanity-hsm.24d): Read-only file system
      :
      :
      CMD: wtm-27vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200008101:0x1f:0x0'.*action='ARCHIVE'/ { print $13 }' | cut -f2 -d=
      Update not seen after 100s: wanted 'SUCCEED' got 'WAITING'
      sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200008101:0x1f:0x0 is not SUCCEED

      Info required for matching: sanity-hsm 24d

          Activity

            [LU-4093] sanity-hsm test_24d: wanted 'SUCCEED' got 'WAITING'
            yujian Jian Yu added a comment -

            "Also, I think we need to change the link of the failures you listed from LU-4093 to LU-5622, what do you think?"

            Agreed. Just changed. Thank you, Bruno.


            bfaccini Bruno Faccini (Inactive) added a comment -

            Just created LU-5622 to address this new problem.

            Just want to add that the fact that it seems to occur more frequently at the switch between the test_24c and test_24d sub-tests may come from specific/enhanced cleanup actions in test_24c ...

            Also, I think we need to change the link of the failures you listed from LU-4093 to LU-5622, what do you think?

            yujian Jian Yu added a comment -

            "If you agree, I will create a ticket, assign it to myself, add you to the watchers list, and push a fix soon."

            Thank you, Bruno. Please do.


            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Jian, humm, this looks different I think, at least because in all the cases you noticed the copytool_log is not present for test_24d!

            And if I look closely at the logs/msgs, "Wakeup copytool agt1 on ..." should not be printed after a successful copytool_cleanup (from the previous sub-test) followed by copytool_setup (within the new sub-test), because the implied "pkill" command should have failed if no copytool thread from the previous sub-test could be found after its stop/kill in copytool_cleanup.
            So a possible scenario is that some copytool thread from the previous test_24c sub-test is somehow stuck and still not terminated at the time test_24d tries to restart the copytool ... but it finally dies, and in fact we may end up running test_24d with no copytool started!

            So this should be addressed in a new ticket, and the fix should be to ensure full death of ALL copytool threads in copytool_cleanup().
            If you agree, I will create a ticket, assign it to myself, add you to the watchers list, and push a fix soon.
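
The "ensure full death of ALL copytool threads" idea from the comment above can be sketched as a small polling helper. This is only an illustration under stated assumptions: `wait_until_gone` is a hypothetical name, it is not the test-framework's copytool_cleanup(), and the process name passed to `pkill`/`pgrep` in the usage comment is a placeholder.

```shell
# Hypothetical helper, NOT the actual copytool_cleanup() implementation.
# Polls a probe command until it reports that nothing is left, or gives up.
#   $1     timeout in seconds
#   $2...  probe command; exit status 0 means "still running" (pgrep convention)
wait_until_gone() {
    local timeout="$1" waited=0
    shift
    while "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "still running after ${waited}s: $*" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use in a cleanup step (process name is an assumption):
#   pkill -x "$copytool_name"
#   wait_until_gone 20 pgrep -x "$copytool_name" || echo "copytool did not exit"
```

Passing the probe as a command keeps the wait loop independent of how liveness is checked, so the next sub-test would only start its own copytool once the probe reports that no old process remains.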

            jamesanunez James Nunez (Inactive) added a comment -

            It looks like I just hit this issue with b2_5 in review-zfs at https://maloo.whamcloud.com/test_sets/e11b4944-c822-11e3-888b-52540035b04c

            bfaccini Bruno Faccini (Inactive) added a comment -

            No more related failures for master builds reported since Nov 18th. So as expected the 2 changes for LU-4093 made it.

            bfaccini Bruno Faccini (Inactive) added a comment -

            http://review.whamcloud.com/8329 has just landed, so we need to wait a week or two to verify its effects before closing.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Humm, thanks Andreas for chasing this. It seems that the original patch #8157 (patch-set #3) has a typo (MDT_HSMCTRL vs mdt_hsmctrl) and an inverted test ([echo $oldstate | grep stop || continue] vs [echo $oldstate | grep stop && continue]) that should prevent it from working as expected and instead make it run with reversed logic (only a stopped CDT will be stopped) …
            The fact that it passed auto-tests could be because the condition it is intended to fix did not show up during its own testing !!
            Just pushed http://review.whamcloud.com/8329 to fix this.
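
To make the inverted test described above concrete, here is a minimal sketch comparing the two guards. The function name `acted_on` and the state strings are hypothetical sample values for illustration; they are not taken from the patch or from real lctl output.

```shell
# Hypothetical illustration of the two guards quoted in the comment above.
# Prints the states that a loop body would act on under each form.
#   $1     "or"  -> uses  grep stop || continue
#          "and" -> uses  grep stop && continue
#   $2...  sample state strings
acted_on() {
    local guard="$1" state
    shift
    for state in "$@"; do
        if [ "$guard" = "or" ]; then
            # Skips every state that does NOT contain "stop":
            # only already-stopped entries reach the loop body.
            echo "$state" | grep -q stop || continue
        else
            # Skips every state that DOES contain "stop":
            # only entries that are not stopped reach the loop body.
            echo "$state" | grep -q stop && continue
        fi
        echo "$state"
    done
}

acted_on or enabled stopped     # prints: stopped
acted_on and enabled stopped    # prints: enabled
```

With `|| continue` the body only runs for entries that are already stopped, which matches the "only stopped CDT will be stopped" behaviour reported in the comment; `&& continue` skips those entries instead.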

            People

              bfaccini Bruno Faccini (Inactive)
              maloo Maloo