[LU-4093] sanity-hsm test_24d: wanted 'SUCCEED' got 'WAITING' - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
Affects Version/s: Lustre 2.5.0
Labels:
- HSM
- zfs

Severity: 3
Rank (Obsolete): 10999

Description

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/04c602fa-3258-11e3-9de6-52540035b04c.

The sub-test test_24d failed with the following error:

Cannot send HSM request (use of /mnt/lustre2/d0.sanity-hsm/d24/f.sanity-hsm.24d): Read-only file system
:
:
CMD: wtm-27vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200008101:0x1f:0x0'.*action='ARCHIVE'/ { print $13 }' | cut -f2 -d=
Update not seen after 100s: wanted 'SUCCEED' got 'WAITING'
sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200008101:0x1f:0x0 is not SUCCEED
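The "Update not seen after 100s" message above comes from a poll-until-match loop: the test repeatedly runs a command and compares its output with the expected state until a timeout expires. A minimal sketch of that pattern follows. `poll_until` is a hypothetical helper written for illustration, not the test framework's own function, and its one-second polling interval is an assumption.

```shell
# Hypothetical helper (illustration only, not the Lustre test-framework code).
# Usage: poll_until <timeout_s> <expected> <command> [args...]
poll_until() {
    local timeout="$1"
    local expected="$2"
    shift 2
    local waited=0
    local result

    while :; do
        # Run the probe command and capture its output.
        result=$("$@") || true
        if [ "$result" = "$expected" ]; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "Update not seen after ${waited}s: wanted '$expected' got '$result'"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

In the failing run, the probe was the `lctl get_param ... | awk ... | cut -f2 -d=` pipeline shown in the log, and the request stayed in WAITING for the whole timeout.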

Info required for matching: sanity-hsm 24d

Attachments

Issue Links

is duplicated by

LU-4235 Test failure on test suite sanity-hsm, subtest test_24d

Resolved

is related to

LU-4126 sanity-hsm test_15 failure: 'requests did not complete'

Resolved

Activity

Jian Yu added a comment - 15/Sep/14 5:09 PM

Also, I think we need to change the link of the failures you listed from ~~LU-4093~~ to ~~LU-5622~~, what do you think?

Agreed. Just changed. Thank you, Bruno.

Bruno Faccini (Inactive) added a comment - 15/Sep/14 3:28 PM

Just created ~~LU-5622~~ to address this new problem.

Just want to add that the fact that it seems to occur more frequently at the switch between the test_24c and test_24d sub-tests may come from specific/enhanced cleanup actions in test_24c ...

Also, I think we need to change the link of the failures you listed from ~~LU-4093~~ to ~~LU-5622~~, what do you think?

Jian Yu added a comment - 14/Sep/14 6:30 AM

If you agree, I will create a ticket, assign it to myself, add you to the watchers list, and push a fix soon.

Thank you, Bruno. Please do.

Bruno Faccini (Inactive) added a comment - 13/Sep/14 2:53 PM

Hello Jian, hmm, this looks different I think, at least because in all the cases you noticed, the copytool_log is not present for test_24d!

And if I look closely at the logs/msgs, "Wakeup copytool agt1 on ..." should not be printed after a successful copytool_cleanup (from the previous sub-test) followed by copytool_setup (within the new sub-test), because the implied "pkill" command should have failed if no copytool thread from the previous sub-test could be found after its stop/kill in copytool_cleanup.
So a possible scenario is that some copytool thread from the previous test_24c sub-test is somehow stuck and still not terminated at the time test_24d tries to restart the copytool ... but it finally dies, and in fact we may end up running test_24d with no copytool started!

So this should be addressed in a new ticket, and the fix should be to ensure the full death of ALL copytool threads in copytool_cleanup().
If you agree, I will create a ticket, assign it to myself, add you to the watchers list, and push a fix soon.
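The proposed fix (make the cleanup wait until every copytool process is really gone before the next sub-test starts a new one) can be sketched as a kill-then-poll loop. This is only an illustration under stated assumptions: `kill_and_wait` is a hypothetical helper, it works on explicit local PIDs rather than the remote `pkill` the comment mentions, and it is not the actual copytool_cleanup() change.

```shell
# Hypothetical helper (illustration only, not the sanity-hsm code).
# Usage: kill_and_wait <timeout_s> <pid>...
# Returns 0 once all PIDs are gone, 1 if some had to be force-killed.
kill_and_wait() {
    local timeout="$1"
    shift
    local waited=0
    local pid
    local alive

    # Ask every process to terminate; ignore PIDs that are already gone.
    kill -TERM "$@" 2>/dev/null || true

    while :; do
        alive=""
        for pid in "$@"; do
            # kill -0 sends no signal, it only checks that the PID exists.
            if kill -0 "$pid" 2>/dev/null; then
                alive="$alive $pid"
            fi
        done
        if [ -z "$alive" ]; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            # Last resort: force-kill the survivors and report failure.
            kill -KILL $alive 2>/dev/null || true
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

The point of the design is that the caller only proceeds to start a new copytool after the helper returns, so a stuck process from the previous sub-test cannot be mistaken for the new one.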

James Nunez (Inactive) added a comment - 23/Apr/14 4:07 PM

It looks like I just hit this issue with b2_5 in review-zfs at https://maloo.whamcloud.com/test_sets/e11b4944-c822-11e3-888b-52540035b04c

Bruno Faccini (Inactive) added a comment - 10/Dec/13 9:16 AM

No more related failures for master builds reported since Nov 18th. So as expected the 2 changes for ~~LU-4093~~ made it.

Bruno Faccini (Inactive) added a comment - 28/Nov/13 1:45 PM

http://review.whamcloud.com/8329 has just landed, so need to wait a week or 2 to verify its effects and close.

Bruno Faccini (Inactive) added a comment - 19/Nov/13 11:19 AM

Hmm, thanks Andreas for chasing this. It seems that the original patch #8157 (patch-set #3) has a typo (MDT_HSMCTRL vs mdt_hsmctrl) and an inverted test ([echo $oldstate | grep stop || continue] vs [echo $oldstate | grep stop && continue]) that should prevent it from working as expected and instead make it run with reversed logic (only a stopped CDT will be stopped) ...
The fact that it passed auto-tests could be because the condition it is intended to fix did not show up during its own testing!
Just pushed http://review.whamcloud.com/8329 to fix this.
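To make the inverted test concrete, the sketch below runs both forms over a list of states and prints the ones that get past the `continue`. `select_states` and the state names passed to it are hypothetical and for illustration only; only the two `grep stop ... continue` expressions come from the comment above.

```shell
# Hypothetical helper (illustration only).
# Usage: select_states <buggy|fixed> <state>...
# Prints, one per line, the states the loop body would act on.
select_states() {
    local mode="$1"
    shift
    local oldstate

    for oldstate in "$@"; do
        if [ "$mode" = buggy ]; then
            # Inverted test: skips everything that is NOT stopped.
            echo "$oldstate" | grep -q stop || continue
        else
            # Intended test: skips what is already stopped.
            echo "$oldstate" | grep -q stop && continue
        fi
        echo "$oldstate"
    done
}
```

With the `||` form only already-stopped entries reach the loop body, which matches the "only stopped CDT will be stopped" behaviour described above; with the `&&` form they are the ones skipped.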

Andreas Dilger added a comment - 18/Nov/13 9:31 PM

Patch 8157 landed on 2013-11-13, but I still see tests failing with ~~LU-4093~~/~~LU-4126~~ in the past few days. Is that just because the test queue is so long that the results we are seeing today are for patches that were based on code not including the fix?

If that can be verified by checking that the parent commit of recent test failures does NOT include change 8157, then I guess this bug can be closed. It looks from the results that ~~LU-4093~~ shows up as lustre-hsm passing only 73/91 tests. There are still cases with 90/91 tests passing, so that needs to be a separate bug.

Bruno Faccini (Inactive) added a comment - 04/Nov/13 3:07 PM

Fix to prevent zombie requests during CT/copytool restart is at http://review.whamcloud.com/8157.

People

Assignee: Bruno Faccini (Inactive)

Reporter: Maloo

Votes: 0

Watchers: 7

Dates

Created: 11/Oct/13 4:18 PM

Updated: 15/Sep/14 5:09 PM

Resolved: 10/Dec/13 9:16 AM