[LU-4093] sanity-hsm test_24d: wanted 'SUCCEED' got 'WAITING'

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
Affects Version/s: Lustre 2.5.0
Labels:
- HSM
- zfs

Severity: 3
Rank (Obsolete): 10999

Description

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/04c602fa-3258-11e3-9de6-52540035b04c.

The sub-test test_24d failed with the following error:

Cannot send HSM request (use of /mnt/lustre2/d0.sanity-hsm/d24/f.sanity-hsm.24d): Read-only file system
:
:
CMD: wtm-27vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200008101:0x1f:0x0'.*action='ARCHIVE'/ { print $13 }' | cut -f2 -d=
Update not seen after 100s: wanted 'SUCCEED' got 'WAITING'
sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200008101:0x1f:0x0 is not SUCCEED

Info required for matching: sanity-hsm 24d

Issue Links

is duplicated by

LU-4235 Test failure on test suite sanity-hsm, subtest test_24d

Resolved

is related to

LU-4126 sanity-hsm test_15 failure: 'requests did not complete'

Resolved

Activity

Jian Yu added a comment - 15/Sep/14 5:09 PM

"Also, I think we need to change the link of the failures you listed from LU-4093 to LU-5622, what do you think?"

Agreed. Just changed. Thank you, Bruno.

Bruno Faccini (Inactive) added a comment - 15/Sep/14 3:28 PM

Just created LU-5622 to address this new problem.

I just want to add that the reason it seems to occur more frequently at the switch between the test_24c and test_24d sub-tests may be the specific/enhanced cleanup actions in test_24c ...

Also, I think we need to change the link of the failures you listed from LU-4093 to LU-5622, what do you think?

Jian Yu added a comment - 14/Sep/14 6:30 AM

"If you agree, I will create a ticket, assign it to myself, add you to the watchers list, and push a fix soon."

Thank you, Bruno. Please do.

Bruno Faccini (Inactive) added a comment - 13/Sep/14 2:53 PM

Hello Jian, hmm, this looks different I think, at least because in all the cases you noticed, the copytool_log is not present for test_24d!

And if I look closely at the logs/messages, "Wakeup copytool agt1 on ..." should not be printed after a successful copytool_cleanup from the previous sub-test followed by copytool_setup within the new sub-test, because the implied "pkill" command should have failed if no copytool thread from the previous sub-test could be found after its stop/kill in copytool_cleanup.
So a possible scenario is that some copytool thread from the previous test_24c sub-test is somehow stuck and still not terminated at the time test_24d tries to restart the copytool ... but it finally dies, and in fact we may end up running test_24d with no copytool started!

So this should be addressed in a new ticket, and the fix should be to ensure the full death of ALL copytool threads in copytool_cleanup().
If you agree, I will create a ticket, assign it to myself, add you to the watchers list, and push a fix soon.
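
The "ensure full death" idea can be illustrated with a small polling helper. This is only a sketch and not the actual copytool_cleanup() change: the function name wait_until_gone, its arguments, and the marker-file demo are hypothetical and stand in for a real process check.

```shell
# Poll until a check command stops succeeding, or give up after a limit.
# Usage: wait_until_gone LIMIT_SECONDS CHECK_COMMAND [ARGS...]
# Returns 0 once CHECK_COMMAND fails (nothing left), 1 if the limit is hit.
# Hypothetical helper, not part of the Lustre test framework.
wait_until_gone() {
    limit=$1
    shift
    elapsed=0
    while "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Demo with a stand-in for a lingering process: a marker file that
# disappears after about one second.
marker=$(mktemp)
( sleep 1; rm -f "$marker" ) &
if wait_until_gone 10 test -e "$marker"; then gone_status=0; else gone_status=1; fi

# A check that never clears must hit the limit.
if wait_until_gone 1 true; then stuck_status=0; else stuck_status=1; fi

echo "gone_status=$gone_status stuck_status=$stuck_status"
```

In a cleanup path, the check command would be something that succeeds while any copytool process is still alive, for example a pgrep call on the copytool's process name (the exact name and the node it runs on are assumptions here), so that the next sub-test only starts its copytool once the old one is really gone.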

James Nunez (Inactive) added a comment - 23/Apr/14 4:07 PM

It looks like I just hit this issue with b2_5 in review-zfs at https://maloo.whamcloud.com/test_sets/e11b4944-c822-11e3-888b-52540035b04c

Bruno Faccini (Inactive) added a comment - 10/Dec/13 9:16 AM

No more related failures have been reported for master builds since Nov 18th. So, as expected, the 2 changes for LU-4093 fixed it.

Bruno Faccini (Inactive) added a comment - 28/Nov/13 1:45 PM

http://review.whamcloud.com/8329 has just landed, so we need to wait a week or two to verify its effects before closing.

Bruno Faccini (Inactive) added a comment - 19/Nov/13 11:19 AM

Hmm, thanks Andreas for chasing this. It seems that the original patch #8157 (patch-set #3) has a typo (MDT_HSMCTRL vs mdt_hsmctrl) and an inverted test ([echo $oldstate | grep stop || continue] vs [echo $oldstate | grep stop && continue]), which should prevent it from working as expected and instead make it run with reversed logic (only an already stopped CDT will be stopped) …
The fact that it passed auto-tests could be because the condition it is intended to fix did not show up during its own testing!
Just pushed http://review.whamcloud.com/8329 to fix this.
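
To make the inverted test concrete, here is a minimal sketch of how the two forms behave in a loop. The state values and variable names are made up for illustration; the real test reads each coordinator's state from the MDTs and this is not the code from either patch.

```shell
# Hypothetical coordinator (CDT) states, one per MDT.
states="enabled stopped enabled"

would_stop_inverted=""
would_stop_intended=""

for oldstate in $states; do
    # Inverted form: skips every CDT that is NOT stopped, so only
    # already stopped CDTs reach the stop step.
    echo "$oldstate" | grep -q stop || continue
    would_stop_inverted="$would_stop_inverted $oldstate"
done

for oldstate in $states; do
    # Intended form: skips CDTs that are already stopped, so only
    # running CDTs reach the stop step.
    echo "$oldstate" | grep -q stop && continue
    would_stop_intended="$would_stop_intended $oldstate"
done

echo "inverted:$would_stop_inverted"
echo "intended:$would_stop_intended"
```

With the `||` form, the loop body runs only for the entry that is already stopped, which matches the "only stopped CDT will be stopped" behaviour described above; with `&&` it runs for the two running entries.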

People

Assignee: Bruno Faccini (Inactive)

Reporter: Maloo

Votes: 0

Watchers: 7

Dates

Created: 11/Oct/13 4:18 PM

Updated: 15/Sep/14 5:09 PM

Resolved: 10/Dec/13 9:16 AM