Details
Bug
Resolution: Fixed
Critical
Lustre 2.5.0
3
10999
Description
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>
This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/04c602fa-3258-11e3-9de6-52540035b04c.
The sub-test test_24d failed with the following error:
Cannot send HSM request (use of /mnt/lustre2/d0.sanity-hsm/d24/f.sanity-hsm.24d): Read-only file system
:
:
CMD: wtm-27vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200008101:0x1f:0x0'.*action='ARCHIVE'/ { print $13 }' | cut -f2 -d=
Update not seen after 100s: wanted 'SUCCEED' got 'WAITING'
sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200008101:0x1f:0x0 is not SUCCEED
Info required for matching: sanity-hsm 24d
Hello Jian, hmm, this looks like a different problem I think, at least because in all the cases you noticed the copytool_log is not present for test_24d!
And if I look closely at the logs/messages, "Wakeup copytool agt1 on ..." should not be printed after a successful copytool_cleanup (from the previous sub-test) followed by copytool_setup (within the new sub-test), because the implied "pkill" command should have failed if no copytool thread from the previous sub-test could be found after its stop/kill in copytool_cleanup.
So a possible scenario is that some copytool thread from the previous test_24c sub-test was stuck and still not terminated when test_24d tried to restart the copytool... but it eventually died, so we may in fact have run test_24d with no copytool started!
So this should be addressed in a new ticket, and the fix should be to ensure, in copytool_cleanup(), that ALL copytool threads are fully dead.
If you agree, I will create the ticket, assign it to myself, add you to the watchers list, and push a fix soon.
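To illustrate the kind of cleanup meant here, a rough sketch follows. It is not the actual patch: the helper name kill_and_wait, the default timeout, and the choice of SIGINT followed by SIGKILL are all assumptions made for the example. The idea is simply to keep polling after the kill until no matching process remains, instead of returning right after the signal is sent.

```shell
# Hypothetical helper (not from the Lustre test framework): signal every
# process named "$1", then wait up to "$2" seconds for all of them to
# exit, escalating to SIGKILL if the grace period runs out.
kill_and_wait() {
    local name=$1
    local timeout=${2:-20}   # assumed default grace period, in seconds
    local waited=0

    # pkill exits non-zero when no process matched: nothing to clean up.
    pkill -INT -x "$name" || return 0

    while pgrep -x "$name" > /dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            # Still alive after the grace period: force termination.
            pkill -KILL -x "$name"
            sleep 1
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done

    # Succeed only if no matching process is left.
    ! pgrep -x "$name" > /dev/null
}
```

It might be called as `kill_and_wait lhsmtool_posix 20` on each agent node (assuming lhsmtool_posix is the copytool in use), with a non-zero return treated as a cleanup failure so that the next sub-test does not start against a half-dead copytool.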