Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.8.0
-
autotest
-
3
-
9223372036854775807
Description
sanity-hsm test 12q failed with 'Copytool should have stopped'. From the test log we can see that the archive succeeded and then:
Updated after 2s: wanted 'SUCCEED' got 'SUCCEED' Killing existing copytools on shadow-39vm5 CMD: shadow-39vm5 killall -q lhsmtool_posix CMD: shadow-39vm5 pgrep -x lhsmtool_posix shadow-39vm5: 16199 sanity-hsm test_12q: @@@@@@ FAIL: Copytool should have stopped
It’s not clear what the is causing the problem, but there are differences in the copytool logs from a successful run and this failure. From the copytool log on a failed test, we see that the action is ARCHIVE
1441818057.605721 lhsmtool_posix[16199]: waiting for message from kernel 1441818057.605768 lhsmtool_posix[16200]: '[0x200000402:0x12:0x0]' action ARCHIVE reclen 72, cookie=0x55f065c2 1441818057.609298 lhsmtool_posix[16200]: processing file 'd12q.sanity-hsm/f12q.sanity-hsm' 1441818057.618552 lhsmtool_posix[16200]: archiving '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' 1441818057.620731 lhsmtool_posix[16200]: saving stripe info of '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' in /tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp.lov 1441818057.624113 lhsmtool_posix[16200]: start copy of 2097152 bytes from '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' 1441818059.715448 lhsmtool_posix[16200]: copied 2097152 bytes in 2.094523 seconds 1441818059.841860 lhsmtool_posix[16200]: data archiving for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' done 1441818059.842153 lhsmtool_posix[16200]: attr file for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' saved to archive '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' 1441818059.844232 lhsmtool_posix[16200]: fsetxattr of 'trusted.hsm' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success) 1441818059.844289 lhsmtool_posix[16200]: fsetxattr of 'trusted.version' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success) 1441818059.844356 lhsmtool_posix[16200]: fsetxattr of 'trusted.link' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success) 1441818059.844403 lhsmtool_posix[16200]: fsetxattr of 'trusted.lov' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success) 1441818059.844442 lhsmtool_posix[16200]: fsetxattr of 'trusted.lma' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success) 1441818059.844480 lhsmtool_posix[16200]: fsetxattr of 'lustre.lov' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=-1 (Operation not supported) 1441818059.844493 lhsmtool_posix[16200]: xattr file for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' saved to archive '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' 1441818059.846873 lhsmtool_posix[16200]: symlink '/tmp/arc1/shadow/d12q.sanity-hsm/f12q.sanity-hsm' to '../../0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0' done 1441818059.848407 lhsmtool_posix[16200]: Action completed, notifying coordinator cookie=0x55f065c2, FID=[0x200000402:0x12:0x0], hp_flags=0 err=0 1441818059.852138 lhsmtool_posix[16200]: llapi_hsm_action_end() on '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' ok (rc=0) exiting: Terminated
On a successful run of test 12q (https://testing.hpdd.intel.com/sub_tests/4d322136-57d1-11e5-a084-5254006e85c2), the copytool logs does not show an ARCHIVE taking place, but only a RESTORE is being done:
1441882443.432895 lhsmtool_posix[5788]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre3 1441882443.441409 lhsmtool_posix[5789]: waiting for message from kernel 1441882446.704674 lhsmtool_posix[5789]: copytool fs=lustre archive#=2 item_count=1 1441882446.704837 lhsmtool_posix[5789]: waiting for message from kernel 1441882446.705275 lhsmtool_posix[5790]: '[0x200000405:0x12:0x0]' action RESTORE reclen 72, cookie=0x55f16107 1441882446.707572 lhsmtool_posix[5790]: processing file 'd12q.sanity-hsm/f12q.sanity-hsm' 1441882446.707649 lhsmtool_posix[5790]: reading stripe rules from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0.lov' for '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' 1441882446.722704 lhsmtool_posix[5790]: restoring data from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]' 1441882446.729598 lhsmtool_posix[5790]: start copy of 2097152 bytes from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]' 1441882448.752649 lhsmtool_posix[5790]: copied 2097152 bytes in 2.026376 seconds 1441882448.752715 lhsmtool_posix[5790]: data restore from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]' done 1441882448.752730 lhsmtool_posix[5790]: Action completed, notifying coordinator cookie=0x55f16107, FID=[0x200000405:0x12:0x0], hp_flags=0 err=0 1441882448.836919 lhsmtool_posix[5790]: llapi_hsm_action_end() on '/mnt/lustre3/.lustre/fid/0x200000405:0x12:0x0' ok (rc=0) exiting: Interrupt
Sanity-hsm test 12q does both an hsm_archive and a hsm_release.
We’ve seen this test fail twice recently. Logs are at
2015-09-09 16:53:45 – review-zfs- part-1 - https://testing.hpdd.intel.com/test_sets/ff615ed0-5727-11e5-84d0-5254006e85c2
2015-09-05 23:31:22 - review-dne-part-2 - https://testing.hpdd.intel.com/test_sets/6c2cd9fe-5453-11e5-b25e-5254006e85c2
Prior to that, sanity-hsm test 12q failed in the same way five times in August. Here is one of those logs:
2015-08-08 03:39:47 - review-dne-part-2 - https://testing.hpdd.intel.com/test_sets/8724f110-3dab-11e5-be57-5254006e85c2
Attachments
Issue Links
- is duplicated by
-
LU-6572 simplify copytool kill and cleanup functions in sanity-hsm
- Resolved
-
LU-7273 sanity-hsm test_107 failed with 'Copytool failed to stop in 20s ...'
- Resolved
-
LU-6771 sanity-hsm test_104: Copytool failed to stop in 20s
- Closed
-
LU-7150 sanity-hsm test_250: Copytool failed to stop in 20s
- Closed
-
LU-7154 sanity-hsm test_12b: Copytool failed to stop in 20s ...
- Closed
-
LU-7341 sanity-hsm test_221: Copytool failed to stop in 20s
- Closed
- is related to
-
LU-7881 sanity-hsm test_26b: @@@@@@ FAIL: Copytool should have stopped
- Resolved