[LU-7136] sanity-hsm test_12q failed with 'Copytool should have stopped' Created: 10/Sep/15  Updated: 28/Mar/17  Resolved: 16/Mar/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: hsm
Environment:

autotest


Issue Links:
Duplicate
is duplicated by LU-6572 simplify copytool kill and cleanup fu... Resolved
is duplicated by LU-7273 sanity-hsm test_107 failed with 'Copy... Resolved
is duplicated by LU-6771 sanity-hsm test_104: Copytool failed ... Closed
is duplicated by LU-7150 sanity-hsm test_250: Copytool failed ... Closed
is duplicated by LU-7154 sanity-hsm test_12b: Copytool failed ... Closed
is duplicated by LU-7341 sanity-hsm test_221: Copytool failed ... Closed
Related
is related to LU-7881 sanity-hsm test_26b: @@@@@@ FAIL: Co... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-hsm test 12q failed with 'Copytool should have stopped'. From the test log we can see that the archive succeeded and then:

Updated after 2s: wanted 'SUCCEED' got 'SUCCEED'
Killing existing copytools on shadow-39vm5
CMD: shadow-39vm5 killall -q lhsmtool_posix
CMD: shadow-39vm5 pgrep -x lhsmtool_posix
shadow-39vm5: 16199
 sanity-hsm test_12q: @@@@@@ FAIL: Copytool should have stopped 

It’s not clear what the is causing the problem, but there are differences in the copytool logs from a successful run and this failure. From the copytool log on a failed test, we see that the action is ARCHIVE

1441818057.605721 lhsmtool_posix[16199]: waiting for message from kernel
1441818057.605768 lhsmtool_posix[16200]: '[0x200000402:0x12:0x0]' action ARCHIVE reclen 72, cookie=0x55f065c2
1441818057.609298 lhsmtool_posix[16200]: processing file 'd12q.sanity-hsm/f12q.sanity-hsm'
1441818057.618552 lhsmtool_posix[16200]: archiving '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp'
1441818057.620731 lhsmtool_posix[16200]: saving stripe info of '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' in /tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp.lov
1441818057.624113 lhsmtool_posix[16200]: start copy of 2097152 bytes from '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp'
1441818059.715448 lhsmtool_posix[16200]: copied 2097152 bytes in 2.094523 seconds
1441818059.841860 lhsmtool_posix[16200]: data archiving for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' done
1441818059.842153 lhsmtool_posix[16200]: attr file for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' saved to archive '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp'
1441818059.844232 lhsmtool_posix[16200]: fsetxattr of 'trusted.hsm' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success)
1441818059.844289 lhsmtool_posix[16200]: fsetxattr of 'trusted.version' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success)
1441818059.844356 lhsmtool_posix[16200]: fsetxattr of 'trusted.link' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success)
1441818059.844403 lhsmtool_posix[16200]: fsetxattr of 'trusted.lov' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success)
1441818059.844442 lhsmtool_posix[16200]: fsetxattr of 'trusted.lma' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success)
1441818059.844480 lhsmtool_posix[16200]: fsetxattr of 'lustre.lov' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=-1 (Operation not supported)
1441818059.844493 lhsmtool_posix[16200]: xattr file for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' saved to archive '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp'
1441818059.846873 lhsmtool_posix[16200]: symlink '/tmp/arc1/shadow/d12q.sanity-hsm/f12q.sanity-hsm' to '../../0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0' done
1441818059.848407 lhsmtool_posix[16200]: Action completed, notifying coordinator cookie=0x55f065c2, FID=[0x200000402:0x12:0x0], hp_flags=0 err=0
1441818059.852138 lhsmtool_posix[16200]: llapi_hsm_action_end() on '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' ok (rc=0)
exiting: Terminated

On a successful run of test 12q (https://testing.hpdd.intel.com/sub_tests/4d322136-57d1-11e5-a084-5254006e85c2), the copytool logs does not show an ARCHIVE taking place, but only a RESTORE is being done:

1441882443.432895 lhsmtool_posix[5788]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre3
1441882443.441409 lhsmtool_posix[5789]: waiting for message from kernel
1441882446.704674 lhsmtool_posix[5789]: copytool fs=lustre archive#=2 item_count=1
1441882446.704837 lhsmtool_posix[5789]: waiting for message from kernel
1441882446.705275 lhsmtool_posix[5790]: '[0x200000405:0x12:0x0]' action RESTORE reclen 72, cookie=0x55f16107
1441882446.707572 lhsmtool_posix[5790]: processing file 'd12q.sanity-hsm/f12q.sanity-hsm'
1441882446.707649 lhsmtool_posix[5790]: reading stripe rules from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0.lov' for '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0'
1441882446.722704 lhsmtool_posix[5790]: restoring data from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]'
1441882446.729598 lhsmtool_posix[5790]: start copy of 2097152 bytes from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]'
1441882448.752649 lhsmtool_posix[5790]: copied 2097152 bytes in 2.026376 seconds
1441882448.752715 lhsmtool_posix[5790]: data restore from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]' done
1441882448.752730 lhsmtool_posix[5790]: Action completed, notifying coordinator cookie=0x55f16107, FID=[0x200000405:0x12:0x0], hp_flags=0 err=0
1441882448.836919 lhsmtool_posix[5790]: llapi_hsm_action_end() on '/mnt/lustre3/.lustre/fid/0x200000405:0x12:0x0' ok (rc=0)
exiting: Interrupt

Sanity-hsm test 12q does both an hsm_archive and a hsm_release.

We’ve seen this test fail twice recently. Logs are at
2015-09-09 16:53:45 – review-zfs- part-1 - https://testing.hpdd.intel.com/test_sets/ff615ed0-5727-11e5-84d0-5254006e85c2
2015-09-05 23:31:22 - review-dne-part-2 - https://testing.hpdd.intel.com/test_sets/6c2cd9fe-5453-11e5-b25e-5254006e85c2

Prior to that, sanity-hsm test 12q failed in the same way five times in August. Here is one of those logs:
2015-08-08 03:39:47 - review-dne-part-2 - https://testing.hpdd.intel.com/test_sets/8724f110-3dab-11e5-be57-5254006e85c2



 Comments   
Comment by Aditya Pandit (Inactive) [ 15/Sep/15 ]

I debugged a similar issue. What was happening was that pgrep -x was not returning proper result and was returning 1. This triggers the trap which again tries to kill the process. Also there is while loop with done < <(echo $"$(get_copytool_event_log)"). This also caused a problem while running the test. I replaced both of them and the tests executed properly. Can you please also tell the version of bash and the distribution you are using here?, as once I had got an error due to incorrect bash version.

Comment by James Nunez (Inactive) [ 27/Oct/15 ]

Sorry for the delay and thank you for the comments. On the client, I can see that it was running bash-4.1.2-33.el6.x86_64. The nodes are running CentOS release 6.6 or 6.7.

Another instance where nodes are running bash-4.1.2-33.el6_7.1.x86_64 and CentOS release 6.7. Logs at https://testing.hpdd.intel.com/test_sets/91d54f9a-7c6e-11e5-b827-5254006e85c2

Comment by Jian Yu [ 04/Nov/15 ]

More instance on master branch:
https://testing.hpdd.intel.com/test_sets/fa63b68c-8321-11e5-93d9-5254006e85c2

Comment by Bob Glossman (Inactive) [ 10/Nov/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/760a5892-8799-11e5-b56a-5254006e85c2

Comment by nasf (Inactive) [ 02/Dec/15 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/0440dcbe-98d7-11e5-8079-5254006e85c2

Comment by Gerrit Updater [ 07/Dec/15 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/17499
Subject: LU-7136 test: allow more time for copytools to stop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0f60d4aa0d78a9b8e4535a0a462b3ebcf21248f3

Comment by James Nunez (Inactive) [ 09/Dec/15 ]

More failures on master:
2015-12-08 19:57:12 - https://testing.hpdd.intel.com/test_sets/b537a4fc-9e0f-11e5-8ea0-5254006e85c2
2015-12-08 22:10:24 - https://testing.hpdd.intel.com/test_sets/974cee50-9e25-11e5-b163-5254006e85c2
2015-12-09 01:35:26 - https://testing.hpdd.intel.com/test_sets/8126dbac-9e22-11e5-8ea0-5254006e85c2
2015-12-09 08:03:50 - https://testing.hpdd.intel.com/test_sets/30d0eef4-9e7c-11e5-87a9-5254006e85c2
2015-12-10 01:53:33 - https://testing.hpdd.intel.com/test_sets/e96acb88-9eef-11e5-ba94-5254006e85c2

Comment by John Hammond [ 11/Dec/15 ]

This is due to NFS being slow on the test clusters. When the copytool is signaled to exit it writes to and closes its log (which is on NFS). Looking at the copytool logs and test logs shows that it is exiting properly just not in 20 or in 5 seconds.

Comment by Gerrit Updater [ 11/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17499/
Subject: LU-7136 test: allow more time for copytools to stop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f85eaefe2c668a962ca0980fd827b0c5f11f99ea

Comment by Joseph Gmitter (Inactive) [ 16/Mar/16 ]

Ticket cleanup: The patch had landed to master for the 2.8.0 release in Dec '15 and no further patches/work needed on this ticket.

Generated at Sat Feb 10 02:06:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.