[LU-7136] sanity-hsm test_12q failed with 'Copytool should have stopped' Created: 10/Sep/15 Updated: 28/Mar/17 Resolved: 16/Mar/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | hsm | ||
| Environment: |
autotest |
||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||||||||||||||
| Description |
|
sanity-hsm test 12q failed with 'Copytool should have stopped'. From the test log we can see that the archive succeeded and then: Updated after 2s: wanted 'SUCCEED' got 'SUCCEED' Killing existing copytools on shadow-39vm5 CMD: shadow-39vm5 killall -q lhsmtool_posix CMD: shadow-39vm5 pgrep -x lhsmtool_posix shadow-39vm5: 16199 sanity-hsm test_12q: @@@@@@ FAIL: Copytool should have stopped It’s not clear what the is causing the problem, but there are differences in the copytool logs from a successful run and this failure. From the copytool log on a failed test, we see that the action is ARCHIVE 1441818057.605721 lhsmtool_posix[16199]: waiting for message from kernel 1441818057.605768 lhsmtool_posix[16200]: '[0x200000402:0x12:0x0]' action ARCHIVE reclen 72, cookie=0x55f065c2 1441818057.609298 lhsmtool_posix[16200]: processing file 'd12q.sanity-hsm/f12q.sanity-hsm' 1441818057.618552 lhsmtool_posix[16200]: archiving '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' 1441818057.620731 lhsmtool_posix[16200]: saving stripe info of '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' in /tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp.lov 1441818057.624113 lhsmtool_posix[16200]: start copy of 2097152 bytes from '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' 1441818059.715448 lhsmtool_posix[16200]: copied 2097152 bytes in 2.094523 seconds 1441818059.841860 lhsmtool_posix[16200]: data archiving for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' done 1441818059.842153 lhsmtool_posix[16200]: attr file for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' saved to archive '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' 1441818059.844232 lhsmtool_posix[16200]: fsetxattr of 'trusted.hsm' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success) 1441818059.844289 lhsmtool_posix[16200]: fsetxattr of 'trusted.version' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success) 1441818059.844356 lhsmtool_posix[16200]: fsetxattr of 'trusted.link' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success) 1441818059.844403 lhsmtool_posix[16200]: fsetxattr of 'trusted.lov' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success) 1441818059.844442 lhsmtool_posix[16200]: fsetxattr of 'trusted.lma' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success) 1441818059.844480 lhsmtool_posix[16200]: fsetxattr of 'lustre.lov' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=-1 (Operation not supported) 1441818059.844493 lhsmtool_posix[16200]: xattr file for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' saved to archive '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' 1441818059.846873 lhsmtool_posix[16200]: symlink '/tmp/arc1/shadow/d12q.sanity-hsm/f12q.sanity-hsm' to '../../0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0' done 1441818059.848407 lhsmtool_posix[16200]: Action completed, notifying coordinator cookie=0x55f065c2, FID=[0x200000402:0x12:0x0], hp_flags=0 err=0 1441818059.852138 lhsmtool_posix[16200]: llapi_hsm_action_end() on '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' ok (rc=0) exiting: Terminated On a successful run of test 12q (https://testing.hpdd.intel.com/sub_tests/4d322136-57d1-11e5-a084-5254006e85c2), the copytool logs does not show an ARCHIVE taking place, but only a RESTORE is being done: 1441882443.432895 lhsmtool_posix[5788]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre3
1441882443.441409 lhsmtool_posix[5789]: waiting for message from kernel
1441882446.704674 lhsmtool_posix[5789]: copytool fs=lustre archive#=2 item_count=1
1441882446.704837 lhsmtool_posix[5789]: waiting for message from kernel
1441882446.705275 lhsmtool_posix[5790]: '[0x200000405:0x12:0x0]' action RESTORE reclen 72, cookie=0x55f16107
1441882446.707572 lhsmtool_posix[5790]: processing file 'd12q.sanity-hsm/f12q.sanity-hsm'
1441882446.707649 lhsmtool_posix[5790]: reading stripe rules from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0.lov' for '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0'
1441882446.722704 lhsmtool_posix[5790]: restoring data from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]'
1441882446.729598 lhsmtool_posix[5790]: start copy of 2097152 bytes from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]'
1441882448.752649 lhsmtool_posix[5790]: copied 2097152 bytes in 2.026376 seconds
1441882448.752715 lhsmtool_posix[5790]: data restore from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]' done
1441882448.752730 lhsmtool_posix[5790]: Action completed, notifying coordinator cookie=0x55f16107, FID=[0x200000405:0x12:0x0], hp_flags=0 err=0
1441882448.836919 lhsmtool_posix[5790]: llapi_hsm_action_end() on '/mnt/lustre3/.lustre/fid/0x200000405:0x12:0x0' ok (rc=0)
exiting: Interrupt
Sanity-hsm test 12q does both an hsm_archive and a hsm_release. We’ve seen this test fail twice recently. Logs are at Prior to that, sanity-hsm test 12q failed in the same way five times in August. Here is one of those logs: |
| Comments |
| Comment by Aditya Pandit (Inactive) [ 15/Sep/15 ] |
|
I debugged a similar issue. What was happening was that pgrep -x was not returning proper result and was returning 1. This triggers the trap which again tries to kill the process. Also there is while loop with done < <(echo $"$(get_copytool_event_log)"). This also caused a problem while running the test. I replaced both of them and the tests executed properly. Can you please also tell the version of bash and the distribution you are using here?, as once I had got an error due to incorrect bash version. |
| Comment by James Nunez (Inactive) [ 27/Oct/15 ] |
|
Sorry for the delay and thank you for the comments. On the client, I can see that it was running bash-4.1.2-33.el6.x86_64. The nodes are running CentOS release 6.6 or 6.7. Another instance where nodes are running bash-4.1.2-33.el6_7.1.x86_64 and CentOS release 6.7. Logs at https://testing.hpdd.intel.com/test_sets/91d54f9a-7c6e-11e5-b827-5254006e85c2 |
| Comment by Jian Yu [ 04/Nov/15 ] |
|
More instance on master branch: |
| Comment by Bob Glossman (Inactive) [ 10/Nov/15 ] |
|
another on master: |
| Comment by nasf (Inactive) [ 02/Dec/15 ] |
|
Another failure instance on master: |
| Comment by Gerrit Updater [ 07/Dec/15 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/17499 |
| Comment by James Nunez (Inactive) [ 09/Dec/15 ] |
|
More failures on master: |
| Comment by John Hammond [ 11/Dec/15 ] |
|
This is due to NFS being slow on the test clusters. When the copytool is signaled to exit it writes to and closes its log (which is on NFS). Looking at the copytool logs and test logs shows that it is exiting properly just not in 20 or in 5 seconds. |
| Comment by Gerrit Updater [ 11/Dec/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17499/ |
| Comment by Joseph Gmitter (Inactive) [ 16/Mar/16 ] |
|
Ticket cleanup: The patch had landed to master for the 2.8.0 release in Dec '15 and no further patches/work needed on this ticket. |