[LU-5779] sanity-hsm test_70: Copytool failed to send unregister event to FIFO Created: 21/Oct/14 Updated: 10/Oct/21 Resolved: 10/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 16218 | ||||
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/dc915120-590e-11e4-9a49-5254006e85c2.

The sub-test test_70 failed with the following error:

Copytool failed to send unregister event to FIFO

Please provide additional information about the failure here.

Info required for matching: sanity-hsm 70 |
| Comments |
| Comment by Andreas Dilger [ 21/Oct/14 ] |
|
This failed on a b2_5 patch http://review.whamcloud.com/12335 |
| Comment by John Hammond [ 21/Oct/14 ] |
|
The copytool logs from the failures that I found indicate that the SIGINT arrived before the handler had been set up.

https://testing.hpdd.intel.com/test_logs/0712f9b2-590f-11e4-9a49-5254006e85c2

lhsmtool_posix[21634]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre

It is not surprising that we have occasional failures here, since the signal handling was never sound in the first place. For comparison, here is a more normal-looking copytool log:

lhsmtool_posix[21063]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre
lhsmtool_posix[21064]: waiting for message from kernel
lhsmtool_posix[21064]: copytool fs=lustre archive#=2 item_count=1
lhsmtool_posix[21064]: waiting for message from kernel
lhsmtool_posix[21065]: '[0x200000401:0x32:0x0]' action ARCHIVE reclen 72, cookie=0x54462c4b
lhsmtool_posix[21065]: processing file 'd60.sanity-hsm/f60.sanity-hsm'
lhsmtool_posix[21065]: archiving '/mnt/lustre/.lustre/fid/0x200000401:0x32:0x0' to '/home/autotest/.autotest/shared_dir/2014-10-20/162443-70189223321920/arc1/0032/0000/0401/0000/0002/0000/0x200000401:0x32:0x0_tmp'
lhsmtool_posix[21065]: saving stripe info of '/mnt/lustre/.lustre/fid/0x200000401:0x32:0x0' in /home/autotest/.autotest/shared_dir/2014-10-20/162443-70189223321920/arc1/0032/0000/0401/0000/0002/0000/0x200000401:0x32:0x0_tmp.lov
lhsmtool_posix[21065]: going to copy data from '/mnt/lustre/.lustre/fid/0x200000401:0x32:0x0' to '/home/autotest/.autotest/shared_dir/2014-10-20/162443-70189223321920/arc1/0032/0000/0401/0000/0002/0000/0x200000401:0x32:0x0_tmp'
lhsmtool_posix[21065]: bandwith control: excess=2.097152E+06 sleep for 2000000us
lhsmtool_posix[21065]: %13 exiting: Interrupt |
| Comment by John Hammond [ 21/Oct/14 ] |
|
Please see http://review.whamcloud.com/12367. |
| Comment by Andreas Dilger [ 24/Oct/14 ] |
|
Patch landed to master for 2.7.0. |
| Comment by John Hammond [ 09/Apr/15 ] |
|
Still seeing this from time to time; the test is still racy. We could either sleep in the test before calling copytool_cleanup, or block SIGINT in the copytool (CT) until registration completes and the handler is set up. Recent failure: https://testing.hpdd.intel.com/test_sets/11e86a6e-de63-11e4-9121-5254006e85c2 |
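For illustration only, here is a minimal sketch of the second option above: keep SIGINT blocked until the handler is installed, so that a signal sent early stays pending instead of killing the process. This is not the lhsmtool_posix code or the landed patch. The names on_sigint, got_sigint, block_sigint and install_handler_and_unblock are hypothetical, and the sketch assumes a single-threaded process that uses plain POSIX sigprocmask() and sigaction().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical flag: set by the handler, polled by the main loop,
 * which would then unregister and exit cleanly. */
static volatile sig_atomic_t got_sigint;

static void on_sigint(int signo)
{
	(void)signo;
	got_sigint = 1;
}

/* Block SIGINT and save the previous signal mask in *saved.
 * Intended to run first thing at startup. Returns 0 on success. */
static int block_sigint(sigset_t *saved)
{
	sigset_t set;

	sigemptyset(&set);
	sigaddset(&set, SIGINT);
	return sigprocmask(SIG_BLOCK, &set, saved);
}

/* Install the SIGINT handler, then restore the saved mask.
 * A SIGINT that arrived while blocked is delivered to the handler
 * at this point rather than being handled by the default action. */
static int install_handler_and_unblock(const sigset_t *saved)
{
	struct sigaction sa;

	sa.sa_handler = on_sigint;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGINT, &sa, NULL) < 0)
		return -1;
	return sigprocmask(SIG_SETMASK, saved, NULL);
}
```

In a copytool the intended order would be: call block_sigint() at the very start of main(), perform registration, then call install_handler_and_unblock(). With that ordering, a SIGINT sent by the test at any point after process start reaches the handler only once registration has completed.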