[LU-5779] sanity-hsm test_70: Copytool failed to send unregister event to FIFO
Created: 21/Oct/14  Updated: 10/Oct/21  Resolved: 10/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.7.0

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 16218

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/dc915120-590e-11e4-9a49-5254006e85c2.

The sub-test test_70 failed with the following error:

Copytool failed to send unregister event to FIFO

Please provide additional information about the failure here.

Info required for matching: sanity-hsm 70



 Comments   
Comment by Andreas Dilger [ 21/Oct/14 ]

This failed on a b2_5 patch http://review.whamcloud.com/12335

Comment by John Hammond [ 21/Oct/14 ]

The copytool logs from the failures that I found indicate that the SIGINT arrived before the handler had been set up. https://testing.hpdd.intel.com/test_logs/0712f9b2-590f-11e4-9a49-5254006e85c2

lhsmtool_posix[21634]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre

It is not surprising that we have occasional failures here since the signal handling was never sound in the first place.

For comparison, here is a more normal-looking copytool log:

lhsmtool_posix[21063]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre
lhsmtool_posix[21064]: waiting for message from kernel
lhsmtool_posix[21064]: copytool fs=lustre archive#=2 item_count=1
lhsmtool_posix[21064]: waiting for message from kernel
lhsmtool_posix[21065]: '[0x200000401:0x32:0x0]' action ARCHIVE reclen 72, cookie=0x54462c4b
lhsmtool_posix[21065]: processing file 'd60.sanity-hsm/f60.sanity-hsm'
lhsmtool_posix[21065]: archiving '/mnt/lustre/.lustre/fid/0x200000401:0x32:0x0' to '/home/autotest/.autotest/shared_dir/2014-10-20/162443-70189223321920/arc1/0032/0000/0401/0000/0002/0000/0x200000401:0x32:0x0_tmp'
lhsmtool_posix[21065]: saving stripe info of '/mnt/lustre/.lustre/fid/0x200000401:0x32:0x0' in /home/autotest/.autotest/shared_dir/2014-10-20/162443-70189223321920/arc1/0032/0000/0401/0000/0002/0000/0x200000401:0x32:0x0_tmp.lov
lhsmtool_posix[21065]: going to copy data from '/mnt/lustre/.lustre/fid/0x200000401:0x32:0x0' to '/home/autotest/.autotest/shared_dir/2014-10-20/162443-70189223321920/arc1/0032/0000/0401/0000/0002/0000/0x200000401:0x32:0x0_tmp'
lhsmtool_posix[21065]: bandwith control: excess=2.097152E+06 sleep for 2000000us
lhsmtool_posix[21065]: %13 
exiting: Interrupt

Comment by John Hammond [ 21/Oct/14 ]

Please see http://review.whamcloud.com/12367.

Comment by Andreas Dilger [ 24/Oct/14 ]

Patch landed to master for 2.7.0.

Comment by John Hammond [ 09/Apr/15 ]

Still seeing this from time to time. The test is still racy. We could either sleep in the test before calling copytool_cleanup, or block SIGINT in the copytool (CT) until registration completes and the handler is set up.

https://testing.hpdd.intel.com/test_sets/11e86a6e-de63-11e4-9121-5254006e85c2.
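
The second option above (blocking SIGINT in the copytool until the handler is in place) could look roughly like the sketch below. It is only an illustration of the ordering, not the lhsmtool_posix code: register_copytool_stub(), on_sigint(), ct_start_with_sigint_blocked(), ct_stop_requested() and ct_selftest() are hypothetical names, and the stub simulates an early SIGINT with raise() instead of calling any Lustre API.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; a real main loop would poll it and then unregister. */
static volatile sig_atomic_t stop_requested;

static void on_sigint(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Hypothetical stand-in for copytool registration (NOT a Lustre API).
 * It raises SIGINT to simulate the test's signal arriving during startup. */
static int register_copytool_stub(void)
{
    return raise(SIGINT);
}

/* Block SIGINT, register, install the handler, then restore the mask.
 * Returns 0 on success, -1 on error. */
int ct_start_with_sigint_blocked(void)
{
    sigset_t block, saved;
    struct sigaction sa;

    sigemptyset(&block);
    sigaddset(&block, SIGINT);
    if (sigprocmask(SIG_BLOCK, &block, &saved) < 0)
        return -1;

    /* A SIGINT sent from here on stays pending instead of killing us. */
    if (register_copytool_stub() != 0)
        goto fail;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGINT, &sa, NULL) < 0)
        goto fail;

    /* Restoring the mask delivers any pending SIGINT to on_sigint(). */
    if (sigprocmask(SIG_SETMASK, &saved, NULL) < 0)
        return -1;
    return 0;

fail:
    /* With no handler installed, a pending SIGINT takes its default action
     * once the mask is restored. */
    sigprocmask(SIG_SETMASK, &saved, NULL);
    return -1;
}

int ct_stop_requested(void)
{
    return stop_requested;
}

/* Self-check: a SIGINT raised during startup must reach the handler. */
void ct_selftest(void)
{
    assert(ct_start_with_sigint_blocked() == 0);
    assert(ct_stop_requested() == 1);
}
```

With this ordering, a SIGINT sent by the test right after the copytool starts is held pending and only handled once the handler exists, so the process can still take its normal unregister path instead of dying by the default action. The sketch assumes SIGINT was not already blocked by the parent when the copytool was started.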
