[LU-4126] sanity-hsm test_15 failure: 'requests did not complete' Created: 18/Oct/13  Updated: 19/Mar/14  Resolved: 19/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Duplicate Votes: 0
Labels: HSM
Environment:

Lustre 2.5.0-RC1, el6

OpenSFS cluster with combined MGS/MDS (c03), single OSS (c04) with two OSTs, archive MGS/MDS (c05), archive OST (c06) with two OSTs, archive OST2 (c07) with two OSTs, eight clients; one agent + client(c08), one robinhood/db + client(c09) and others just running as Lustre clients (c10, c11, c12, c13,c14, c15)


Issue Links:
Related
is related to LU-4093 sanity-hsm test_24d: wanted 'SUCCEED'... Resolved
is related to LU-3939 Test failure on test suite sanity-hsm... Resolved
Severity: 3
Rank (Obsolete): 11135

 Description   

The test results are at: https://maloo.whamcloud.com/test_sessions/af9dfc14-3834-11e3-8bc4-52540035b04c

From the test_log:

== sanity-hsm test 15: Rebind a list of files == 07:44:55 (1382107495)
pdsh@c15: c08: ssh exited with exit code 1
Purging archive on c08
Starting copytool agt1 on c08
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.315725 s, 6.6 MB/s
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.310107 s, 6.8 MB/s
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.305355 s, 6.9 MB/s
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.3105 s, 6.8 MB/s
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.311318 s, 6.7 MB/s
Changed after 0s: from '' to 'lrh=[type=10680000 len=136 idx=3/28] fid=[0x20000040b:0x74:0x0] dfid=[0x20000040b:0x74:0x0] compound/cookie=0x5260275d/0x5260275d action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/106] fid=[0x200000408:0x230:0x0] dfid=[0x200000408:0x230:0x0] compound/cookie=0x5260277d/0x52602777 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/109] fid=[0x200000408:0x231:0x0] dfid=[0x200000408:0x231:0x0] compound/cookie=0x5260277e/0x52602778 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/110] fid=[0x200000408:0x232:0x0] dfid=[0x200000408:0x232:0x0] compound/cookie=0x5260277f/0x52602779 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
lrh=[type=10680000 len=136 idx=3/111] fid=[0x200000408:0x233:0x0] dfid=[0x200000408:0x233:0x0] compound/cookie=0x52602780/0x5260277a action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]'
Waiting 300 secs for update
Changed after 4s: from 'lrh=[type=10680000 len=136 idx=3/28] fid=[0x20000040b:0x74:0x0] dfid=[0x20000040b:0x74:0x0] compound/cookie=0x5260275d/0x5260275d action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/106] fid=[0x200000408:0x230:0x0] dfid=[0x200000408:0x230:0x0] compound/cookie=0x5260277d/0x52602777 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/109] fid=[0x200000408:0x231:0x0] dfid=[0x200000408:0x231:0x0] compound/cookie=0x5260277e/0x52602778 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/110] fid=[0x200000408:0x232:0x0] dfid=[0x200000408:0x232:0x0] compound/cookie=0x5260277f/0x52602779 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
lrh=[type=10680000 len=136 idx=3/111] fid=[0x200000408:0x233:0x0] dfid=[0x200000408:0x233:0x0] compound/cookie=0x52602780/0x5260277a action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]' to 'lrh=[type=10680000 len=136 idx=3/28] fid=[0x20000040b:0x74:0x0] dfid=[0x20000040b:0x74:0x0] compound/cookie=0x5260275d/0x5260275d action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/113] fid=[0x200000408:0x233:0x0] dfid=[0x200000408:0x233:0x0] compound/cookie=0x52602780/0x5260277a action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/115] fid=[0x200000408:0x232:0x0] dfid=[0x200000408:0x232:0x0] compound/cookie=0x5260277f/0x52602779 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]'
Changed after 7s: from 'lrh=[type=10680000 len=136 idx=3/28] fid=[0x20000040b:0x74:0x0] dfid=[0x20000040b:0x74:0x0] compound/cookie=0x5260275d/0x5260275d action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/113] fid=[0x200000408:0x233:0x0] dfid=[0x200000408:0x233:0x0] compound/cookie=0x52602780/0x5260277a action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/115] fid=[0x200000408:0x232:0x0] dfid=[0x200000408:0x232:0x0] compound/cookie=0x5260277f/0x52602779 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]' to 'lrh=[type=10680000 len=136 idx=3/28] fid=[0x20000040b:0x74:0x0] dfid=[0x20000040b:0x74:0x0] compound/cookie=0x5260275d/0x5260275d action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/115] fid=[0x200000408:0x232:0x0] dfid=[0x200000408:0x232:0x0] compound/cookie=0x5260277f/0x52602779 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]'
Changed after 8s: from 'lrh=[type=10680000 len=136 idx=3/28] fid=[0x20000040b:0x74:0x0] dfid=[0x20000040b:0x74:0x0] compound/cookie=0x5260275d/0x5260275d action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=3/115] fid=[0x200000408:0x232:0x0] dfid=[0x200000408:0x232:0x0] compound/cookie=0x5260277f/0x52602779 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]' to 'lrh=[type=10680000 len=136 idx=3/28] fid=[0x20000040b:0x74:0x0] dfid=[0x20000040b:0x74:0x0] compound/cookie=0x5260275d/0x5260275d action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]'
Waiting 290 secs for update
Waiting 280 secs for update
Waiting 270 secs for update
Waiting 260 secs for update
Waiting 250 secs for update
Waiting 240 secs for update
Waiting 230 secs for update
Waiting 220 secs for update
Waiting 210 secs for update
Waiting 200 secs for update
Waiting 190 secs for update
Waiting 180 secs for update
Waiting 170 secs for update
Waiting 160 secs for update
Waiting 150 secs for update
Waiting 140 secs for update
Waiting 130 secs for update
Waiting 120 secs for update
Waiting 110 secs for update
Waiting 100 secs for update
Waiting 90 secs for update
Waiting 80 secs for update
Waiting 70 secs for update
Waiting 60 secs for update
Waiting 50 secs for update
Waiting 40 secs for update
Waiting 30 secs for update
Waiting 20 secs for update
Waiting 10 secs for update
Update not seen after 300s: wanted '' got 'lrh=[type=10680000 len=136 idx=3/28] fid=[0x20000040b:0x74:0x0] dfid=[0x20000040b:0x74:0x0] compound/cookie=0x5260275d/0x5260275d action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]'
 sanity-hsm test_15: @@@@@@ FAIL: requests did not complete 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:500:wait_all_done()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1161:test_15()
  = /usr/lib64/lustre/tests/test-framework.sh:4530:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4563:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4433:run_test()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:1195:main()
Dumping lctl log to /tmp/test_logs/2013-10-18/074316/sanity-hsm.test_15.*.1382107848.log
Copytool is stopped on c08

The copytool_log for this test is below, but does not contain the last fid (0x20000040b:0x74:0x0) listed in the test_log.

lhsmtool_posix[7609]: action=0 src=(null) dst=(null) mount_point=/lustre/scratch
lhsmtool_posix[7610]: waiting for message from kernel
lhsmtool_posix[7610]: copytool fs=scratch archive#=3 item_count=1
lhsmtool_posix[7610]: waiting for message from kernel
lhsmtool_posix[7611]: '[0x200000408:0x22f:0x0]' action ARCHIVE reclen 72, cookie=0x52602776
lhsmtool_posix[7611]: processing file 'd0.sanity-hsm/d15/f.sanity-hsm.15.1'
lhsmtool_posix[7611]: archiving '/lustre/scratch/.lustre/fid/0x200000408:0x22f:0x0' to '/lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp'
lhsmtool_posix[7611]: saving stripe info of '/lustre/scratch/.lustre/fid/0x200000408:0x22f:0x0' in /lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp.lov
lhsmtool_posix[7611]: going to copy data from '/lustre/scratch/.lustre/fid/0x200000408:0x22f:0x0' to '/lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp'
lhsmtool_posix[7611]: Going to copy 2097152 bytes /lustre/scratch/.lustre/fid/0x200000408:0x22f:0x0 -> /lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp

lhsmtool_posix[7610]: copytool fs=scratch archive#=3 item_count=1
lhsmtool_posix[7610]: waiting for message from kernel
lhsmtool_posix[7612]: '[0x200000408:0x230:0x0]' action ARCHIVE reclen 72, cookie=0x52602777
lhsmtool_posix[7612]: processing file 'd0.sanity-hsm/d15/f.sanity-hsm.15.2'
lhsmtool_posix[7612]: archiving '/lustre/scratch/.lustre/fid/0x200000408:0x230:0x0' to '/lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp'
lhsmtool_posix[7612]: saving stripe info of '/lustre/scratch/.lustre/fid/0x200000408:0x230:0x0' in /lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp.lov
lhsmtool_posix[7612]: going to copy data from '/lustre/scratch/.lustre/fid/0x200000408:0x230:0x0' to '/lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp'
lhsmtool_posix[7612]: Going to copy 2097152 bytes /lustre/scratch/.lustre/fid/0x200000408:0x230:0x0 -> /lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp

lhsmtool_posix[7611]: data archiving for '/lustre/scratch/.lustre/fid/0x200000408:0x22f:0x0' to '/lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp' done
lhsmtool_posix[7611]: attr file for '/lustre/scratch/.lustre/fid/0x200000408:0x22f:0x0' saved to archive '/lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp'
lhsmtool_posix[7611]: fsetxattr of 'trusted.hsm' on '/lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7611]: fsetxattr of 'trusted.link' on '/lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7611]: fsetxattr of 'trusted.lov' on '/lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7611]: fsetxattr of 'trusted.lma' on '/lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7611]: fsetxattr of 'lustre.lov' on '/lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7611]: xattr file for '/lustre/scratch/.lustre/fid/0x200000408:0x22f:0x0' saved to archive '/lustre/archive/022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0_tmp'
lhsmtool_posix[7611]: symlink '/lustre/archive/shadow/d0.sanity-hsm/d15/f.sanity-hsm.15.1' to '../../../022f/0000/0408/0000/0002/0000/0x200000408:0x22f:0x0' done
lhsmtool_posix[7611]: Action completed, notifying coordinator cookie=0x52602776, FID=[0x200000408:0x22f:0x0], hp_flags=0 err=0
lhsmtool_posix[7611]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x200000408:0x22f:0x0' ok (rc=0)
lhsmtool_posix[7610]: copytool fs=scratch archive#=3 item_count=1
lhsmtool_posix[7610]: waiting for message from kernel
lhsmtool_posix[7613]: '[0x200000408:0x231:0x0]' action ARCHIVE reclen 72, cookie=0x52602778
lhsmtool_posix[7613]: processing file 'd0.sanity-hsm/d15/f.sanity-hsm.15.3'
lhsmtool_posix[7613]: archiving '/lustre/scratch/.lustre/fid/0x200000408:0x231:0x0' to '/lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp'
lhsmtool_posix[7613]: saving stripe info of '/lustre/scratch/.lustre/fid/0x200000408:0x231:0x0' in /lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp.lov
lhsmtool_posix[7613]: going to copy data from '/lustre/scratch/.lustre/fid/0x200000408:0x231:0x0' to '/lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp'
lhsmtool_posix[7613]: Going to copy 2097152 bytes /lustre/scratch/.lustre/fid/0x200000408:0x231:0x0 -> /lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp

lhsmtool_posix[7613]: bandwith control: excess=3.145728E+06 sleep for 3000000us
lhsmtool_posix[7612]: data archiving for '/lustre/scratch/.lustre/fid/0x200000408:0x230:0x0' to '/lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp' done
lhsmtool_posix[7612]: attr file for '/lustre/scratch/.lustre/fid/0x200000408:0x230:0x0' saved to archive '/lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp'
lhsmtool_posix[7612]: fsetxattr of 'trusted.hsm' on '/lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7612]: fsetxattr of 'trusted.link' on '/lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7612]: fsetxattr of 'trusted.lov' on '/lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7612]: fsetxattr of 'trusted.lma' on '/lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7612]: fsetxattr of 'lustre.lov' on '/lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7612]: xattr file for '/lustre/scratch/.lustre/fid/0x200000408:0x230:0x0' saved to archive '/lustre/archive/0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0_tmp'
lhsmtool_posix[7612]: symlink '/lustre/archive/shadow/d0.sanity-hsm/d15/f.sanity-hsm.15.2' to '../../../0230/0000/0408/0000/0002/0000/0x200000408:0x230:0x0' done
lhsmtool_posix[7612]: Action completed, notifying coordinator cookie=0x52602777, FID=[0x200000408:0x230:0x0], hp_flags=0 err=0
lhsmtool_posix[7612]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x200000408:0x230:0x0' ok (rc=0)
lhsmtool_posix[7610]: copytool fs=scratch archive#=3 item_count=1
lhsmtool_posix[7610]: waiting for message from kernel
lhsmtool_posix[7614]: '[0x200000408:0x233:0x0]' action ARCHIVE reclen 72, cookie=0x5260277a
lhsmtool_posix[7614]: processing file 'd0.sanity-hsm/d15/f.sanity-hsm.15.5'
lhsmtool_posix[7614]: archiving '/lustre/scratch/.lustre/fid/0x200000408:0x233:0x0' to '/lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp'
lhsmtool_posix[7614]: saving stripe info of '/lustre/scratch/.lustre/fid/0x200000408:0x233:0x0' in /lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp.lov
lhsmtool_posix[7614]: going to copy data from '/lustre/scratch/.lustre/fid/0x200000408:0x233:0x0' to '/lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp'
lhsmtool_posix[7614]: Going to copy 2097152 bytes /lustre/scratch/.lustre/fid/0x200000408:0x233:0x0 -> /lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp

lhsmtool_posix[7613]: data archiving for '/lustre/scratch/.lustre/fid/0x200000408:0x231:0x0' to '/lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp' done
lhsmtool_posix[7613]: attr file for '/lustre/scratch/.lustre/fid/0x200000408:0x231:0x0' saved to archive '/lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp'
lhsmtool_posix[7613]: fsetxattr of 'trusted.hsm' on '/lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7613]: fsetxattr of 'trusted.link' on '/lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7613]: fsetxattr of 'trusted.lov' on '/lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7613]: fsetxattr of 'trusted.lma' on '/lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7613]: fsetxattr of 'lustre.lov' on '/lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7613]: xattr file for '/lustre/scratch/.lustre/fid/0x200000408:0x231:0x0' saved to archive '/lustre/archive/0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0_tmp'
lhsmtool_posix[7613]: symlink '/lustre/archive/shadow/d0.sanity-hsm/d15/f.sanity-hsm.15.3' to '../../../0231/0000/0408/0000/0002/0000/0x200000408:0x231:0x0' done
lhsmtool_posix[7613]: Action completed, notifying coordinator cookie=0x52602778, FID=[0x200000408:0x231:0x0], hp_flags=0 err=0
lhsmtool_posix[7613]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x200000408:0x231:0x0' ok (rc=0)
lhsmtool_posix[7610]: copytool fs=scratch archive#=3 item_count=1
lhsmtool_posix[7610]: waiting for message from kernel
lhsmtool_posix[7615]: '[0x200000408:0x232:0x0]' action ARCHIVE reclen 72, cookie=0x52602779
lhsmtool_posix[7615]: processing file 'd0.sanity-hsm/d15/f.sanity-hsm.15.4'
lhsmtool_posix[7615]: archiving '/lustre/scratch/.lustre/fid/0x200000408:0x232:0x0' to '/lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp'
lhsmtool_posix[7615]: saving stripe info of '/lustre/scratch/.lustre/fid/0x200000408:0x232:0x0' in /lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp.lov
lhsmtool_posix[7615]: going to copy data from '/lustre/scratch/.lustre/fid/0x200000408:0x232:0x0' to '/lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp'
lhsmtool_posix[7615]: Going to copy 2097152 bytes /lustre/scratch/.lustre/fid/0x200000408:0x232:0x0 -> /lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp

lhsmtool_posix[7614]: data archiving for '/lustre/scratch/.lustre/fid/0x200000408:0x233:0x0' to '/lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp' done
lhsmtool_posix[7614]: attr file for '/lustre/scratch/.lustre/fid/0x200000408:0x233:0x0' saved to archive '/lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp'
lhsmtool_posix[7614]: fsetxattr of 'trusted.hsm' on '/lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7614]: fsetxattr of 'trusted.link' on '/lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7614]: fsetxattr of 'trusted.lov' on '/lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7614]: fsetxattr of 'trusted.lma' on '/lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7614]: fsetxattr of 'lustre.lov' on '/lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7614]: xattr file for '/lustre/scratch/.lustre/fid/0x200000408:0x233:0x0' saved to archive '/lustre/archive/0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0_tmp'
lhsmtool_posix[7614]: symlink '/lustre/archive/shadow/d0.sanity-hsm/d15/f.sanity-hsm.15.5' to '../../../0233/0000/0408/0000/0002/0000/0x200000408:0x233:0x0' done
lhsmtool_posix[7614]: Action completed, notifying coordinator cookie=0x5260277a, FID=[0x200000408:0x233:0x0], hp_flags=0 err=0
lhsmtool_posix[7614]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x200000408:0x233:0x0' ok (rc=0)
lhsmtool_posix[7615]: data archiving for '/lustre/scratch/.lustre/fid/0x200000408:0x232:0x0' to '/lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp' done
lhsmtool_posix[7615]: attr file for '/lustre/scratch/.lustre/fid/0x200000408:0x232:0x0' saved to archive '/lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp'
lhsmtool_posix[7615]: fsetxattr of 'trusted.hsm' on '/lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7615]: fsetxattr of 'trusted.link' on '/lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7615]: fsetxattr of 'trusted.lov' on '/lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7615]: fsetxattr of 'trusted.lma' on '/lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7615]: fsetxattr of 'lustre.lov' on '/lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp' rc=0 (Success)
lhsmtool_posix[7615]: xattr file for '/lustre/scratch/.lustre/fid/0x200000408:0x232:0x0' saved to archive '/lustre/archive/0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0_tmp'
lhsmtool_posix[7615]: symlink '/lustre/archive/shadow/d0.sanity-hsm/d15/f.sanity-hsm.15.4' to '../../../0232/0000/0408/0000/0002/0000/0x200000408:0x232:0x0' done
lhsmtool_posix[7615]: Action completed, notifying coordinator cookie=0x52602779, FID=[0x200000408:0x232:0x0], hp_flags=0 err=0
lhsmtool_posix[7615]: llapi_hsm_action_end() on '/lustre/scratch/.lustre/fid/0x200000408:0x232:0x0' ok (rc=0)
exiting: Interrupt


 Comments   
Comment by Jinshan Xiong (Inactive) [ 21/Oct/13 ]

the FID in agent_actions didn't match the FID in copytool log. In agent_action it was [0x20000040b:0x74:0x0], but actually it should be [0x200000408:0x22f:0x0].

Probably the archive for the first file had already finished and [0x20000040b:0x74:0x0] was leftover information from previous tests.

In that case, we should only check the status of files we created in the test cases.

This is not a blocker.

Comment by Bruno Faccini (Inactive) [ 28/Oct/13 ]

I agree with Jinshan, this FID causing the time-out looks like an old+orphan one, like I describe in LU-4093.

So yes, a possible solution could be to only check the files/FIDs we create and use for this test, or like I suggest in LU-4093, to ensure no orphan or zombie request is left from previous tests.

Comment by Bruno Faccini (Inactive) [ 04/Nov/13 ]

Patch/change #8157 for LU-4093, if successful should avoid the error/problem for this ticket since no other/old zombie request than the one for the current sub-test should survive on copytool_cleanup between sub-tests.

If not sufficient, we may need to switch to a completion check for each/specific requests from this test, instead of global/wait_all_done().

Comment by Bruno Faccini (Inactive) [ 02/Dec/13 ]

Seems that my previous assumption was true finally and the number of sanity-hsm failures linked to this ticket has significantly reduced since 1st patch for LU-4093 has landed, let's wait a bit more now that 2nd/correction patch for LU-4093 also just landed.

Comment by Bruno Faccini (Inactive) [ 10/Dec/13 ]

No more related failures for master builds reported since Nov 18th. So as expected the 2 changes for LU-4093 made it.

Generated at Sat Feb 10 01:39:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.