[LU-5474] Test failure sanity-hsm test_90: requests did not complete Created: 11/Aug/14 Updated: 03/Apr/16 Resolved: 05/Dec/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | James Nunez (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 22pl | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 15251 | ||||||||
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com> This issue relates to the following test suite run: The sub-test test_90 failed with the following error:
Info required for matching: sanity-hsm 90 |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 16/Aug/14 ] |
|
Looks like sanity-hsm/test_90 is also a good candidate to also use a local (ie, non-NFS) HSM root/repository, like sanity-hsm/test_40 for |
| Comment by Bruno Faccini (Inactive) [ 18/Aug/14 ] |
|
In fact I had a more indeed look in one of these sanity-hsm/test_90 latest failures, and even if this sub-test is definitely a good candidate for the same patch than One of the latest sanity-hsm/test_90 failure at https://testing.hpdd.intel.com/sub_tests/381f4220-24ff-11e4-bd77-5254006e85c2 shows that 2 files in the set to be archived never completed on time : CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
Changed after 70s: from 'lrh=[type=10680000 len=136 idx=1/1866] fid=[0x200000401:0x1db:0x0] dfid=[0x200000401:0x1db:0x0] compound/cookie=0x53eecd80/0x53eecd97 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=1/1885] fid=[0x200000401:0x1ee:0x0] dfid=[0x200000401:0x1ee:0x0] compound/cookie=0x53eecd80/0x53eecdaa action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=1/1890] fid=[0x200000401:0x1f3:0x0] dfid=[0x200000401:0x1f3:0x0] compound/cookie=0x53eecd80/0x53eecdaf action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=1/1894] fid=[0x200000401:0x1f6:0x0] dfid=[0x200000401:0x1f6:0x0] compound/cookie=0x53eecd80/0x53eecdb2 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]' to 'lrh=[type=10680000 len=136 idx=1/1866] fid=[0x200000401:0x1db:0x0] dfid=[0x200000401:0x1db:0x0] compound/cookie=0x53eecd80/0x53eecd97 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=1/1894] fid=[0x200000401:0x1f6:0x0] dfid=[0x200000401:0x1f6:0x0] compound/cookie=0x53eecd80/0x53eecdb2 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]'
Waiting 30 secs for update
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
Waiting 20 secs for update
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
Waiting 10 secs for update
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: onyx-51vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
Update not seen after 100s: wanted '' got 'lrh=[type=10680000 len=136 idx=1/1866] fid=[0x200000401:0x1db:0x0] dfid=[0x200000401:0x1db:0x0] compound/cookie=0x53eecd80/0x53eecd97 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=1/1894] fid=[0x200000401:0x1f6:0x0] dfid=[0x200000401:0x1f6:0x0] compound/cookie=0x53eecd80/0x53eecdb2 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]'
sanity-hsm test_90: @@@@@@ FAIL: requests did not complete
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4526:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4557:error()
= /usr/lib64/lustre/tests/sanity-hsm.sh:678:wait_all_done()
= /usr/lib64/lustre/tests/sanity-hsm.sh:2990:test_90()
= /usr/lib64/lustre/tests/test-framework.sh:4819:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4854:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4706:run_test()
= /usr/lib64/lustre/tests/sanity-hsm.sh:2998:main()
Dumping lctl log to /logdir/test_logs/2014-08-15/lustre-reviews-el6-x86_64--review-zfs--2_4_1__25865__-70345348471240-053131/sanity-hsm.test_90.*.1408159245.log
CMD: onyx-51vm3,onyx-51vm4,onyx-51vm5,onyx-51vm6.onyx.hpdd.intel.com /usr/sbin/lctl dk > /logdir/test_logs/2014-08-15/lustre-reviews-el6-x86_64--review-zfs--2_4_1__25865__-70345348471240-053131/sanity-hsm.test_90.debug_log.\$(hostname -s).1408159245.log;
dmesg > /logdir/test_logs/2014-08-15/lustre-reviews-el6-x86_64--review-zfs--2_4_1__25865__-70345348471240-053131/sanity-hsm.test_90.dmesg.\$(hostname -s).1408159245.log
Looks like both files archiving stopped at the "symlink" step : bfaccini-mac01:~ bfaccini$ grep 0x200000401:0x1db:0x0 Downloads//show_text.txt lhsmtool_posix[31299]: '[0x200000401:0x1db:0x0]' action ARCHIVE reclen 72, cookie=0x53eecd97 lhsmtool_posix[31299]: archiving '/mnt/lustre/.lustre/fid/0x200000401:0x1db:0x0' to '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' lhsmtool_posix[31299]: saving stripe info of '/mnt/lustre/.lustre/fid/0x200000401:0x1db:0x0' in /shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp.lov lhsmtool_posix[31299]: going to copy data from '/mnt/lustre/.lustre/fid/0x200000401:0x1db:0x0' to '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' lhsmtool_posix[31299]: data archiving for '/mnt/lustre/.lustre/fid/0x200000401:0x1db:0x0' to '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' done lhsmtool_posix[31299]: attr file for '/mnt/lustre/.lustre/fid/0x200000401:0x1db:0x0' saved to archive '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' lhsmtool_posix[31299]: fsetxattr of 'trusted.hsm' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31299]: fsetxattr of 'trusted.version' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31299]: fsetxattr of 'trusted.link' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31299]: fsetxattr of 'trusted.lov' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31299]: fsetxattr of 'trusted.lma' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31299]: fsetxattr of 'lustre.lov' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31299]: xattr file for '/mnt/lustre/.lustre/fid/0x200000401:0x1db:0x0' saved to archive '/shared_test/2014-08-15/053137-70345348471240/arc1/01db/0000/0401/0000/0002/0000/0x200000401:0x1db:0x0_tmp' bfaccini-mac01:~ bfaccini$ bfaccini-mac01:~ bfaccini$ grep 0x200000401:0x1f6:0x0 Downloads//show_text.txt lhsmtool_posix[31326]: '[0x200000401:0x1f6:0x0]' action ARCHIVE reclen 72, cookie=0x53eecdb2 lhsmtool_posix[31326]: archiving '/mnt/lustre/.lustre/fid/0x200000401:0x1f6:0x0' to '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' lhsmtool_posix[31326]: saving stripe info of '/mnt/lustre/.lustre/fid/0x200000401:0x1f6:0x0' in /shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp.lov lhsmtool_posix[31326]: going to copy data from '/mnt/lustre/.lustre/fid/0x200000401:0x1f6:0x0' to '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' lhsmtool_posix[31326]: data archiving for '/mnt/lustre/.lustre/fid/0x200000401:0x1f6:0x0' to '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' done lhsmtool_posix[31326]: attr file for '/mnt/lustre/.lustre/fid/0x200000401:0x1f6:0x0' saved to archive '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' lhsmtool_posix[31326]: fsetxattr of 'trusted.hsm' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31326]: fsetxattr of 'trusted.version' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31326]: fsetxattr of 'trusted.link' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31326]: fsetxattr of 'trusted.lov' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31326]: fsetxattr of 'trusted.lma' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31326]: fsetxattr of 'lustre.lov' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31326]: xattr file for '/mnt/lustre/.lustre/fid/0x200000401:0x1f6:0x0' saved to archive '/shared_test/2014-08-15/053137-70345348471240/arc1/01f6/0000/0401/0000/0002/0000/0x200000401:0x1f6:0x0_tmp' bfaccini-mac01:~ bfaccini$ vs for a completed file/archiving : bfaccini-mac01:~ bfaccini$ grep 0x200000401:0x1f0:0x0 Downloads//show_text.txt lhsmtool_posix[31320]: '[0x200000401:0x1f0:0x0]' action ARCHIVE reclen 72, cookie=0x53eecdac lhsmtool_posix[31320]: archiving '/mnt/lustre/.lustre/fid/0x200000401:0x1f0:0x0' to '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' lhsmtool_posix[31320]: saving stripe info of '/mnt/lustre/.lustre/fid/0x200000401:0x1f0:0x0' in /shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp.lov lhsmtool_posix[31320]: going to copy data from '/mnt/lustre/.lustre/fid/0x200000401:0x1f0:0x0' to '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' lhsmtool_posix[31320]: data archiving for '/mnt/lustre/.lustre/fid/0x200000401:0x1f0:0x0' to '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' done lhsmtool_posix[31320]: attr file for '/mnt/lustre/.lustre/fid/0x200000401:0x1f0:0x0' saved to archive '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' lhsmtool_posix[31320]: fsetxattr of 'trusted.hsm' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31320]: fsetxattr of 'trusted.version' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31320]: fsetxattr of 'trusted.link' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31320]: fsetxattr of 'trusted.lov' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31320]: fsetxattr of 'trusted.lma' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31320]: fsetxattr of 'lustre.lov' on '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' rc=-1 (Operation not supported) lhsmtool_posix[31320]: xattr file for '/mnt/lustre/.lustre/fid/0x200000401:0x1f0:0x0' saved to archive '/shared_test/2014-08-15/053137-70345348471240/arc1/01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0_tmp' lhsmtool_posix[31320]: symlink '/shared_test/2014-08-15/053137-70345348471240/arc1/shadow/d90.sanity-hsm/f90.sanity-hsm.45' to '../../01f0/0000/0401/0000/0002/0000/0x200000401:0x1f0:0x0' done lhsmtool_posix[31320]: Action completed, notifying coordinator cookie=0x53eecdac, FID=[0x200000401:0x1f0:0x0], hp_flags=0 err=0 lhsmtool_posix[31320]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000401:0x1f0:0x0' ok (rc=0) bfaccini-mac01:~ bfaccini$ even if this can also be a NFS issue that could be fixed with a similar patch than |
| Comment by nasf (Inactive) [ 20/Sep/14 ] |
|
Any progress on this ticket? I hit this failure some times recently: |
| Comment by Peter Jones [ 23/Sep/14 ] |
|
James Could you please look into this ticket? Thanks Peter |
| Comment by Jodi Levi (Inactive) [ 24/Sep/14 ] |
|
TEI-2702 is disabling this test. |
| Comment by John Hammond [ 24/Sep/14 ] |
|
Looking at the results from http://review.whamcloud.com/#/c/12021/ I see that none of the sub tests we run in sanity-hsm require a shared archive. OTOH as the same results show there are some places where we use make_small() but should probably use make_large_for_cancel(). Or the file size used in make small should be increased beyond 2MB. (Or deterministic synchronization should be used.) |
| Comment by Andreas Dilger [ 30/Sep/14 ] |
|
The first and most important action is to submit a patch to disable this test via ALWAYS_EXCEPT, and get this landed so that it can be re-enabled at the TEI level again so that it can be re-enabled in the patch that is fixing the problem. |
| Comment by James Nunez (Inactive) [ 30/Sep/14 ] |
|
Patch to disable sanity-hsm test 90: http://review.whamcloud.com/12136 |
| Comment by Jodi Levi (Inactive) [ 30/Sep/14 ] |
|
Ok, once Change, 12136 lands, I will get TEI-2705 completed so that the test will be enabled at the TEI level. |
| Comment by Andreas Dilger [ 04/Nov/14 ] |
|
It looks like the sanity-hsm test_90 exception was removed in the autotest framework, so it is possible to re-enable this test along with a patch to fix it. |
| Comment by James Nunez (Inactive) [ 04/Nov/14 ] |
|
Test patch at http://review.whamcloud.com/#/c/12069/ . This patch is to collect data on sanity-hsm test 90 failures. Looking over previous logs does not point to a problem in the test nor in the Lustre/HSM code. It just looks like archiving 51 files sometimes takes longer than 100 seconds. The patch tests if allowing for more time allows the files to be archived. |
| Comment by Andreas Dilger [ 07/Nov/14 ] |
|
I recall seeing a patch to move the HSM archive into $TMP instead of the shared directory. That might speed up the testing, as long as the archive doesn't need to be too large. |
| Comment by Bruno Faccini (Inactive) [ 08/Nov/14 ] |
|
Andreas, yes this is tracked in |
| Comment by nasf (Inactive) [ 10/Nov/14 ] |
|
Another failure instance on b2_5: |
| Comment by Andreas Dilger [ 13/Nov/14 ] |
|
Bruno, can you please make a patch to sanity-hsm test_90() to use $TMP for the HSM Archive, as was done with test_40()? Is there any reason not to make all HSM tests use $TMP for the archive by default, and only make some tests use the NFS shared directory if needed (very large archive, or HSM failover testing)? That would speed up all the HSM tests I think, unless it filled up $TMP and caused test failures. |
| Comment by James Nunez (Inactive) [ 13/Nov/14 ] |
|
Andreas, |
| Comment by Gerrit Updater [ 17/Nov/14 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/12069 |
| Comment by James Nunez (Inactive) [ 18/Nov/14 ] |
|
I've updated sanity-hsm test 90 to use a local disk for archive as was implemented in test 40. For the last two runs with this change, test 90 has failed to archive any files from the file list; https://testing.hpdd.intel.com/test_sessions/5ac85a16-6ee0-11e4-81e7-5254006e85c2 and https://testing.hpdd.intel.com/test_sets/616b778e-6b77-11e4-b1b4-5254006e85c2 . For test 90, it seems that the archive number specified to the copytool is not the same number as the archive requests. The archive number specified in the call to the copytool is #2 and the archive requests are looking for archive #1. From the client test log: CMD: shadow-3vm9 lhsmtool_posix --daemon --hsm-root /tmp/d90.sanity-hsm --archive 2 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-11-17/lustre-reviews-el6-x86_64--review-dne-part-2--2_9_1__28407__-70078147694660-145306/sanity-hsm.test_90.copytool2_log.shadow-3vm9.log 2>&1 ... Changed after 8s: from 'lrh=[type=10680000 len=136 idx=1/1846] fid=[0x400000401:0x1c4:0x0] dfid=[0x400000401:0x1c4:0x0] compound/cookie=0x546ab2b6/0x546ab2b6 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[] lrh=[type=10680000 len=136 idx=1/1847] fid=[0x400000401:0x1c5:0x0] dfid=[0x400000401:0x1c5:0x0] compound/cookie=0x546ab2b6/0x546ab2b7 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[] On the MDS, it complains that it can't find the agent, which is true, archive #2 was started by the copytool: LustreError: 14430:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) lustre-MDT0000: Cannot find agent for archive 1: rc = -11 Yet, test 40 uses essentially the same calls to start the copytool and the archive numbers for this test are consistent. Test 90 archives from a file list and test 40 does not. |
| Comment by Gerrit Updater [ 05/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12069/ |
| Comment by Peter Jones [ 05/Dec/14 ] |
|
Landed for 2.7 |
| Comment by Gerrit Updater [ 05/Dec/14 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12963 |
| Comment by Jian Yu [ 14/Dec/14 ] |
|
The failure occurred frequently recently on Lustre b2_5 branch: |
| Comment by Gerrit Updater [ 27/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12963/ |