[LU-5891] sanity-hsm test_33: request on 0x400000401:0x2b:0x0 is not SUCCEED on mds1 Created: 10/Nov/14 Updated: 26/May/15 Resolved: 26/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 16465 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/2fed1b1e-66e4-11e4-9071-5254006e85c2. The sub-test test_33 failed with the following error: request on 0x400000401:0x2b:0x0 is not SUCCEED on mds1 Please provide additional information about the failure here. Info required for matching: sanity-hsm 33 |
| Comments |
| Comment by Jodi Levi (Inactive) [ 10/Nov/14 ] |
|
Bruno, |
| Comment by Bruno Faccini (Inactive) [ 10/Nov/14 ] |
|
The copytool log for this failure gives some insight about possible issues : 1415312717.472471 lhsmtool_posix[4757]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre 1415312717.690388 lhsmtool_posix[4758]: waiting for message from kernel 1415312724.867146 lhsmtool_posix[4758]: copytool fs=lustre archive#=2 item_count=1 1415312724.867375 lhsmtool_posix[4758]: waiting for message from kernel 1415312724.867438 lhsmtool_posix[4759]: '[0x400000401:0x2b:0x0]' action ARCHIVE reclen 72, cookie=0x545bf506 1415312724.869829 lhsmtool_posix[4759]: processing file 'd33.sanity-hsm/f33.sanity-hsm' 1415312724.911179 lhsmtool_posix[4759]: archiving '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' to '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' 1415312747.050750 lhsmtool_posix[4759]: saving stripe info of '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' in /home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp.lov 1415312748.396204 lhsmtool_posix[4759]: start copy of 39000000 bytes from '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' to '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' 1415312778.730187 lhsmtool_posix[4759]: %80 1415312778.734241 lhsmtool_posix[4759]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s 1415312785.944121 lhsmtool_posix[4759]: copied 39000000 bytes in 37.552927 seconds 1415312821.701157 lhsmtool_posix[4759]: data archiving for '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' to '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' done 1415312828.187621 lhsmtool_posix[4759]: attr file for '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' saved to archive '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' 1415312828.189313 lhsmtool_posix[4759]: fsetxattr of 'trusted.hsm' on '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' rc=-1 (Operation not supported) 1415312828.189370 lhsmtool_posix[4759]: fsetxattr of 'trusted.link' on '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' rc=-1 (Operation not supported) 1415312828.189413 lhsmtool_posix[4759]: fsetxattr of 'trusted.lov' on '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' rc=-1 (Operation not supported) 1415312828.189448 lhsmtool_posix[4759]: fsetxattr of 'trusted.lma' on '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' rc=-1 (Operation not supported) 1415312829.020096 lhsmtool_posix[4759]: fsetxattr of 'lustre.lov' on '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' rc=-1 (Operation not supported) 1415312829.020131 lhsmtool_posix[4759]: xattr file for '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' saved to archive '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' 1415312851.118358 lhsmtool_posix[4759]: symlink '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/shadow/d33.sanity-hsm/f33.sanity-hsm' to '../../002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0' done 1415312851.985183 lhsmtool_posix[4759]: Action completed, notifying coordinator cookie=0x545bf506, FID=[0x400000401:0x2b:0x0], hp_flags=0 err=0 1415312851.989089 lhsmtool_posix[4759]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' ok (rc=0) exiting: Interrupt 1) repository is shared/NFS. But it does not seem to really slow down the archive process since bandwidth control had to enter in action, and the full file copy only last 37.552927 seconds, which is ok for a 39MB file at 1MB/s rate. This seems to point more likely to some Client/Agent Lustre issues than on a pure NFS repository consequence (like |
| Comment by James Nunez (Inactive) [ 11/Nov/14 ] |
|
Looks like same issue here for master (review-dne-part-2): https://testing.hpdd.intel.com/test_sessions/413f3bba-66e2-11e4-9071-5254006e85c2 |
| Comment by Bruno Faccini (Inactive) [ 18/May/15 ] |
|
Having a look into the logs/traces of the recent failures attached to the ticket, I am back on my previous suspicions about too slow Lustre Client/Server communications, and I am again suspecting NFS ... 1431489947.385520 lhsmtool_posix[10052]: copied 39000000 bytes in 36.865279 seconds 1431489977.951284 lhsmtool_posix[10052]: data archiving for '/mnt/lustre/.lustre/fid/0x200000401:0x82:0x0' to '/home/autotest/.autotest/shared_dir/2015-05-12/161404-70081380872960/arc1/0082/0000/0401/0000/0002/0000/0x200000401:0x82:0x0_tmp' done when the only processing/operation occurring during that time is : static int ct_copy_data(struct hsm_copyaction_private *hcp, const char *src,
const char *dst, int src_fd, int dst_fd,
const struct hsm_action_item *hai, long hal_flags)
{
......................
CT_TRACE("copied "LPU64" bytes in %f seconds",
length, ct_now() - start_ct_now);
return rc;
}
..................
static int ct_archive(const struct hsm_action_item *hai, const long hal_flags)
{
...................
rc = ct_copy_data(hcp, src, dst, src_fd, dst_fd, hai, hal_flags);
if (rc < 0) {
CT_ERROR(rc, "data copy failed from '%s' to '%s'", src, dst);
goto fini_major;
}
rc = fsync(dst_fd);
if (rc < 0) {
rc = -errno;
CT_ERROR(rc, "cannot flush '%s' archive file '%s'", src, dst);
goto fini_major;
}
CT_TRACE("data archiving for '%s' to '%s' done", src, dst);
Thus, I think this ticket may be considered as a dup of |
| Comment by Peter Jones [ 26/May/15 ] |
|
ok. Let's mark as a duplicate and reopen if that theory is disproved |