[LU-5891] sanity-hsm test_33: request on 0x400000401:0x2b:0x0 is not SUCCEED on mds1 Created: 10/Nov/14  Updated: 26/May/15  Resolved: 26/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 16465

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/2fed1b1e-66e4-11e4-9071-5254006e85c2.

The sub-test test_33 failed with the following error:

request on 0x400000401:0x2b:0x0 is not SUCCEED on mds1

Please provide additional information about the failure here.

Info required for matching: sanity-hsm 33



 Comments   
Comment by Jodi Levi (Inactive) [ 10/Nov/14 ]

Bruno,
Can you please comment on this one?
Thank you!

Comment by Bruno Faccini (Inactive) [ 10/Nov/14 ]

The copytool log for this failure gives some insight about possible issues :

1415312717.472471 lhsmtool_posix[4757]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre
1415312717.690388 lhsmtool_posix[4758]: waiting for message from kernel
1415312724.867146 lhsmtool_posix[4758]: copytool fs=lustre archive#=2 item_count=1
1415312724.867375 lhsmtool_posix[4758]: waiting for message from kernel
1415312724.867438 lhsmtool_posix[4759]: '[0x400000401:0x2b:0x0]' action ARCHIVE reclen 72, cookie=0x545bf506
1415312724.869829 lhsmtool_posix[4759]: processing file 'd33.sanity-hsm/f33.sanity-hsm'
1415312724.911179 lhsmtool_posix[4759]: archiving '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' to '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp'
1415312747.050750 lhsmtool_posix[4759]: saving stripe info of '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' in /home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp.lov
1415312748.396204 lhsmtool_posix[4759]: start copy of 39000000 bytes from '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' to '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp'
1415312778.730187 lhsmtool_posix[4759]: %80 
1415312778.734241 lhsmtool_posix[4759]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1415312785.944121 lhsmtool_posix[4759]: copied 39000000 bytes in 37.552927 seconds
1415312821.701157 lhsmtool_posix[4759]: data archiving for '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' to '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' done
1415312828.187621 lhsmtool_posix[4759]: attr file for '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' saved to archive '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp'
1415312828.189313 lhsmtool_posix[4759]: fsetxattr of 'trusted.hsm' on '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' rc=-1 (Operation not supported)
1415312828.189370 lhsmtool_posix[4759]: fsetxattr of 'trusted.link' on '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' rc=-1 (Operation not supported)
1415312828.189413 lhsmtool_posix[4759]: fsetxattr of 'trusted.lov' on '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' rc=-1 (Operation not supported)
1415312828.189448 lhsmtool_posix[4759]: fsetxattr of 'trusted.lma' on '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' rc=-1 (Operation not supported)
1415312829.020096 lhsmtool_posix[4759]: fsetxattr of 'lustre.lov' on '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp' rc=-1 (Operation not supported)
1415312829.020131 lhsmtool_posix[4759]: xattr file for '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' saved to archive '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0_tmp'
1415312851.118358 lhsmtool_posix[4759]: symlink '/home/autotest2/.autotest/shared_dir/2014-11-06/094400-70019400985180/arc1/shadow/d33.sanity-hsm/f33.sanity-hsm' to '../../002b/0000/0401/0000/0004/0000/0x400000401:0x2b:0x0' done
1415312851.985183 lhsmtool_posix[4759]: Action completed, notifying coordinator cookie=0x545bf506, FID=[0x400000401:0x2b:0x0], hp_flags=0 err=0
1415312851.989089 lhsmtool_posix[4759]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x400000401:0x2b:0x0' ok (rc=0)
exiting: Interrupt

1) repository is shared/NFS. But it does not seem to really slow down the archive process since bandwidth control had to enter in action, and the full file copy only last 37.552927 seconds, which is ok for a 39MB file at 1MB/s rate.
2) it takes about 36s after the end of data copy and before the "data archiving for ... done" message is printed ...
3) it takes about 21s after the end of xattrs save and before the "symlink ... done" message is printed.

This seems to point more likely to some Client/Agent Lustre issues than on a pure NFS repository consequence (like LU-3939).

Comment by James Nunez (Inactive) [ 11/Nov/14 ]

Looks like same issue here for master (review-dne-part-2): https://testing.hpdd.intel.com/test_sessions/413f3bba-66e2-11e4-9071-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 18/May/15 ]

Having a look into the logs/traces of the recent failures attached to the ticket, I am back on my previous suspicions about too slow Lustre Client/Server communications, and I am again suspecting NFS ...
This is because in recent failures, on copytool/aagent side, we can see that between end of ct_copy_data() routine "copied xxxxx bytes ..." msg and the next "data archiving for ..." msg in ct_archive() routine, there is a +30s delay :

1431489947.385520 lhsmtool_posix[10052]: copied 39000000 bytes in 36.865279 seconds
1431489977.951284 lhsmtool_posix[10052]: data archiving for '/mnt/lustre/.lustre/fid/0x200000401:0x82:0x0' to '/home/autotest/.autotest/shared_dir/2015-05-12/161404-70081380872960/arc1/0082/0000/0401/0000/0002/0000/0x200000401:0x82:0x0_tmp' done

when the only processing/operation occurring during that time is :

static int ct_copy_data(struct hsm_copyaction_private *hcp, const char *src,
                        const char *dst, int src_fd, int dst_fd,
                        const struct hsm_action_item *hai, long hal_flags)
{

......................

        CT_TRACE("copied "LPU64" bytes in %f seconds",
                 length, ct_now() - start_ct_now);

        return rc;
}

..................

static int ct_archive(const struct hsm_action_item *hai, const long hal_flags)
{

...................

        rc = ct_copy_data(hcp, src, dst, src_fd, dst_fd, hai, hal_flags);
        if (rc < 0) {
                CT_ERROR(rc, "data copy failed from '%s' to '%s'", src, dst);
                goto fini_major;
        }

        rc = fsync(dst_fd);
        if (rc < 0) {
                rc = -errno;
                CT_ERROR(rc, "cannot flush '%s' archive file '%s'", src, dst);
                goto fini_major;
        }

        CT_TRACE("data archiving for '%s' to '%s' done", src, dst);

Thus, I think this ticket may be considered as a dup of LU-6559.
Will see if this is confirmed now that patch for LU-6559 has just landed on master, and will be back-ported.

Comment by Peter Jones [ 26/May/15 ]

ok. Let's mark as a duplicate and reopen if that theory is disproved

Generated at Sat Feb 10 01:55:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.