HSM _not only_ small fixes and to do list goes here (LU-3647)

[LU-3791] sanity-hsm test_8: 'request on 0x200002341:0x2:0x0 is not SUCCEED' Created: 20/Aug/13  Updated: 12/Sep/13  Resolved: 12/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Technical task Priority: Blocker
Reporter: Maloo Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: HSM, revzfs

Issue Links:
Related
Rank (Obsolete): 9801

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite runs:
http://maloo.whamcloud.com/test_sets/21cdd37a-094d-11e3-b004-52540035b04c
http://maloo.whamcloud.com/test_sets/2ee8358a-0931-11e3-b004-52540035b04c
http://maloo.whamcloud.com/test_sets/25a97004-094c-11e3-a9b0-52540035b04c

The sub-test test_8 failed with the following error:

request on 0x200008101:0x2:0x0 is not SUCCEED

Info required for matching: sanity-hsm 8



 Comments   
Comment by Nathaniel Clark [ 20/Aug/13 ]

This looks a lot like most of the failing sanity-hsm tests on zfs: hsmtool doesn't seem to be functioning.

Comment by Jinshan Xiong (Inactive) [ 23/Aug/13 ]

Dropping the priority since the failures are mostly on zfs.

Comment by Keith Mannthey (Inactive) [ 04/Sep/13 ]

Ldiskfs sighting:

https://maloo.whamcloud.com/test_sets/074edea0-14e6-11e3-8ae7-52540035b04c

Test 8 plus lots of other sub-tests failed in the same way.

Comment by Bobbie Lind (Inactive) [ 05/Sep/13 ]

More failures here

https://maloo.whamcloud.com/test_sets/b32b57f2-1627-11e3-a84e-52540035b04c

Comment by Keith Mannthey (Inactive) [ 05/Sep/13 ]

There are lots of ldiskfs tests failing right now in Maloo. Moving back to blocker.

Comment by Andreas Dilger [ 05/Sep/13 ]

It looks like a bad patch landed between 2013-09-02 and 2013-09-04. The first patch that fails 44/90 tests in sanity-hsm is
http://review.whamcloud.com/7490, with parent commit eb2285474 (LU-3581 osc: Lustre returns EINTR from writes when SA_RESTART is set), so the culprit is possibly this parent or a patch that landed shortly before it. You can use a Maloo query to get the list of sanity-hsm failures, and then check which parents consistently pass and which ones cause failures.

Comment by Andreas Dilger [ 06/Sep/13 ]

I did a query in maloo to list all of the recent sanity-hsm tests:

Name: sanity-hsm
Test group: review
Run within: 1 week
Lustre branch: master

which has the following nasty URL that can't be shortened to drop the useless parts:

https://maloo.whamcloud.com/test_sets/query%26test_set[test_set_script_id]=10d5ab1c-78af-11e2-9928-52540035b04c&test_set[status]=&test_set[query_bugs]=&test_session[test_host]=&test_session[test_group]=review&test_session[user_id]=&test_session[query_date]=&test_session[query_recent_period]=604800&test_node[os_type_id]=&test_node[distribution_type_id]=&test_node[architecture_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=24a6947e-04a9-11e1-bb5f-52540025f9af&test_node_network[network_type_id]=&commit=Update+results

Using "git bisect good [hash]" and "git bisect bad [hash]" only goes so far when there is an intermittent failure. Instead, I just made a simple histogram using the edited output of "git log --pretty=oneline --abbrev-commit | head -60" to get a list of the recently committed patches. Then I started looking through the results of the Maloo query, clicking on the "gerrit:NNNN" link for each test (pass or fail) and checking the "Parent(s):" line for each patch to see what commit the patch was based on. If there is ambiguity about this, the Maloo test results have the actual commit hash of the patch that was run, which should match the hash next to the "Patch Set N" line.

This is what I got, marking the number of (g)ood and (B)ad runs on each line:

   B6   23c1979 LU-3811 hsm: handle file ownership and timestamps
        325ce4b LU-3711 mdt: default archive id is tunable
g  B5   536f1a6 LU-3500 mdt: use __u64 for sp_cr_flags
   B6   036f0c6 LU-3788 man: add "lctl set_param -P" to lctl.8 man page
g4 B14  1686463 New tag 2.4.92
        143d09c LU-2800 autoconf: remove LC_RW_TREE_LOCK test
        9699de5 LU-3841 lod: handle released defined layouts
   B2   0067260 LU-2800 autoconf: update LC_INODE_PERMISION_2ARGS test
        07e6e38 LU-2800 autoconf: remove LC_FS_RENAME_DOES_D_MOVE test
        b50f730 LU-3789 mgs: Add deprecation warning for "lctl conf_param"
   B2   eb22854 LU-3581 osc: Lustre returns EINTR from writes when SA_RESTART is set
        3b18568 LU-3549 llapi: add printf attribute to llapi_{printf,error}
        fa16465 LU-3497 build: Use alt. path for ZFS development headers
        1f63b38 LU-2800 autoconf: remove LC_SECURITY_PLUG test
        aa41bab LU-3768 tunefs: make tunefs.lustre work correctly for MGS
        5375158 LU-3787 test: Missing "$" on variable
g       7b28134 LU-2484 obd: add md_stats to MDC and LMV devices
        1c9d076 LU-2675 cleanup: define sparse annotations for libcfs
        50c7a76 LU-2253 tests: get proper free indoes
        5cb3841 LU-3720 test: use actual stripe count in sanity.sh 130d
        a5b4279 LU-3703 tests: skip getfattr part of sanity test 234
        aeb887a LU-3763 utils: set multipath devices recursively
        d8179d2 LU-3508 test: small fix for sanity.sh test_101c
        862cdaa LU-3097 build: fix 'no effect' errors
        842a632 LU-3647 hsm: Add support to drop all pages for ll_data_version
        8d8f071 LU-3561 tests: improve sanity-hsm.sh to support remote agent
g7      3f92a01 LU-1346 libcfs: cleanup cfs_curproc_xxx macros

So the bad patch is almost certainly between 3f92a01..1686463, and maybe between 7b28134..eb22854 though there isn't enough information to be certain (the one pass at 7b28134 may have been a fluke).

Comment by John Hammond [ 06/Sep/13 ]

I would like to rule out '8d8f071 LU-3561 tests: improve sanity-hsm.sh to support remote agent' first, as this switches the archive directory from a local /tmp/arc to a globally shared one (/home/cgearing/.autotest/shared_dir/2013-09-04/191904-70339907832920/arc1), which I assume is NFS-shared to the test VMs.

Comment by John Hammond [ 06/Sep/13 ]

Test change at http://review.whamcloud.com/7576.

Comment by Jinshan Xiong (Inactive) [ 06/Sep/13 ]

From what I have seen, the copy process was successful, and copying xattrs should be okay as well. So, judging from this snippet of ct_archive(), it should fail around:

        CT_TRACE("'%s' data archived to '%s' done\n", src, dst);

        /* attrs will remain on the MDS; no need to copy them, except possibly
          for disaster recovery */
        if (opt.o_copy_attrs) {
                rc = ct_copy_attr(src, dst, src_fd, dst_fd);
                if (rc < 0) {
                        CT_ERROR("'%s' attr copy failed to '%s' (%s)\n",
                                 src, dst, strerror(-rc));
                        rcf = rc;
                }
                CT_TRACE("'%s' attr file copied to archive '%s'\n",
                         src, dst);
        }

        /* xattrs will remain on the MDS; no need to copy them, except possibly
         for disaster recovery */
        if (opt.o_copy_xattrs) {
                rc = ct_copy_xattr(src, dst, src_fd, dst_fd, false);
                if (rc < 0) {
                        CT_ERROR("'%s' xattr copy failed to '%s' (%s)\n",
                                 src, dst, strerror(-rc));
                        rcf = rcf ? rcf : rc;
                }
                CT_ERROR("'%s' xattr file copied to archive '%s'\n",
                         src, dst);
        }

        if (rename_needed == true) {
                char     tmp_src[PATH_MAX];
                char     tmp_dst[PATH_MAX];

                /* atomically replace old archived file */
                ct_path_archive(src, sizeof(src), opt.o_hsm_root,
                                &hai->hai_fid);
                rc = rename(dst, src);
                if (rc < 0) {
                        CT_ERROR("'%s' renamed to '%s' failed (%s)\n", dst, src,
                                 strerror(errno));
                        rc = -errno;
                        goto fini_major;
                }
                /* rename lov file */
                snprintf(tmp_src, sizeof(tmp_src), "%s.lov", src);
                snprintf(tmp_dst, sizeof(tmp_dst), "%s.lov", dst);
                rc = rename(tmp_dst, tmp_src);
                if (rc < 0)
                        CT_ERROR("'%s' renamed to '%s' failed (%s)\n",
                                 tmp_dst, tmp_src, strerror(errno));
        }

        if (opt.o_shadow_tree) {
                /* Create a namespace of softlinks that shadows the original
                 * Lustre namespace.  This will only be current at
                 * time-of-archive (won't follow renames).
                 * WARNING: release won't kill these links; a manual
                 * cleanup of dead links would be required.
                 */

I suspect that it failed at copying attrs, given that the archive sits on an NFS share as John stated.

Comment by Jinshan Xiong (Inactive) [ 06/Sep/13 ]

I just did an experiment on rosso and it verified my guess, as follows:

[root@wtm-12vm2 0000]# ls -l
total 7172
-rw------- 1 nfsnobody nfsnobody 7340032 Sep  3 13:21 0x200002341:0x211:0x0_tmp
-rw------- 1 nfsnobody nfsnobody      56 Sep  3 13:21 0x200002341:0x211:0x0_tmp.lov
[root@wtm-12vm2 0000]# cp /etc/passwd .
[root@wtm-12vm2 0000]# chown root.root passwd
chown: changing ownership of `passwd': Operation not permitted
[root@wtm-12vm2 0000]# pwd
/home/cgearing/.autotest/shared_dir/2013-09-03/040347-70339907358780/arc1/0211/0000/2341/0000/0002/0000

I did the same thing on toro and it worked, which is why the tests only failed sometimes: if the test ran on rosso, it failed. Pretty nasty, huh?

Comment by Jinshan Xiong (Inactive) [ 06/Sep/13 ]

I think we can fix this issue by setting the correct permissions on the NFS share.

Comment by John Hammond [ 06/Sep/13 ]

But why are there no error messages in the copytool logs?

Comment by John Hammond [ 06/Sep/13 ]

Please see http://review.whamcloud.com/7581. This change passes the --no-attr and --no-xattr to the copytool in sanity-hsm.sh. Let's see how it does.

Comment by Jinshan Xiong (Inactive) [ 11/Sep/13 ]

I believe this issue will no longer exist after TEI-534 is fixed.

Comment by Aurelien Degremont (Inactive) [ 11/Sep/13 ]

Is it possible to know what is in TEI-534?

Comment by Jinshan Xiong (Inactive) [ 11/Sep/13 ]

Hi Aurelien,

To support multiple agents, we're using an NFS share so that all agents can access the same archive. To archive a Lustre file, lhsmtool_posix has to copy the file data, xattrs, and attrs to the corresponding file in the archive. The problem is that copying the attrs requires changing the file owner on the NFS share.

Rosso, the cluster that runs autotest, has root_squash enabled, so every attempt to change a file's owner to root fails. This causes the archive operation to fail, which is why we have seen so many failures on sanity-hsm recently.

TEI-534 is an internal task requesting that the administrators disable root_squash on the NFS server.
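For context, a hypothetical /etc/exports fragment (the path is illustrative; the actual server configuration is not shown in this ticket) contrasting the two settings:

```
# root_squash (the default) maps client root to nfsnobody, so the
# copytool's chown() back to root fails with EPERM:
/export/shared_dir  *(rw,sync,root_squash)

# What TEI-534 asks for: let root on the clients act as root:
/export/shared_dir  *(rw,sync,no_root_squash)
```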

Comment by Bobbie Lind (Inactive) [ 12/Sep/13 ]

I realize this ticket is in a "won't fix" resolved status, but I'm still hitting it even with the temporary workaround from TEI-534 in place: https://maloo.whamcloud.com/test_sets/db1ad8ea-170e-11e3-9d30-52540035b04c

Comment by John Hammond [ 12/Sep/13 ]

Bobbie, "won't fix" was probably not the best resolution here, since this issue was fixed by the configuration changes on rosso. The test session you linked above was run before that fix was applied, however.

Comment by Keith Mannthey (Inactive) [ 12/Sep/13 ]

Also, perhaps the test can be fixed up so that it detects this incompatibility and skips when local resources are not compliant? These tests will be run in all sorts of environments.

Generated at Sat Feb 10 01:36:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.