Details

    • Technical task
    • Resolution: Fixed
    • Blocker
    • Lustre 2.5.0
    • Lustre 2.5.0
    • 9801

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      http://maloo.whamcloud.com/test_sets/21cdd37a-094d-11e3-b004-52540035b04c
      http://maloo.whamcloud.com/test_sets/2ee8358a-0931-11e3-b004-52540035b04c
      http://maloo.whamcloud.com/test_sets/25a97004-094c-11e3-a9b0-52540035b04c

      The sub-test test_8 failed with the following error:

      request on 0x200008101:0x2:0x0 is not SUCCEED

      Info required for matching: sanity-hsm 8


        Activity

          [LU-3791] sanity-hsm test_8: 'request on 0x200002341:0x2:0x0 is not SUCCEED'

          I just did an experiment on rosso and it verified my guess, as follows:

          [root@wtm-12vm2 0000]# ls -l
          total 7172
          -rw------- 1 nfsnobody nfsnobody 7340032 Sep  3 13:21 0x200002341:0x211:0x0_tmp
          -rw------- 1 nfsnobody nfsnobody      56 Sep  3 13:21 0x200002341:0x211:0x0_tmp.lov
          [root@wtm-12vm2 0000]# cp /etc/passwd .
          [root@wtm-12vm2 0000]# chown root.root passwd
          chown: changing ownership of `passwd': Operation not permitted
          [root@wtm-12vm2 0000]# pwd
          /home/cgearing/.autotest/shared_dir/2013-09-03/040347-70339907358780/arc1/0211/0000/2341/0000/0002/0000
          

          I did the same thing on toro and it worked, so this is why the tests only failed sometimes: if the test was running on rosso, it failed. Pretty nasty, huh?

          jay Jinshan Xiong (Inactive) added the comment above.
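          The manual check above (copy a file into the archive directory, then try to chown it) can also be done programmatically. The following is a minimal sketch, assuming a POSIX system; `probe_chown` is a hypothetical helper and is not part of the copytool. The `nfsnobody` ownership in the listing is consistent with root squashing on the NFS export, though the thread does not confirm the export options.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a scratch file in `dir` and try to change its
 * ownership to uid:gid. Returns 0 if the chown works, or a negative errno
 * (for example -EPERM, as seen on rosso) if it does not.
 */
int probe_chown(const char *dir, uid_t uid, gid_t gid)
{
	char path[4096];
	int fd, n, rc = 0;

	assert(dir != NULL);

	n = snprintf(path, sizeof(path), "%s/chown_probe.XXXXXX", dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	/* mkstemp() fills in the XXXXXX suffix and opens the new file */
	fd = mkstemp(path);
	if (fd < 0)
		return -errno;

	/* save the error before close()/unlink() can overwrite errno */
	if (fchown(fd, uid, gid) < 0)
		rc = -errno;

	close(fd);
	unlink(path);
	return rc;
}
```

          Calling `probe_chown(archive_dir, 0, 0)` as root on each test node would be expected to return `-EPERM` on a node that behaves like rosso and 0 on one that behaves like toro, which would let a test skip or adapt instead of failing intermittently.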

          From what I have seen, the data copy was successful, and copying xattrs should be okay as well. So, going by this snippet of ct_archive(), it should fail around:

                  CT_TRACE("'%s' data archived to '%s' done\n", src, dst);
          
                  /* attrs will remain on the MDS; no need to copy them, except possibly
                    for disaster recovery */
                  if (opt.o_copy_attrs) {
                          rc = ct_copy_attr(src, dst, src_fd, dst_fd);
                          if (rc < 0) {
                                  CT_ERROR("'%s' attr copy failed to '%s' (%s)\n",
                                           src, dst, strerror(-rc));
                                  rcf = rc;
                          }
                          CT_TRACE("'%s' attr file copied to archive '%s'\n",
                                   src, dst);
                  }
          
                  /* xattrs will remain on the MDS; no need to copy them, except possibly
                   for disaster recovery */
                  if (opt.o_copy_xattrs) {
                          rc = ct_copy_xattr(src, dst, src_fd, dst_fd, false);
                          if (rc < 0) {
                                  CT_ERROR("'%s' xattr copy failed to '%s' (%s)\n",
                                           src, dst, strerror(-rc));
                                  rcf = rcf ? rcf : rc;
                          }
                          CT_ERROR("'%s' xattr file copied to archive '%s'\n",
                                   src, dst);
                  }
          
                  if (rename_needed == true) {
                          char     tmp_src[PATH_MAX];
                          char     tmp_dst[PATH_MAX];
          
                          /* atomically replace old archived file */
                          ct_path_archive(src, sizeof(src), opt.o_hsm_root,
                                          &hai->hai_fid);
                          rc = rename(dst, src);
                          if (rc < 0) {
                                  CT_ERROR("'%s' renamed to '%s' failed (%s)\n", dst, src,
                                           strerror(errno));
                                  rc = -errno;
                                  goto fini_major;
                          }
                          /* rename lov file */
                          snprintf(tmp_src, sizeof(tmp_src), "%s.lov", src);
                          snprintf(tmp_dst, sizeof(tmp_dst), "%s.lov", dst);
                          rc = rename(tmp_dst, tmp_src);
                          if (rc < 0)
                                  CT_ERROR("'%s' renamed to '%s' failed (%s)\n",
                                           tmp_dst, tmp_src, strerror(errno));
                  }
          
                  if (opt.o_shadow_tree) {
                          /* Create a namespace of softlinks that shadows the original
                           * Lustre namespace.  This will only be current at
                           * time-of-archive (won't follow renames).
                           * WARNING: release won't kill these links; a manual
                           * cleanup of dead links would be required.
                           */
          

          I suspect that it failed at copying attrs when the archive was on an NFS share, as John stated.

          jay Jinshan Xiong (Inactive) added the comment above.
          jhammond John Hammond added a comment - Test change at http://review.whamcloud.com/7576 .
          jhammond John Hammond added a comment -

          I would like to rule out '8d8f071 LU-3561 tests: improve sanity-hsm.sh to support remote agent' first, as this switches the archive directory from the local /tmp/arc to a globally shared directory (/home/cgearing/.autotest/shared_dir/2013-09-04/191904-70339907832920/arc1), which I assume is NFS-shared to the test VMs.

          adilger Andreas Dilger added a comment - - edited

          I did a query in maloo to list all of the recent sanity-hsm tests:

          Name: sanity-hsm
          Test group: review
          Run within: 1 week
          Lustre branch: master
          

          which has the following nasty URL that can't be shortened to drop the useless parts:

          https://maloo.whamcloud.com/test_sets/query%26test_set[test_set_script_id]=10d5ab1c-78af-11e2-9928-52540035b04c&test_set[status]=&test_set[query_bugs]=&test_session[test_host]=&test_session[test_group]=review&test_session[user_id]=&test_session[query_date]=&test_session[query_recent_period]=604800&test_node[os_type_id]=&test_node[distribution_type_id]=&test_node[architecture_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=24a6947e-04a9-11e1-bb5f-52540025f9af&test_node_network[network_type_id]=&commit=Update+results

          Using "git bisect good [hash]" and "git bisect bad [hash]" only goes so far when there is an intermittent failure. Instead, I just made a simple histogram using the edited output of "git log --pretty=oneline --abbrev-commit | head -60" to get a list of the recently committed patches. Then I started looking through the results of the maloo query, clicking on the "gerrit:NNNN" link for each test (pass or fail) and checking the "Parent(s):" line for each patch to see what commit the patch was based on. If there is ambiguity about this, the Maloo test results have the actual commit hash of the patch that was run, which should match the hash next to the "Patch Set N" line.

          This is what I got, marking the number of (g)ood and (B)ad runs on each line:

             B6   23c1979 LU-3811 hsm: handle file ownership and timestamps
                  325ce4b LU-3711 mdt: default archive id is tunable
          g  B5   536f1a6 LU-3500 mdt: use __u64 for sp_cr_flags
             B6   036f0c6 LU-3788 man: add "lctl set_param -P" to lctl.8 man page
          g4 B14  1686463 New tag 2.4.92
                  143d09c LU-2800 autoconf: remove LC_RW_TREE_LOCK test
                  9699de5 LU-3841 lod: handle released defined layouts
             B2   0067260 LU-2800 autoconf: update LC_INODE_PERMISION_2ARGS test
                  07e6e38 LU-2800 autoconf: remove LC_FS_RENAME_DOES_D_MOVE test
                  b50f730 LU-3789 mgs: Add deprecation warning for "lctl conf_param"
             B2   eb22854 LU-3581 osc: Lustre returns EINTR from writes when SA_RESTART is set
                  3b18568 LU-3549 llapi: add printf attribute to llapi_{printf,error}
                  fa16465 LU-3497 build: Use alt. path for ZFS development headers
                  1f63b38 LU-2800 autoconf: remove LC_SECURITY_PLUG test
                  aa41bab LU-3768 tunefs: make tunefs.lustre work correctly for MGS
                  5375158 LU-3787 test: Missing "$" on variable
          g       7b28134 LU-2484 obd: add md_stats to MDC and LMV devices
                  1c9d076 LU-2675 cleanup: define sparse annotations for libcfs
                  50c7a76 LU-2253 tests: get proper free indoes
                  5cb3841 LU-3720 test: use actual stripe count in sanity.sh 130d
                  a5b4279 LU-3703 tests: skip getfattr part of sanity test 234
                  aeb887a LU-3763 utils: set multipath devices recursively
                  d8179d2 LU-3508 test: small fix for sanity.sh test_101c
                  862cdaa LU-3097 build: fix 'no effect' errors
                  842a632 LU-3647 hsm: Add support to drop all pages for ll_data_version
                  8d8f071 LU-3561 tests: improve sanity-hsm.sh to support remote agent
          g7      3f92a01 LU-1346 libcfs: cleanup cfs_curproc_xxx macros
          

          So the bad patch is almost certainly between 3f92a01..1686463, and maybe between 7b28134..eb22854 though there isn't enough information to be certain (the one pass at 7b28134 may have been a fluke).
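          The reasoning behind those two ranges can be sketched in code. This is only an illustration, assuming that a bare "g" in the histogram means one good run; the names `commit_runs`, `tested`, and `suspect_range` are hypothetical. The upper bound is the oldest commit with at least `min_runs` failures, and the lower bound is the newest older commit with at least `min_runs` passes and no failures.

```c
#include <assert.h>
#include <stddef.h>

/* Pass/fail tally for one parent commit, taken from the histogram. */
struct commit_runs {
	const char *hash;
	int good;
	int bad;
};

/* Only commits with at least one recorded run, newest first. */
const struct commit_runs tested[] = {
	{ "23c1979", 0, 6 },
	{ "536f1a6", 1, 5 },
	{ "036f0c6", 0, 6 },
	{ "1686463", 4, 14 },
	{ "0067260", 0, 2 },
	{ "eb22854", 0, 2 },
	{ "7b28134", 1, 0 },
	{ "3f92a01", 7, 0 },
};
#define N_TESTED (sizeof(tested) / sizeof(tested[0]))

/*
 * Find the range (good_idx, bad_idx] that should contain the bad patch.
 * Returns 0 on success, -1 if no commit meets the min_runs threshold.
 */
int suspect_range(const struct commit_runs *c, size_t n, int min_runs,
		  size_t *good_idx, size_t *bad_idx)
{
	size_t i, bad = n, good = n;

	assert(c != NULL && good_idx != NULL && bad_idx != NULL);

	/* oldest commit (largest index) with enough failures */
	for (i = 0; i < n; i++)
		if (c[i].bad >= min_runs)
			bad = i;
	if (bad == n)
		return -1;

	/* newest commit older than that with enough passes and no failures */
	for (i = bad + 1; i < n; i++) {
		if (c[i].good >= min_runs && c[i].bad == 0) {
			good = i;
			break;
		}
	}
	if (good == n)
		return -1;

	*good_idx = good;
	*bad_idx = bad;
	return 0;
}
```

          With `min_runs` set to 1 this selects 7b28134..eb22854, and with `min_runs` set to 4 it selects 3f92a01..1686463. A higher threshold trades a wider range for less sensitivity to a single fluke pass or failure.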


          It looks like a bad patch landed between 2013-09-02 and 2013-09-04. The first patch that fails 44/90 tests in sanity-hsm is
          http://review.whamcloud.com/7490 with parent commit eb2285474 (LU-3581 osc: Lustre returns EINTR from writes when SA_RESTART is set), so it is possibly this parent or a patch that landed shortly before this. You can use a Maloo query to get the list of sanity-hsm failures, and then check which parents are consistently passing, and which ones are causing failures.

          adilger Andreas Dilger added the comment above.

          There are lots of ldiskfs test failures right now in maloo. Moved back to blocker.

          keith Keith Mannthey (Inactive) added the comment above.
          bobbielind Bobbie Lind (Inactive) added a comment - More failures here https://maloo.whamcloud.com/test_sets/b32b57f2-1627-11e3-a84e-52540035b04c

          Ldiskfs sighting:

          https://maloo.whamcloud.com/test_sets/074edea0-14e6-11e3-8ae7-52540035b04c

          Test 8 and lots of other subtests failed in the same way.

          keith Keith Mannthey (Inactive) added the comment above.

          Dropping the priority since the failures are mostly on zfs.

          jay Jinshan Xiong (Inactive) added the comment above.

          This looks a lot like most of the failing tests for zfs in sanity-hsm. hsmtool doesn't seem to be functioning.

          utopiabound Nathaniel Clark added the comment above.

          People

            jhammond John Hammond
            maloo Maloo