Details
Technical task
Resolution: Fixed
Blocker
Lustre 2.5.0
9801
Description
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>
This issue relates to the following test suite runs:
http://maloo.whamcloud.com/test_sets/21cdd37a-094d-11e3-b004-52540035b04c
http://maloo.whamcloud.com/test_sets/2ee8358a-0931-11e3-b004-52540035b04c
http://maloo.whamcloud.com/test_sets/25a97004-094c-11e3-a9b0-52540035b04c
The sub-test test_8 failed with the following error:
request on 0x200008101:0x2:0x0 is not SUCCEED
Info required for matching: sanity-hsm 8
Activity
From what I have seen, the copy process was successful, copying xattrs should be okay as well. So from the snippet of ct_archive(), it should fail around:
	CT_TRACE("'%s' data archived to '%s' done\n", src, dst);

	/* attrs will remain on the MDS; no need to copy them, except possibly
	 * for disaster recovery */
	if (opt.o_copy_attrs) {
		rc = ct_copy_attr(src, dst, src_fd, dst_fd);
		if (rc < 0) {
			CT_ERROR("'%s' attr copy failed to '%s' (%s)\n",
				 src, dst, strerror(-rc));
			rcf = rc;
		}
		CT_TRACE("'%s' attr file copied to archive '%s'\n",
			 src, dst);
	}

	/* xattrs will remain on the MDS; no need to copy them, except possibly
	 * for disaster recovery */
	if (opt.o_copy_xattrs) {
		rc = ct_copy_xattr(src, dst, src_fd, dst_fd, false);
		if (rc < 0) {
			CT_ERROR("'%s' xattr copy failed to '%s' (%s)\n",
				 src, dst, strerror(-rc));
			rcf = rcf ? rcf : rc;
		}
		CT_ERROR("'%s' xattr file copied to archive '%s'\n",
			 src, dst);
	}

	if (rename_needed == true) {
		char tmp_src[PATH_MAX];
		char tmp_dst[PATH_MAX];

		/* atomically replace old archived file */
		ct_path_archive(src, sizeof(src), opt.o_hsm_root,
				&hai->hai_fid);
		rc = rename(dst, src);
		if (rc < 0) {
			CT_ERROR("'%s' renamed to '%s' failed (%s)\n",
				 dst, src, strerror(errno));
			rc = -errno;
			goto fini_major;
		}
		/* rename lov file */
		snprintf(tmp_src, sizeof(tmp_src), "%s.lov", src);
		snprintf(tmp_dst, sizeof(tmp_dst), "%s.lov", dst);
		rc = rename(tmp_dst, tmp_src);
		if (rc < 0)
			CT_ERROR("'%s' renamed to '%s' failed (%s)\n",
				 tmp_dst, tmp_src, strerror(errno));
	}

	if (opt.o_shadow_tree) {
		/* Create a namespace of softlinks that shadows the original
		 * Lustre namespace. This will only be current at
		 * time-of-archive (won't follow renames).
		 * WARNING: release won't kill these links; a manual
		 * cleanup of dead links would be required. */
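The snippet above keeps going after a failed attr or xattr copy and only remembers the error in rcf, so a run whose data copy succeeded could still end with a non-zero result. Below is a minimal sketch of that "remember the first error, keep going" pattern. The names keep_first_error and run_steps are hypothetical and not part of lhsmtool, and the snippet's attr branch actually assigns rcf = rc directly rather than going through such a helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: keep the first non-zero error code seen so far,
 * mirroring the "rcf = rcf ? rcf : rc" idiom in the quoted snippet. */
int keep_first_error(int rcf, int rc)
{
	return rcf ? rcf : rc;
}

/* Hypothetical stand-in for a sequence of copy steps. Each entry is the
 * return code of one step: 0 on success, a negative errno on failure.
 * Every step is looked at, and the first failure is what gets returned. */
int run_steps(const int *step_rc, int nsteps)
{
	int rcf = 0;
	int i;

	assert(step_rc != NULL && nsteps >= 0);
	for (i = 0; i < nsteps; i++) {
		if (step_rc[i] < 0)
			rcf = keep_first_error(rcf, step_rc[i]);
	}
	return rcf;
}
```

With step results {0, -EPERM, -EIO}, for example, run_steps() returns -EPERM: the later failure does not overwrite the earlier one, and the successful first step does not hide either.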
I suspect that it failed while copying attrs, if the archive was on an NFS share as John stated.
I would like to rule out '8d8f071 LU-3561 tests: improve sanity-hsm.sh to support remote agent' first, as this switches the archive directory from the local /tmp/arc to a global shared one (/home/cgearing/.autotest/shared_dir/2013-09-04/191904-70339907832920/arc1), which I assume is NFS-shared to the test VMs.
I did a query in maloo to list all of the recent sanity-hsm tests:
Name: sanity-hsm Test group: review Run within: 1 week Lustre branch: master
which has the following nasty URL that can't be shortened to drop the useless parts:
Using "git bisect good [hash]" and "git bisect bad [hash]" only goes so far when there is an intermittent failure. Instead, I just made a simple histogram using the edited output of "git log --pretty=oneline --abbrev-commit | head -60" to get a list of the recently committed patches. Then I started looking through the results of the maloo query, clicking on the "gerrit:NNNN" link for each test (pass or fail) and checking the "Parent(s):" line for each patch to see what commit the patch was based on. If there is ambiguity about this, the Maloo test results have the actual commit hash of the patch that was run, which should match the hash next to the "Patch Set N" line.
This is what I got, marking the number of (g)ood and (B)ad runs on each line:
     B6   23c1979 LU-3811 hsm: handle file ownership and timestamps
          325ce4b LU-3711 mdt: default archive id is tunable
g    B5   536f1a6 LU-3500 mdt: use __u64 for sp_cr_flags
     B6   036f0c6 LU-3788 man: add "lctl set_param -P" to lctl.8 man page
g4   B14  1686463 New tag 2.4.92
          143d09c LU-2800 autoconf: remove LC_RW_TREE_LOCK test
          9699de5 LU-3841 lod: handle released defined layouts
     B2   0067260 LU-2800 autoconf: update LC_INODE_PERMISION_2ARGS test
          07e6e38 LU-2800 autoconf: remove LC_FS_RENAME_DOES_D_MOVE test
          b50f730 LU-3789 mgs: Add deprecation warning for "lctl conf_param"
     B2   eb22854 LU-3581 osc: Lustre returns EINTR from writes when SA_RESTART is set
          3b18568 LU-3549 llapi: add printf attribute to llapi_{printf,error}
          fa16465 LU-3497 build: Use alt. path for ZFS development headers
          1f63b38 LU-2800 autoconf: remove LC_SECURITY_PLUG test
          aa41bab LU-3768 tunefs: make tunefs.lustre work correctly for MGS
          5375158 LU-3787 test: Missing "$" on variable
g         7b28134 LU-2484 obd: add md_stats to MDC and LMV devices
          1c9d076 LU-2675 cleanup: define sparse annotations for libcfs
          50c7a76 LU-2253 tests: get proper free indoes
          5cb3841 LU-3720 test: use actual stripe count in sanity.sh 130d
          a5b4279 LU-3703 tests: skip getfattr part of sanity test 234
          aeb887a LU-3763 utils: set multipath devices recursively
          d8179d2 LU-3508 test: small fix for sanity.sh test_101c
          862cdaa LU-3097 build: fix 'no effect' errors
          842a632 LU-3647 hsm: Add support to drop all pages for ll_data_version
          8d8f071 LU-3561 tests: improve sanity-hsm.sh to support remote agent
g7        3f92a01 LU-1346 libcfs: cleanup cfs_curproc_xxx macros
So the bad patch is almost certainly between 3f92a01..1686463, and maybe between 7b28134..eb22854 though there isn't enough information to be certain (the one pass at 7b28134 may have been a fluke).
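The narrowing step can be written down as a small routine. This is only a sketch of the reasoning, not a tool that was used here: it takes the histogram rows that have at least one run (newest first, as in "git log"), finds the oldest commit with any bad run, and then the newest older commit that has only good runs. The struct and function names are made up for illustration, and a bare "g" in the histogram is assumed to mean one good run.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One histogram row: a parent commit and the runs based on it. */
struct commit_tally {
	const char *hash;
	int good;
	int bad;
};

/* Rows with at least one run, newest first. Counts are taken from the
 * histogram; a bare "g" is assumed to be a single good run. */
const struct commit_tally tallies[] = {
	{ "23c1979", 0, 6 },
	{ "536f1a6", 1, 5 },
	{ "036f0c6", 0, 6 },
	{ "1686463", 4, 14 },
	{ "0067260", 0, 2 },
	{ "eb22854", 0, 2 },
	{ "7b28134", 1, 0 },
	{ "3f92a01", 7, 0 },
};

/* Find the tightest (last_good, first_bad) bracket in a newest-first log.
 * Returns 0 and fills both indices, or -1 if no bracket exists. */
int suspect_range(const struct commit_tally *log, size_t n,
		  size_t *last_good, size_t *first_bad)
{
	size_t i, oldest_bad = n;

	assert(log != NULL && last_good != NULL && first_bad != NULL);

	/* The last match in a newest-first scan is the oldest bad commit. */
	for (i = 0; i < n; i++)
		if (log[i].bad > 0)
			oldest_bad = i;
	if (oldest_bad == n)
		return -1;

	/* The first good-only row after it is the newest known-good parent. */
	for (i = oldest_bad + 1; i < n; i++) {
		if (log[i].good > 0 && log[i].bad == 0) {
			*last_good = i;
			*first_bad = oldest_bad;
			return 0;
		}
	}
	return -1;
}
```

On these rows the routine brackets 7b28134..eb22854. Because the failure is intermittent, a single pass is weak evidence, which is why the wider 3f92a01 bound (seven passes) is the safer one to trust.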
It looks like a bad patch landed between 2013-09-02 and 2013-09-04. The first patch that fails 44/90 tests in sanity-hsm is
http://review.whamcloud.com/7490 with parent commit eb2285474 (LU-3581 osc: Lustre returns EINTR from writes when SA_RESTART is set), so it is possibly this parent or a patch that landed shortly before this. You can use a Maloo query to get the list of sanity-hsm failures, and then check which parents are consistently passing, and which ones are causing failures.
There are lots of ldiskfs tests failing in maloo right now. Moved back to blocker.
More failures here
https://maloo.whamcloud.com/test_sets/b32b57f2-1627-11e3-a84e-52540035b04c
Ldiskfs sighting:
https://maloo.whamcloud.com/test_sets/074edea0-14e6-11e3-8ae7-52540035b04c
Test 8 and lots of other subtests failed in the same way.
This looks a lot like most of the failing tests for zfs in sanity-hsm. hsmtool doesn't seem to be functioning.
I just did an experiment on rosso and it verified my guess, as follows:
I did the same thing on toro and it worked, so this is why the tests only failed sometimes: if the test was running on rosso, it failed. Pretty nasty, huh?