sanity-hsm test_8 and many more failed: 36/91 passed

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.6.0
    • None
    • 3
    • 11743

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run:
      http://maloo.whamcloud.com/test_sets/192e2df2-50a4-11e3-a19b-52540035b04c
      https://maloo.whamcloud.com/test_sets/c29d2458-50a2-11e3-b42b-52540035b04c
      https://maloo.whamcloud.com/test_sets/9cc72cb2-4ffb-11e3-a56f-52540035b04c
      https://maloo.whamcloud.com/test_sets/49e9eb44-4f45-11e3-a56f-52540035b04c
      etc

      The sub-test test_8 failed with the following error:

      request on 0x200000401:0x2:0x0 is not SUCCEED

      In fact, Maloo reports that only 36/91 tests passed. These failures are being attributed to LU-4114, but do not actually seem to be related.

      Info required for matching: sanity-hsm 8

          Activity

            [LU-4275] sanity-hsm test_8 and many more failed: 36/91 passed

            bfaccini Bruno Faccini (Inactive) added a comment - Duplicated by TEI-1208.
            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            More sanity-hsm failures have been linked to this ticket. As before, they occur only during auto-test runs on the superfat-intel-1vm* nodes, and are still due to the same EPERM error during fchmod()/fchown()/futimes() operations issued by the copytool on the NFS-mounted hsm-root filesystem.
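
            To check whether a given archive directory rejects these attribute operations, something like the following sketch could be run on the affected node. It is not taken from the copytool (which is written in C); it only exercises the same three operations from Python on a scratch file. The function name `probe_attr_ops` and its parameters are hypothetical, and Linux is assumed so that `os.utime()` accepts a file descriptor.

```python
import errno
import os
import tempfile


def probe_attr_ops(directory, uid=None, gid=None):
    """Try fchmod/fchown/futimes-style calls on a scratch file in `directory`.

    Returns a dict mapping the operation name to None on success, or to the
    errno name (for example 'EPERM') on failure. `uid`/`gid` default to the
    calling user's own ids (hypothetical helper, not part of Lustre).
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    results = {}
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        ops = {
            "fchmod": lambda: os.fchmod(fd, 0o640),
            "fchown": lambda: os.fchown(fd, uid, gid),
            # os.utime() on an fd is the Python counterpart of futimes()
            "futimes": lambda: os.utime(fd, (0, 0)),
        }
        for name, op in ops.items():
            try:
                op()
                results[name] = None
            except OSError as exc:
                results[name] = errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
        os.unlink(path)
    return results
```

            Running it as the same user as the copytool, once against the NFS-mounted hsm-root and once against a local directory, with the uid/gid of a test file's owner, should show whether the EPERM is specific to the NFS mount.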


            bfaccini Bruno Faccini (Inactive) added a comment

            For each failing sub-test, the copytool log looks like the following:

            lhsmtool_posix[29866]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre
            lhsmtool_posix[29867]: waiting for message from kernel
            lhsmtool_posix[29867]: copytool fs=lustre archive#=1 item_count=1
            lhsmtool_posix[29867]: waiting for message from kernel
            lhsmtool_posix[30544]: '[0x200000401:0x2:0x0]' action ARCHIVE reclen 72, cookie=0x528a6d6b
            lhsmtool_posix[30544]: processing file 'd0.sanity-hsm/d8/f.sanity-hsm.8'
            lhsmtool_posix[30544]: archiving '/mnt/lustre/.lustre/fid/0x200000401:0x2:0x0' to '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp'
            lhsmtool_posix[30544]: saving stripe info of '/mnt/lustre/.lustre/fid/0x200000401:0x2:0x0' in /home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp.lov
            lhsmtool_posix[30544]: going to copy data from '/mnt/lustre/.lustre/fid/0x200000401:0x2:0x0' to '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp'
            lhsmtool_posix[30544]: data archiving for '/mnt/lustre/.lustre/fid/0x200000401:0x2:0x0' to '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp' done
            lhsmtool_posix[30544]: cannot set attributes of '/mnt/lustre/.lustre/fid/0x200000401:0x2:0x0': Operation not permitted (1)
            lhsmtool_posix[30544]: cannot copy attr of '/mnt/lustre/.lustre/fid/0x200000401:0x2:0x0' to '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp': Operation not permitted (1)
            lhsmtool_posix[30544]: attr file for '/mnt/lustre/.lustre/fid/0x200000401:0x2:0x0' saved to archive '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp'
            lhsmtool_posix[30544]: fsetxattr of 'trusted.hsm' on '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp' rc=-1 (Operation not supported)
            lhsmtool_posix[30544]: fsetxattr of 'trusted.link' on '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp' rc=-1 (Operation not supported)
            lhsmtool_posix[30544]: fsetxattr of 'trusted.lov' on '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp' rc=-1 (Operation not supported)
            lhsmtool_posix[30544]: fsetxattr of 'trusted.lma' on '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp' rc=-1 (Operation not supported)
            lhsmtool_posix[30544]: fsetxattr of 'lustre.lov' on '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp' rc=-1 (Operation not supported)
            lhsmtool_posix[30544]: xattr file for '/mnt/lustre/.lustre/fid/0x200000401:0x2:0x0' saved to archive '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0_tmp'
            lhsmtool_posix[30544]: symlink '/home/chris/.autotest/shared_dir/2013-11-18/031819-70364245912640/arc1/shadow/d0.sanity-hsm/d8/f.sanity-hsm.8' to '../../../0002/0000/0401/0000/0002/0000/0x200000401:0x2:0x0' done
            lhsmtool_posix[30544]: Action completed, notifying coordinator cookie=0x528a6d6b, FID=[0x200000401:0x2:0x0], hp_flags=0 err=1
            lhsmtool_posix[30544]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000401:0x2:0x0' ok (rc=0)
            exiting: Interrupt
            

            This means that the FAILED status returned for the sub-tests' HSM actions comes from errors during file operations on the hsm-root side, i.e. on the archive filesystem rather than on Lustre.
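
            As a rough aid for scanning such logs, the sketch below pulls out every "Action completed" line whose `err` value is non-zero. The regular expression is written against the log format shown above only; the function name `failed_actions` is hypothetical, and other copytool versions may format this line differently.

```python
import re

# Matches the "Action completed" line as it appears in the log above.
ACTION_END = re.compile(
    r"Action completed, notifying coordinator "
    r"cookie=(0x[0-9a-fA-F]+), FID=\[([^\]]+)\], hp_flags=\d+ err=(\d+)"
)


def failed_actions(log_text):
    """Return a list of (fid, cookie, err) for actions that ended with err != 0."""
    failures = []
    for match in ACTION_END.finditer(log_text):
        cookie, fid, err = match.group(1), match.group(2), int(match.group(3))
        if err != 0:
            failures.append((fid, cookie, err))
    return failures
```

            Applied to the log above, this would report the single ARCHIVE request on FID 0x200000401:0x2:0x0 with err=1, which matches the "Operation not permitted (1)" messages a few lines earlier.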

            adilger Andreas Dilger added a comment - - edited

            Chris, Joshua, Mike,
            The superfat-intel-1vm{1,5} nodes are causing 2/3 of all the sanity-hsm test failures. It looks like these nodes have not passed a single full test session all week due to a configuration issue with sanity-hsm.

            Please remove these nodes from the normal test rotation, since they only cause tests to fail, which then need to be resubmitted. They could be left for developers to reserve, or (if possible) set to run b2_1-b2_4 tests only.

            adilger Andreas Dilger added a comment - - edited

            Still causing 16/40 of the sanity-hsm test failures. It looks like it ALWAYS and ONLY fails on superfat-intel-1vm{1,5} (most recent pass on 2013-11-15, https://maloo.whamcloud.com/test_sets/e1d47c4e-4e19-11e3-a167-52540035b04c).

            Without having looked into it, I'd guess there is something wrong with configuring the NFS "archive" for these nodes? If the problem cannot be fixed quickly can these nodes be removed from the test queue, or configured so they only run e.g. b2_1 tests that do not need sanity-hsm?
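
            One quick way to confirm which filesystem actually backs the archive directory on a node is to look it up in the mount table. The sketch below does this for text in `/proc/mounts` format; the function name `fs_type_of` and the mount entries in the usage example are hypothetical, and escaped characters in mount points (such as `\040` for a space) are not handled.

```python
import os


def fs_type_of(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing `path`.

    `mounts_text` is the content of /proc/mounts (or text in the same format).
    Returns None if no mount point matches.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same mount point win
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


# Usage on a live node (hypothetical archive path):
# with open("/proc/mounts") as f:
#     print(fs_type_of("/home/chris/.autotest/shared_dir", f.read()))
```

            An `nfs` or `nfs4` type for the archive path would be consistent with the EPERM and "Operation not supported" errors in the copytool log, since the NFS server's export configuration can refuse ownership changes and `trusted.*` extended attributes.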


            adilger Andreas Dilger added a comment - This caused 16 of 50 recent sanity-hsm test failures, so it is pretty important to fix. Since it is causing 55 separate tests to fail, it cannot be fixed by simply skipping a single failing test.

            People

              bfaccini Bruno Faccini (Inactive)
              maloo Maloo