Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7136

sanity-hsm test_12q failed with 'Copytool should have stopped'

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.8.0
    • Lustre 2.8.0
    • autotest
    • 3
    • 9223372036854775807

    Description

      sanity-hsm test 12q failed with 'Copytool should have stopped'. From the test log we can see that the archive succeeded and then:

      Updated after 2s: wanted 'SUCCEED' got 'SUCCEED'
      Killing existing copytools on shadow-39vm5
      CMD: shadow-39vm5 killall -q lhsmtool_posix
      CMD: shadow-39vm5 pgrep -x lhsmtool_posix
      shadow-39vm5: 16199
       sanity-hsm test_12q: @@@@@@ FAIL: Copytool should have stopped 
      

      It’s not clear what the is causing the problem, but there are differences in the copytool logs from a successful run and this failure. From the copytool log on a failed test, we see that the action is ARCHIVE

      1441818057.605721 lhsmtool_posix[16199]: waiting for message from kernel
      1441818057.605768 lhsmtool_posix[16200]: '[0x200000402:0x12:0x0]' action ARCHIVE reclen 72, cookie=0x55f065c2
      1441818057.609298 lhsmtool_posix[16200]: processing file 'd12q.sanity-hsm/f12q.sanity-hsm'
      1441818057.618552 lhsmtool_posix[16200]: archiving '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp'
      1441818057.620731 lhsmtool_posix[16200]: saving stripe info of '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' in /tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp.lov
      1441818057.624113 lhsmtool_posix[16200]: start copy of 2097152 bytes from '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp'
      1441818059.715448 lhsmtool_posix[16200]: copied 2097152 bytes in 2.094523 seconds
      1441818059.841860 lhsmtool_posix[16200]: data archiving for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' to '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' done
      1441818059.842153 lhsmtool_posix[16200]: attr file for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' saved to archive '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp'
      1441818059.844232 lhsmtool_posix[16200]: fsetxattr of 'trusted.hsm' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success)
      1441818059.844289 lhsmtool_posix[16200]: fsetxattr of 'trusted.version' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success)
      1441818059.844356 lhsmtool_posix[16200]: fsetxattr of 'trusted.link' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success)
      1441818059.844403 lhsmtool_posix[16200]: fsetxattr of 'trusted.lov' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success)
      1441818059.844442 lhsmtool_posix[16200]: fsetxattr of 'trusted.lma' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=0 (Success)
      1441818059.844480 lhsmtool_posix[16200]: fsetxattr of 'lustre.lov' on '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp' rc=-1 (Operation not supported)
      1441818059.844493 lhsmtool_posix[16200]: xattr file for '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' saved to archive '/tmp/arc1/0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0_tmp'
      1441818059.846873 lhsmtool_posix[16200]: symlink '/tmp/arc1/shadow/d12q.sanity-hsm/f12q.sanity-hsm' to '../../0012/0000/0402/0000/0002/0000/0x200000402:0x12:0x0' done
      1441818059.848407 lhsmtool_posix[16200]: Action completed, notifying coordinator cookie=0x55f065c2, FID=[0x200000402:0x12:0x0], hp_flags=0 err=0
      1441818059.852138 lhsmtool_posix[16200]: llapi_hsm_action_end() on '/mnt/lustre3/.lustre/fid/0x200000402:0x12:0x0' ok (rc=0)
      exiting: Terminated
      

      On a successful run of test 12q (https://testing.hpdd.intel.com/sub_tests/4d322136-57d1-11e5-a084-5254006e85c2), the copytool logs does not show an ARCHIVE taking place, but only a RESTORE is being done:

      1441882443.432895 lhsmtool_posix[5788]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre3
      1441882443.441409 lhsmtool_posix[5789]: waiting for message from kernel
      1441882446.704674 lhsmtool_posix[5789]: copytool fs=lustre archive#=2 item_count=1
      1441882446.704837 lhsmtool_posix[5789]: waiting for message from kernel
      1441882446.705275 lhsmtool_posix[5790]: '[0x200000405:0x12:0x0]' action RESTORE reclen 72, cookie=0x55f16107
      1441882446.707572 lhsmtool_posix[5790]: processing file 'd12q.sanity-hsm/f12q.sanity-hsm'
      1441882446.707649 lhsmtool_posix[5790]: reading stripe rules from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0.lov' for '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0'
      1441882446.722704 lhsmtool_posix[5790]: restoring data from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]'
      1441882446.729598 lhsmtool_posix[5790]: start copy of 2097152 bytes from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]'
      1441882448.752649 lhsmtool_posix[5790]: copied 2097152 bytes in 2.026376 seconds
      1441882448.752715 lhsmtool_posix[5790]: data restore from '/tmp/arc1/0012/0000/0405/0000/0002/0000/0x200000405:0x12:0x0' to '{VOLATILE}=[0x200000408:0x1:0x0]' done
      1441882448.752730 lhsmtool_posix[5790]: Action completed, notifying coordinator cookie=0x55f16107, FID=[0x200000405:0x12:0x0], hp_flags=0 err=0
      1441882448.836919 lhsmtool_posix[5790]: llapi_hsm_action_end() on '/mnt/lustre3/.lustre/fid/0x200000405:0x12:0x0' ok (rc=0)
      exiting: Interrupt
      

      Sanity-hsm test 12q does both an hsm_archive and a hsm_release.

      We’ve seen this test fail twice recently. Logs are at
      2015-09-09 16:53:45 – review-zfs- part-1 - https://testing.hpdd.intel.com/test_sets/ff615ed0-5727-11e5-84d0-5254006e85c2
      2015-09-05 23:31:22 - review-dne-part-2 - https://testing.hpdd.intel.com/test_sets/6c2cd9fe-5453-11e5-b25e-5254006e85c2

      Prior to that, sanity-hsm test 12q failed in the same way five times in August. Here is one of those logs:
      2015-08-08 03:39:47 - review-dne-part-2 - https://testing.hpdd.intel.com/test_sets/8724f110-3dab-11e5-be57-5254006e85c2

      Attachments

        Issue Links

          Activity

            People

              jhammond John Hammond
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: