Lustre / LU-9991

sanity-hsm test 31c fails with 'copytools failed to stop'


Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • 3

    Description

      sanity-hsm test_31c fails to archive the newly created file:

      Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
       sanity-hsm test_31c: @@@@@@ FAIL: request on 0x200000402:0x38:0x0 is not SUCCEED on mds1 
      

      and then the copytool cannot be shut down:

      copytools still running on trevis-51vm6
      CMD: trevis-51vm6 pgrep -x lhsmtool_posix
      trevis-51vm6: 6036
      copytools still running on trevis-51vm6
      CMD: trevis-51vm6 echo 1 >/proc/sys/kernel/sysrq ;  echo t >/proc/sysrq-trigger
      copytools failed to stop in 200s
       sanity-hsm test_31c: @@@@@@ FAIL: copytools failed to stop 
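      Both failures above read like timeouts of a poll-until-condition loop with a 200-second limit. The following is a minimal sketch of that pattern, not the actual sanity-hsm or test-framework code; the name wait_for and all of its parameters are hypothetical and exist only for illustration.

```python
import time
from typing import Callable, Tuple


def wait_for(check: Callable[[], str], wanted: str, timeout_s: float,
             interval_s: float = 1.0,
             clock: Callable[[], float] = time.monotonic,
             sleep: Callable[[float], None] = time.sleep) -> Tuple[bool, str]:
    """Poll check() until it returns `wanted` or `timeout_s` elapses.

    Hypothetical helper: returns (succeeded, last_value_seen) so the caller
    can report both the wanted and the observed value on timeout.
    """
    deadline = clock() + timeout_s
    last = check()
    while last != wanted and clock() < deadline:
        sleep(interval_s)
        last = check()
    return last == wanted, last


if __name__ == "__main__":
    # Simulated request states; a real check would query the MDS instead.
    states = iter(["STARTED", "STARTED", "SUCCEED"])
    ok, last = wait_for(lambda: next(states), "SUCCEED",
                        timeout_s=5, interval_s=0.01)
    print(ok, last)
```

      With a check that never leaves 'STARTED', the helper returns (False, 'STARTED') once the deadline passes, which matches the shape of the "wanted 'SUCCEED' got 'STARTED'" message above.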
      

      According to the copytool logs, the archive was progressing and the data copy itself finished in about 33 seconds, so it does not look like the archive exceeded 200 seconds unless the copytool hung:

      1505294662.538461 lhsmtool_posix[6038]: '[0x200000402:0x38:0x0]' action ARCHIVE reclen 72, cookie=0x59b8f942
      1505294662.539849 lhsmtool_posix[6038]: processing file 'd31c.sanity-hsm/f31c.sanity-hsm'
      1505294662.555950 lhsmtool_posix[6038]: archiving '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' to '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
      1505294662.556901 lhsmtool_posix[6038]: saving stripe info of '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' in /tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp.lov
      1505294662.558302 lhsmtool_posix[6038]: start copy of 34603008 bytes from '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' to '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
      1505294692.836074 lhsmtool_posix[6038]: %90 
      1505294692.838435 lhsmtool_posix[6038]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
      1505294695.842294 lhsmtool_posix[6038]: copied 34603008 bytes in 33.285317 seconds
      1505294695.896192 lhsmtool_posix[6038]: data archiving for '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' to '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' done
      1505294695.896379 lhsmtool_posix[6038]: attr file for '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' saved to archive '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
      1505294695.897327 lhsmtool_posix[6038]: fsetxattr of 'trusted.hsm' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
      1505294695.897360 lhsmtool_posix[6038]: fsetxattr of 'trusted.version' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
      1505294695.897402 lhsmtool_posix[6038]: fsetxattr of 'trusted.link' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
      1505294695.897432 lhsmtool_posix[6038]: fsetxattr of 'trusted.lov' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
      1505294695.897476 lhsmtool_posix[6038]: fsetxattr of 'trusted.lma' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
      1505294695.897515 lhsmtool_posix[6038]: fsetxattr of 'lustre.lov' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=-1 (Operation not supported)
      1505294695.897531 lhsmtool_posix[6038]: xattr file for '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' saved to archive '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
      1505294695.898946 lhsmtool_posix[6038]: symlink '/tmp/arc1/shsm/shadow/d31c.sanity-hsm/f31c.sanity-hsm' to '../../0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0' done
      exiting: Terminated
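
      One way to check the timing in a log like the one above is to difference its timestamps. The following is a minimal sketch, assuming each entry has the layout `<epoch seconds> lhsmtool_posix[<pid>]: <message>` seen in the excerpt. The function names are hypothetical, and the sample is a subset of lines copied from the excerpt. Decimal is used so that microsecond timestamps are not rounded by floating point.

```python
import re
from decimal import Decimal

# Assumed line layout: "<epoch.usec> lhsmtool_posix[<pid>]: <message>"
LINE_RE = re.compile(r"^\s*(\d+\.\d+) lhsmtool_posix\[(\d+)\]: (.*)$")


def parse_entries(log_text):
    """Return a list of (timestamp, pid, message); other lines are skipped."""
    entries = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            entries.append((Decimal(m.group(1)), int(m.group(2)), m.group(3)))
    return entries


def elapsed_seconds(entries):
    """Time between the first and the last timestamped entry."""
    return entries[-1][0] - entries[0][0]


def largest_gap(entries):
    """Return (gap_seconds, message_before_gap) for the longest silence."""
    pairs = zip(entries, entries[1:])
    return max(((b[0] - a[0], a[2]) for a, b in pairs), key=lambda g: g[0])


SAMPLE = """\
1505294662.538461 lhsmtool_posix[6038]: '[0x200000402:0x38:0x0]' action ARCHIVE reclen 72, cookie=0x59b8f942
1505294662.539849 lhsmtool_posix[6038]: processing file 'd31c.sanity-hsm/f31c.sanity-hsm'
1505294692.836074 lhsmtool_posix[6038]: %90
1505294695.842294 lhsmtool_posix[6038]: copied 34603008 bytes in 33.285317 seconds
exiting: Terminated
"""

if __name__ == "__main__":
    entries = parse_entries(SAMPLE)
    print("elapsed:", elapsed_seconds(entries))
    print("largest gap:", largest_gap(entries))
```

      Lines without a timestamp, such as 'exiting: Terminated', are ignored, so the sketch cannot say how long the copytool ran after its last logged message.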
      

      So far, we’ve only seen this ‘copytools failed to stop’ error in one patch test session. Logs for this failure are at
      https://testing.hpdd.intel.com/test_sets/3e2f0620-9872-11e7-b775-5254006e85c2

      We’ve seen the failure to archive a file several times for this test, and it has at least once been attributed to LU-7988.

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)