Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14208

sanity-hsm test 55 fails with 'request on 0xM:0xN:0x0 is not FAILED on mds1'

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • Lustre 2.14.0
    • RHEL8.3 severs/clients
    • 3
    • 9223372036854775807

    Description

      sanity-hsm test_55 fails with 'request on 0x200001b79:0x4:0x0 is not FAILED on mds1' for RHEL 8.3 client/server testing for review-dne-zfs-part-2 and review-dne-part-2 (ldiskfs). Although test 55 has failed many times with this message, the failures for the REHL8.3 testing have a consistent copytool_log where the copy is stuck at 8% (“%8”?):

      lhsmtool_posix: 1607364549.271481 lhsmtool_posix[152920]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre2
      lhsmtool_posix: 1607364549.279162 lhsmtool_posix[152922]: waiting for message from kernel
      lhsmtool_posix: 1607364549.852166 lhsmtool_posix[152922]: copytool fs=lustre archive#=2 item_count=1
      lhsmtool_posix: 1607364549.852331 lhsmtool_posix[152922]: waiting for message from kernel
      lhsmtool_posix: 1607364549.852363 lhsmtool_posix[152923]: '[0x200001b79:0x4:0x0]' action ARCHIVE reclen 72, cookie=0x5fce6f8f
      lhsmtool_posix: 1607364549.853177 lhsmtool_posix[152923]: processing file 'd55.sanity-hsm/f55.sanity-hsm'
      lhsmtool_posix: 1607364549.868566 lhsmtool_posix[152923]: archiving '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' to '/tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp'
      lhsmtool_posix: 1607364549.868732 lhsmtool_posix[152923]: saving stripe info of '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' in /tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp.lov
      lhsmtool_posix: 1607364549.870442 lhsmtool_posix[152923]: start copy of 39000000 bytes from '/mnt/lustre2/.lustre/fid/0x200001b79:0x4:0x0' to '/tmp/arc1/sanity-hsm.test_55//0004/0000/1b79/0000/0002/0000/0x200001b79:0x4:0x0_tmp'
      lhsmtool_posix: 1607364579.001120 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364609.001120 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364639.000122 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364669.001121 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364699.001121 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364729.001120 lhsmtool_posix[152923]: %8 
      lhsmtool_posix: 1607364759.001121 lhsmtool_posix[152923]: %8 
      exiting: Terminated
      

      The output from the client looks as expected for these failures, from the suite_log, write the file, start the archive and wait for the expected state which, in this case, is “FAILED”

      == sanity-hsm test 55: Truncate during an archive cancels it ========================================= 18:09:06 (1607364546)
      39+0 records in
      39+0 records out
      39000000 bytes (39 MB, 37 MiB) copied, 0.146523 s, 266 MB/s
      CMD: trevis-202vm3 mkdir -p /tmp/arc1/sanity-hsm.test_55/
      Starting copytool agt1 on trevis-202vm3
      CMD: trevis-202vm3 lhsmtool_posix  --daemon --hsm-root "/tmp/arc1/sanity-hsm.test_55/" --bandwidth 1 "/mnt/lustre2" < /dev/null > "/autotest/autotest-1/2020-12-07/lustre-reviews_review-dne-zfs-part-2_78162_1_108_a473c2cb-2a7d-4640-8bee-2a175a1428d2/sanity-hsm.test_55.copytool_log.trevis-202vm3.log" 2>&1
      CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: trevis-202vm5 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm.policy=+NRA
      mdt.lustre-MDT0000.hsm.policy=+NRA
      CMD: trevis-202vm6 /usr/sbin/lctl set_param mdt.lustre-MDT0001.hsm.policy=+NRA
      mdt.lustre-MDT0001.hsm.policy=+NRA
      CMD: trevis-202vm5 /usr/sbin/lctl set_param mdt.lustre-MDT0002.hsm.policy=+NRA
      mdt.lustre-MDT0002.hsm.policy=+NRA
      CMD: trevis-202vm6 /usr/sbin/lctl set_param mdt.lustre-MDT0003.hsm.policy=+NRA
      mdt.lustre-MDT0003.hsm.policy=+NRA
      CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      Waiting 200s for 'FAILED'
      CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      …
      CMD: trevis-202vm5 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200001b79:0x4:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      Update not seen after 200s: want 'FAILED' got 'STARTED'
       sanity-hsm test_55: @@@@@@ FAIL: request on 0x200001b79:0x4:0x0 is not FAILED on mds1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6257:error()
        = /usr/lib64/lustre/tests/test-framework.sh:10395:wait_request_state()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:2865:test_55()
      

      Logs for these failures are at
      https://testing.whamcloud.com/test_sets/358b2b64-faf2-4877-b5af-3f8a6cb54155
      https://testing.whamcloud.com/test_sets/1f82471e-e5c3-4057-9064-928b06e616a1

      Attachments

        Issue Links

          Activity

            People

              jhammond John Hammond
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: