Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9953

sanity-hsm test_31a fails with 'request on [0x200000404:0xd3:0x0] is not SUCCEED on mds1'

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.10.1, Lustre 2.11.0
    • None
    • 3
    • 9223372036854775807

    Description

      sanity-hsm test_31a fails because the test waits 200 seconds for the state to change from RESTORE to SUCCEED. From the test log:

      size seen is right: 32000000 == 32000000
       restore is too long
      CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
      CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
      Waiting 200 secs for update
      …
      CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
      CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
      Update not seen after 200s: wanted 'SUCCEED' got ''
       sanity-hsm test_31a: @@@@@@ FAIL: request on [0x200000404:0xd3:0x0] is not SUCCEED on mds1 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5291:error()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:722:wait_request_state()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:2908:restore_and_check_size()
        = /usr/lib64/lustre/tests/sanity-hsm.sh:2925:test_31a()
      

      From the copytool log, we can see that the file or directory does not exist

      1504662572.818431 lhsmtool_posix[8903]: progress ioctl for copy '/tmp/arc1/shsm/00d3/0000/0404/0000/0002/0000/0x200000404:0xd3:0x0'->'{VOLATILE}=[0x200000403:0x83:0x0]' failed: No such file or directory (2)
      1504662572.818447 lhsmtool_posix[8903]: copied 32000000 bytes in 0.007760 seconds
      1504662572.818453 lhsmtool_posix[8903]: cannot copy data from '/tmp/arc1/shsm/00d3/0000/0404/0000/0002/0000/0x200000404:0xd3:0x0' to '{VOLATILE}=[0x200000403:0x83:0x0]': No such file or directory (2)
      1504662572.818465 lhsmtool_posix[8903]: Action completed, notifying coordinator cookie=0x59af5427, FID=[0x200000404:0xd3:0x0], hp_flags=0 err=2
      1504662572.818960 lhsmtool_posix[8903]: llapi_hsm_action_end() on '/mnt/lustre2/.lustre/fid/0x200000404:0xd3:exiting: Terminated
       file or directory (2)
      exiting: Terminated
      

      From the MDS console, we see two error messages from the mdt_coordinator; the first error message is seen in logs where the test passes, but the second error message is, I think, a real error

      01:51:44:[12922.396156] Lustre: DEBUG MARKER: == sanity-hsm test 31a: Import a large file and check size during restore ============================ 01:49:28 (1504662568)
      01:51:44:[12923.977685] Lustre: HSM agent 809e6962-34fe-a02a-1cb7-6b8a8eba00b0 already registered
      01:51:44:[12924.998864] LustreError: 10237:0:(mdt_coordinator.c:1535:mdt_hsm_update_request_state()) lustre-MDT0000: Progress on [0x200000403:0x82:0x0] for cookie 0x59af5427 does not match request FID [0x200000404:0xd3:0x0] nor data FID [0x200000404:0xd3:0x0]
      01:51:44:[12925.005604] LustreError: 8149:0:(mdt_coordinator.c:1505:mdt_hsm_update_request_state()) lustre-MDT0000: Cannot find running request for cookie 0x59af5427 on fid=[0x200000403:0x83:0x0]
      01:51:44:[13034.210227] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print $13}' | cut -f2 -d=
      

      Logs for these failures are at
      master:
      https://testing.hpdd.intel.com/test_sets/13c63484-92ab-11e7-b67f-5254006e85c2

      b2_10:
      https://testing.hpdd.intel.com/test_sets/95218046-7f6f-11e7-96e0-5254006e85c2

      Attachments

        Activity

          People

            wc-triage WC Triage
            jamesanunez James Nunez (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: