Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.10.1, Lustre 2.11.0
-
None
-
3
-
9223372036854775807
Description
sanity-hsm test_31a fails because the test waits 200 seconds for the state to change from RESTORE to SUCCEED. From the test log:
size seen is right: 32000000 == 32000000
restore is too long
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 200 secs for update
…
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Update not seen after 200s: wanted 'SUCCEED' got ''
sanity-hsm test_31a: @@@@@@ FAIL: request on [0x200000404:0xd3:0x0] is not SUCCEED on mds1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5291:error()
= /usr/lib64/lustre/tests/sanity-hsm.sh:722:wait_request_state()
= /usr/lib64/lustre/tests/sanity-hsm.sh:2908:restore_and_check_size()
= /usr/lib64/lustre/tests/sanity-hsm.sh:2925:test_31a()
From the copytool log, we can see that the file or directory does not exist
1504662572.818431 lhsmtool_posix[8903]: progress ioctl for copy '/tmp/arc1/shsm/00d3/0000/0404/0000/0002/0000/0x200000404:0xd3:0x0'->'{VOLATILE}=[0x200000403:0x83:0x0]' failed: No such file or directory (2)
1504662572.818447 lhsmtool_posix[8903]: copied 32000000 bytes in 0.007760 seconds
1504662572.818453 lhsmtool_posix[8903]: cannot copy data from '/tmp/arc1/shsm/00d3/0000/0404/0000/0002/0000/0x200000404:0xd3:0x0' to '{VOLATILE}=[0x200000403:0x83:0x0]': No such file or directory (2)
1504662572.818465 lhsmtool_posix[8903]: Action completed, notifying coordinator cookie=0x59af5427, FID=[0x200000404:0xd3:0x0], hp_flags=0 err=2
1504662572.818960 lhsmtool_posix[8903]: llapi_hsm_action_end() on '/mnt/lustre2/.lustre/fid/0x200000404:0xd3:exiting: Terminated
file or directory (2)
exiting: Terminated
From the MDS console, we see two error messages from the mdt_coordinator; the first error message is seen in logs where the test passes, but the second error message is, I think, a real error
01:51:44:[12922.396156] Lustre: DEBUG MARKER: == sanity-hsm test 31a: Import a large file and check size during restore ============================ 01:49:28 (1504662568)
01:51:44:[12923.977685] Lustre: HSM agent 809e6962-34fe-a02a-1cb7-6b8a8eba00b0 already registered
01:51:44:[12924.998864] LustreError: 10237:0:(mdt_coordinator.c:1535:mdt_hsm_update_request_state()) lustre-MDT0000: Progress on [0x200000403:0x82:0x0] for cookie 0x59af5427 does not match request FID [0x200000404:0xd3:0x0] nor data FID [0x200000404:0xd3:0x0]
01:51:44:[12925.005604] LustreError: 8149:0:(mdt_coordinator.c:1505:mdt_hsm_update_request_state()) lustre-MDT0000: Cannot find running request for cookie 0x59af5427 on fid=[0x200000403:0x83:0x0]
01:51:44:[13034.210227] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print $13}' | cut -f2 -d=
Logs for these failures are at
master:
https://testing.hpdd.intel.com/test_sets/13c63484-92ab-11e7-b67f-5254006e85c2
b2_10:
https://testing.hpdd.intel.com/test_sets/95218046-7f6f-11e7-96e0-5254006e85c2