[LU-9953] sanity-hsm test_31a fails with 'request on [0x200000404:0xd3:0x0] is not SUCCEED on mds1' Created: 06/Sep/17 Updated: 07/Feb/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1, Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity-hsm test_31a fails because the HSM request state never changes from RESTORE to SUCCEED within the test's 200 second timeout. From the test log:
size seen is right: 32000000 == 32000000
restore is too long
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 200 secs for update
…
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Update not seen after 200s: wanted 'SUCCEED' got ''
sanity-hsm test_31a: @@@@@@ FAIL: request on [0x200000404:0xd3:0x0] is not SUCCEED on mds1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5291:error()
= /usr/lib64/lustre/tests/sanity-hsm.sh:722:wait_request_state()
= /usr/lib64/lustre/tests/sanity-hsm.sh:2908:restore_and_check_size()
= /usr/lib64/lustre/tests/sanity-hsm.sh:2925:test_31a()
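The wait that times out is essentially a polling loop around the lctl command shown above. A minimal sketch of the equivalent logic in bash (the 200 second budget and field positions come from the log; this paraphrases the test framework's wait_update/wait_request_state helpers rather than quoting them, and uses grep -F in place of the original awk regex to avoid escaping the FID's brackets):

    fid="0x200000404:0xd3:0x0"
    deadline=$((SECONDS + 200))
    state=""
    while [ "$SECONDS" -lt "$deadline" ]; do
        # per the logged command, field 13 of an hsm.actions line is "status=<STATE>"
        state=$(lctl get_param -n mdt.lustre-MDT0000.hsm.actions |
                grep -F "$fid" | grep 'action=RESTORE' |
                awk '{print $13}' | cut -f2 -d=)
        [ "$state" = "SUCCEED" ] && break
        sleep 1
    done
    [ "$state" = "SUCCEED" ] ||
        echo "Update not seen after 200s: wanted 'SUCCEED' got '$state'"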
From the copytool log, we can see that the file or directory does not exist:
1504662572.818431 lhsmtool_posix[8903]: progress ioctl for copy '/tmp/arc1/shsm/00d3/0000/0404/0000/0002/0000/0x200000404:0xd3:0x0'->'{VOLATILE}=[0x200000403:0x83:0x0]' failed: No such file or directory (2)
1504662572.818447 lhsmtool_posix[8903]: copied 32000000 bytes in 0.007760 seconds
1504662572.818453 lhsmtool_posix[8903]: cannot copy data from '/tmp/arc1/shsm/00d3/0000/0404/0000/0002/0000/0x200000404:0xd3:0x0' to '{VOLATILE}=[0x200000403:0x83:0x0]': No such file or directory (2)
1504662572.818465 lhsmtool_posix[8903]: Action completed, notifying coordinator cookie=0x59af5427, FID=[0x200000404:0xd3:0x0], hp_flags=0 err=2
1504662572.818960 lhsmtool_posix[8903]: llapi_hsm_action_end() on '/mnt/lustre2/.lustre/fid/0x200000404:0xd3:0x0' failed: No such file or directory (2)
exiting: Terminated
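For anyone reproducing this by hand, the restore path the test exercises can be driven with standard lfs commands; a hedged sketch (the file path is illustrative, and the test's actual setup, which imports the file via lhsmtool_posix, is omitted):

    f=/mnt/lustre/f31a                 # illustrative path, not the test's actual file
    lfs hsm_state "$f"                 # expect the file to be released and archived
    lfs hsm_restore "$f"               # queue a RESTORE request with the coordinator
    stat -c %s "$f"                    # size should read correctly mid-restore
                                       # (the "size seen is right" check above)
    lctl get_param -n mdt.lustre-MDT0000.hsm.actions   # coordinator's view of the request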
From the MDS console, we see two error messages from the mdt_coordinator; the first error message is seen in logs where the test passes, but the second error message is, I think, a real error:
01:51:44:[12922.396156] Lustre: DEBUG MARKER: == sanity-hsm test 31a: Import a large file and check size during restore ============================ 01:49:28 (1504662568)
01:51:44:[12923.977685] Lustre: HSM agent 809e6962-34fe-a02a-1cb7-6b8a8eba00b0 already registered
01:51:44:[12924.998864] LustreError: 10237:0:(mdt_coordinator.c:1535:mdt_hsm_update_request_state()) lustre-MDT0000: Progress on [0x200000403:0x82:0x0] for cookie 0x59af5427 does not match request FID [0x200000404:0xd3:0x0] nor data FID [0x200000404:0xd3:0x0]
01:51:44:[12925.005604] LustreError: 8149:0:(mdt_coordinator.c:1505:mdt_hsm_update_request_state()) lustre-MDT0000: Cannot find running request for cookie 0x59af5427 on fid=[0x200000403:0x83:0x0]
01:51:44:[13034.210227] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print $13}' | cut -f2 -d=
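One way to judge whether the second coordinator message is the real failure is to correlate its cookie across the coordinator's action list and the copytool log; a sketch using the cookie from the messages above (the copytool log path is illustrative):

    cookie=0x59af5427
    # coordinator side: the request this cookie belongs to, with its request and data FIDs
    lctl get_param -n mdt.lustre-MDT0000.hsm.actions | grep -F "$cookie"
    # copytool side: the progress/end notifications that carried the same cookie
    grep -F "$cookie" /tmp/copytool.log     # illustrative path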
Logs for these failures are at b2_10: |
| Comments |
| Comment by Tim Day [ 07/Feb/23 ] |
|
+1 https://review.whamcloud.com/c/fs/lustre-release/+/49916, https://testing.whamcloud.com/test_sets/26377a4b-65a2-4225-8e79-7875779b5cd6 This happened on 251. This seems to occur very sporadically, affecting different tests with very similar symptoms. |
| Comment by Tim Day [ 07/Feb/23 ] |
|
See: https://jira.whamcloud.com/browse/LU-10440 https://jira.whamcloud.com/browse/LU-8136 https://jira.whamcloud.com/browse/LU-4093 https://jira.whamcloud.com/browse/LU-3985 https://jira.whamcloud.com/browse/LU-3863 https://jira.whamcloud.com/browse/LU-9300 https://jira.whamcloud.com/browse/LU-4652 https://jira.whamcloud.com/browse/LU-4125 https://jira.whamcloud.com/browse/LU-9306 and many more. I don't know if there is already a master ticket to track these (or if Jira even has a mechanism like that). |