[LU-9953] sanity-hsm test_31a fails with 'request on [0x200000404:0xd3:0x0] is not SUCCEED on mds1' Created: 06/Sep/17  Updated: 07/Feb/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1, Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-hsm test_31a fails because the request state does not change from RESTORE to SUCCEED within the 200 seconds the test waits. From the test log:

size seen is right: 32000000 == 32000000
 restore is too long
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Waiting 200 secs for update
…
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
CMD: trevis-42vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print \$13}' | cut -f2 -d=
Update not seen after 200s: wanted 'SUCCEED' got ''
 sanity-hsm test_31a: @@@@@@ FAIL: request on [0x200000404:0xd3:0x0] is not SUCCEED on mds1 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5291:error()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:722:wait_request_state()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:2908:restore_and_check_size()
  = /usr/lib64/lustre/tests/sanity-hsm.sh:2925:test_31a()
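
The status check here is just a poll of the coordinator's action list. Below is a minimal sketch of that loop, assuming the 200-second limit, the MDT name, and the awk/cut field positions shown in the log above; it illustrates the pattern only and is not the actual wait_request_state() implementation (the grep -F step stands in for the bracketed-FID regex for simplicity):

#!/bin/bash
# Poll mdt.<MDT>.hsm.actions until the RESTORE request for a FID reports
# the wanted status, or give up after LIMIT seconds.
FID="[0x200000404:0xd3:0x0]"    # FID from this failure, for illustration
MDT=lustre-MDT0000
WANT=SUCCEED
LIMIT=200

for ((i = 0; i < LIMIT; i++)); do
    # Same idea as the pipeline in the log: isolate this FID's RESTORE
    # entry and keep only the value of its status field.
    state=$(lctl get_param -n mdt.$MDT.hsm.actions |
            grep -F "$FID" | grep "action=RESTORE" |
            awk '{print $13}' | cut -f2 -d=)
    [ "$state" = "$WANT" ] && exit 0
    sleep 1
done
echo "Update not seen after ${LIMIT}s: wanted '$WANT' got '$state'"
exit 1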

From the copytool log, we can see that the copy fails because the file or directory does not exist:

1504662572.818431 lhsmtool_posix[8903]: progress ioctl for copy '/tmp/arc1/shsm/00d3/0000/0404/0000/0002/0000/0x200000404:0xd3:0x0'->'{VOLATILE}=[0x200000403:0x83:0x0]' failed: No such file or directory (2)
1504662572.818447 lhsmtool_posix[8903]: copied 32000000 bytes in 0.007760 seconds
1504662572.818453 lhsmtool_posix[8903]: cannot copy data from '/tmp/arc1/shsm/00d3/0000/0404/0000/0002/0000/0x200000404:0xd3:0x0' to '{VOLATILE}=[0x200000403:0x83:0x0]': No such file or directory (2)
1504662572.818465 lhsmtool_posix[8903]: Action completed, notifying coordinator cookie=0x59af5427, FID=[0x200000404:0xd3:0x0], hp_flags=0 err=2
1504662572.818960 lhsmtool_posix[8903]: llapi_hsm_action_end() on '/mnt/lustre2/.lustre/fid/0x200000404:0xd3:0x0' failed: No such file or directory (2)
exiting: Terminated
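
The archive and restore path can be exercised by hand to check whether the archived copy under the HSM root is still present when the restore runs. A rough sketch using standard lfs hsm commands, assuming a mounted client and a copytool configured with the /tmp/arc1 archive root seen in the log (the file path is hypothetical):

# Archive, release, then restore a file and watch the coordinator's view.
f=/mnt/lustre/somedir/bigfile      # hypothetical test file
lfs hsm_archive $f                 # copytool writes the copy under the HSM root
lfs hsm_state $f                   # should report the file as archived
lfs hsm_release $f                 # drop the Lustre data, keep only the archive
lfs hsm_restore $f                 # the step that hits ENOENT in this failure
lctl get_param -n mdt.*.hsm.actions | grep "action=RESTORE"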

From the MDS console, we see two error messages from the mdt_coordinator. The first message also appears in logs where the test passes, but the second one is, I think, a real error:

01:51:44:[12922.396156] Lustre: DEBUG MARKER: == sanity-hsm test 31a: Import a large file and check size during restore ============================ 01:49:28 (1504662568)
01:51:44:[12923.977685] Lustre: HSM agent 809e6962-34fe-a02a-1cb7-6b8a8eba00b0 already registered
01:51:44:[12924.998864] LustreError: 10237:0:(mdt_coordinator.c:1535:mdt_hsm_update_request_state()) lustre-MDT0000: Progress on [0x200000403:0x82:0x0] for cookie 0x59af5427 does not match request FID [0x200000404:0xd3:0x0] nor data FID [0x200000404:0xd3:0x0]
01:51:44:[12925.005604] LustreError: 8149:0:(mdt_coordinator.c:1505:mdt_hsm_update_request_state()) lustre-MDT0000: Cannot find running request for cookie 0x59af5427 on fid=[0x200000403:0x83:0x0]
01:51:44:[13034.210227] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'[0x200000404:0xd3:0x0]'.*action='RESTORE'/ {print $13}' | cut -f2 -d=
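
One way to dig into the mismatch is to pull the coordinator's entries for the cookie and for both FIDs involved (the request FID and the volatile data FID from the copytool log). A small diagnostic snippet reusing the values from the messages above (purely illustrative):

# Look up the coordinator's record for the cookie from the error messages.
COOKIE=0x59af5427
lctl get_param -n mdt.lustre-MDT0000.hsm.actions | grep -F "$COOKIE"
# See which entries reference the request FID vs. the volatile data FID.
lctl get_param -n mdt.lustre-MDT0000.hsm.actions |
        grep -F -e '0x200000404:0xd3:0x0' -e '0x200000403:0x83:0x0'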

Logs for these failures are at
master:
https://testing.hpdd.intel.com/test_sets/13c63484-92ab-11e7-b67f-5254006e85c2

b2_10:
https://testing.hpdd.intel.com/test_sets/95218046-7f6f-11e7-96e0-5254006e85c2



 Comments   
Comment by Tim Day [ 07/Feb/23 ]

+1 https://review.whamcloud.com/c/fs/lustre-release/+/49916, https://testing.whamcloud.com/test_sets/26377a4b-65a2-4225-8e79-7875779b5cd6

This happened on 251. This seems to occur very sporadically, affecting different tests with very similar symptoms.

Comment by Tim Day [ 07/Feb/23 ]

See:

https://jira.whamcloud.com/browse/LU-10440

https://jira.whamcloud.com/browse/LU-8136

https://jira.whamcloud.com/browse/LU-4093

https://jira.whamcloud.com/browse/LU-3985

https://jira.whamcloud.com/browse/LU-3863

https://jira.whamcloud.com/browse/LU-9300

https://jira.whamcloud.com/browse/LU-4652

https://jira.whamcloud.com/browse/LU-4125

https://jira.whamcloud.com/browse/LU-9306

and many more. I don't know if there is already a master ticket to track these (or if Jira even has a mechanism like that).
