[LU-6223] HSM recovery needs more tests and fixes Created: 09/Feb/15  Updated: 21/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mikhail Pershin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-5939 Error: trying to overwrite bigger tra... Resolved
is related to LU-6244 Handle multiple transactions produced... Closed
Severity: 3
Rank (Obsolete): 17410

 Description   

The recent ticket LU-5939 revealed that HSM requests were not participating in recovery at all, and that this had gone unnoticed all that time. This means there is a lack of tests for HSM recovery cases.

A simple test that simulates a server failure reveals the HSM recovery issues.
Test:

test_17() {
	# test needs a running copytool
	copytool_setup

	mkdir -p $DIR/$tdir
	local f=$DIR/$tdir/$tfile
	local fid=$(copy_file /etc/hosts $f)

	# verify that an ARCHIVE request in flight survives an MDS failover
	replay_barrier $SINGLEMDS
	$LFS hsm_archive $f || error "archive of $f failed"
	fail $SINGLEMDS
	wait_request_state $fid ARCHIVE SUCCEED

	$LFS hsm_release $f || error "release of $f failed"

	# verify that a RESTORE request in flight survives an MDS failover
	replay_barrier $SINGLEMDS
	$LFS hsm_restore $f || error "restore of $f failed"
	fail $SINGLEMDS
	wait_request_state $fid RESTORE SUCCEED

	echo -n "Verifying file state: "
	check_hsm_flags $f "0x00000009"

	diff -q /etc/hosts $f || error "Restored file differs"

	copytool_cleanup
}
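
For reference, in sanity-hsm.sh a test function only runs once it has been registered through the framework's run_test helper; a minimal sketch of how the test above could be hooked in (the test number and summary are taken from the snippet and may need to change to avoid clashing with existing tests):

# register the test with the Lustre test framework (illustrative number/summary)
run_test 17 "HSM archive/restore requests should survive MDS failover"
# a single test can then be run standalone, e.g.: ONLY=17 bash sanity-hsm.sh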

The test fails on the first server failure:
LustreError: 3248:0:(mdt_coordinator.c:985:mdt_hsm_cdt_start()) lustre-MDT0000: cannot take the layout locks needed for registered restore: -2

Logs:

00000004:00000040:0.0:1423319087.106764:0:26715:0:(mdd_object.c:1599:mdd_object_make_hint()) [0x200002b10:0x6:0x0] eadata (null) len 0
00000004:00001000:0.0:1423319087.106773:0:26715:0:(lod_object.c:3229:lod_ah_init()) final striping: # 1 stripes, sz 1048576 from 
00000001:00000002:0.0:1423319087.106779:0:26715:0:(linkea.c:136:linkea_add_buf()) New link_ea name '.:VOLATILE:0000:6B8B4567' is added
00020000:00020000:0.0:1423319087.106788:0:26715:0:(lod_qos.c:1715:lod_qos_parse_config()) lustre-MDT0000-mdtlov: unrecognized magic 0

That is the reason for the test failure at this stage.



 Comments   
Comment by Mikhail Pershin [ 13/Feb/15 ]

Another problem to resolve in the context of this ticket is multiple transactions inside a single request. This should be handled properly.
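
As an illustration, a hypothetical follow-on test (the name, file set and checks below are assumptions, reusing the helpers from test_17 above) could issue a single hsm_archive command covering several files inside a replay barrier, on the assumption that such a request may span more than one transaction on the MDT:

test_17a() {
	# test needs a running copytool
	copytool_setup

	mkdir -p $DIR/$tdir
	local f1=$DIR/$tdir/${tfile}_1
	local f2=$DIR/$tdir/${tfile}_2
	local fid1=$(copy_file /etc/hosts $f1)
	local fid2=$(copy_file /etc/passwd $f2)

	replay_barrier $SINGLEMDS
	# one hsm_archive command for several files; its processing on the MDT
	# is assumed to produce more than one transaction (cf. LU-6244)
	$LFS hsm_archive $f1 $f2 || error "archive of $f1 $f2 failed"
	fail $SINGLEMDS

	# after replay both archive operations must still reach SUCCEED
	wait_request_state $fid1 ARCHIVE SUCCEED
	wait_request_state $fid2 ARCHIVE SUCCEED

	copytool_cleanup
}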

Comment by Bruno Faccini (Inactive) [ 25/Feb/15 ]

Hello Mike,
I am starting to work on this ticket, but it is unclear to me what specifically needs to be addressed here, given the work being done as part of LU-5939/LU-6244.
Is it only to add specific HSM recovery testing (including the HSM-specific case of multiple transactions in a single request)?

Comment by Mikhail Pershin [ 26/Feb/15 ]

Hi Bruno,

yes, this is about specific HSM recovery tests; see the example above, test_17. It shows that HSM archive/restore cannot survive a server failover, although they are expected to. The first task in the context of this ticket is to make this test pass; then we have to think about other recovery tests for HSM. I think we have to check all HSM modification requests for both the replay and resend cases.
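
As an illustration of the resend case, a variant of test_17 could force the client to resend the HSM request by dropping the first MDS reply. A minimal sketch only: the fail_loc value below is a placeholder (0x80000119 is the one-shot reint-reply drop used elsewhere in the test suite), and the actual OBD_FAIL_* code for dropping an MDS_HSM_REQUEST reply would need to be identified or added:

test_17b() {
	# placeholder fail_loc, see note above
	local drop_reply_loc=0x80000119

	# test needs a running copytool
	copytool_setup

	mkdir -p $DIR/$tdir
	local f=$DIR/$tdir/$tfile
	local fid=$(copy_file /etc/hosts $f)

	# drop the first reply on the MDS so the client has to resend
	do_facet $SINGLEMDS "$LCTL set_param fail_loc=$drop_reply_loc"
	$LFS hsm_archive $f || error "archive of $f failed"
	do_facet $SINGLEMDS "$LCTL set_param fail_loc=0"

	# the resent request must neither be lost nor executed twice
	wait_request_state $fid ARCHIVE SUCCEED

	copytool_cleanup
}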

Comment by Bruno Faccini (Inactive) [ 26/Feb/15 ]

Ok, cool and thanks for the clarification!

Comment by Andreas Dilger [ 25/Aug/15 ]

Hi Bruno, any progress with this ticket? As a starting point, could you please submit a patch with the above test_17 to see what is failing and what needs to be fixed.

Comment by Gerrit Updater [ 28/Aug/15 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/16125
Subject: LU-6223 tests: recovery of HSM requests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6ba94a1f7ccaf2841daee19cfffd3150104f1d02

Comment by Mikhail Pershin [ 22/Jul/18 ]

This ticket has stayed open for quite a long time, but I don't see that this problem has been resolved in any other way. Could someone who is working on HSM now review it and decide what to do?
