[LU-6223] HSM recovery needs more tests and fixes Created: 09/Feb/15 Updated: 21/Jan/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mikhail Pershin | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | patch |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 17410 |
| Description |
|
A simple test which simulates server failure reveals HSM recovery issues:

test_17() {
	# test needs a running copytool
	copytool_setup
	mkdir -p $DIR/$tdir
	local f=$DIR/$tdir/$tfile
	local fid=$(copy_file /etc/hosts $f)
	replay_barrier $SINGLEMDS
	$LFS hsm_archive $f || error "archive of $f failed"
	fail $SINGLEMDS
	wait_request_state $fid ARCHIVE SUCCEED
	$LFS hsm_release $f || error "release of $f failed"
	replay_barrier $SINGLEMDS
	$LFS hsm_restore $f || error "restore of $f failed"
	fail $SINGLEMDS
	wait_request_state $fid RESTORE SUCCEED
	echo -n "Verifying file state: "
	check_hsm_flags $f "0x00000009"
	diff -q /etc/hosts $f || error "Restored file differs"
	copytool_cleanup
}
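If this test lands in sanity-hsm.sh, it should be runnable standalone through the test framework's usual ONLY filter; the script path and environment variable below are the standard conventions, not something specified in this ticket:

ONLY=17 bash lustre/tests/sanity-hsm.sh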
The test failed on the first server failure. Logs:

00000004:00000040:0.0:1423319087.106764:0:26715:0:(mdd_object.c:1599:mdd_object_make_hint()) [0x200002b10:0x6:0x0] eadata (null) len 0
00000004:00001000:0.0:1423319087.106773:0:26715:0:(lod_object.c:3229:lod_ah_init()) final striping: # 1 stripes, sz 1048576 from
00000001:00000002:0.0:1423319087.106779:0:26715:0:(linkea.c:136:linkea_add_buf()) New link_ea name '.:VOLATILE:0000:6B8B4567' is added
00020000:00020000:0.0:1423319087.106788:0:26715:0:(lod_qos.c:1715:lod_qos_parse_config()) lustre-MDT0000-mdtlov: unrecognized magic 0

The "unrecognized magic 0" message from lod_qos_parse_config() is the reason for the test failure at this stage. |
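For reference, debug lines like these come from the Lustre kernel debug buffer; after a failed run they can be dumped on the MDS with the standard lctl debug-kernel command (the output path here is just an example):

do_facet $SINGLEMDS lctl dk > /tmp/mds-debug.log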
| Comments |
| Comment by Mikhail Pershin [ 13/Feb/15 ] |
|
Another problem to resolve in the context of this ticket is multiple transactions inside a single request. This should be handled properly during recovery. |
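The ticket does not include a reproducer for the multi-transaction case, and simulating it exactly would need server-side instrumentation. At the test-script level, the closest stress is queuing several HSM modification requests across a replay barrier so that recovery has to replay more than one of them. The sketch below reuses the helpers from test_17 above; the test name and file count are made up:

test_17a() { # hypothetical variant sketch, not part of this ticket
	# test needs a running copytool
	copytool_setup
	mkdir -p $DIR/$tdir
	local i
	local fids
	for i in 1 2 3; do
		fids[$i]=$(copy_file /etc/hosts $DIR/$tdir/$tfile.$i)
	done
	replay_barrier $SINGLEMDS
	# queue several archive requests so more than one
	# transaction's worth of state must be replayed
	for i in 1 2 3; do
		$LFS hsm_archive $DIR/$tdir/$tfile.$i ||
			error "archive of $tfile.$i failed"
	done
	fail $SINGLEMDS
	for i in 1 2 3; do
		wait_request_state ${fids[$i]} ARCHIVE SUCCEED
	done
	copytool_cleanup
}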
| Comment by Bruno Faccini (Inactive) [ 25/Feb/15 ] |
|
Hello Mike, |
| Comment by Mikhail Pershin [ 26/Feb/15 ] |
|
Hi Bruno, yes, this is about specific HSM recovery tests; see the example above, test_17. It shows that HSM archive/restore cannot survive a server failover, though they are expected to. Making this test pass is the first task in the context of this ticket; after that we have to think about other recovery tests for HSM. I think we have to check all HSM modification requests for both the replay and resend cases. |
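For the resend side, the pattern used in the existing recovery tests is to drop one reply with a one-shot fail_loc so the client is forced to resend the request. A sketch of what that could look like for hsm_archive follows; the test name is hypothetical, and whether fail_loc 0x119 (OBD_FAIL_MDS_REINT_NET_REP, used elsewhere for reint resends) also covers the HSM RPCs is an assumption that would need checking against obd_support.h:

test_17b() { # hypothetical resend check, not part of this ticket
	# test needs a running copytool
	copytool_setup
	mkdir -p $DIR/$tdir
	local f=$DIR/$tdir/$tfile
	local fid=$(copy_file /etc/hosts $f)
	# drop one reply on the MDS so the client must resend;
	# the 0x80000000 bit makes the fail_loc one-shot
	do_facet $SINGLEMDS lctl set_param fail_loc=0x80000119
	$LFS hsm_archive $f || error "archive of $f failed"
	wait_request_state $fid ARCHIVE SUCCEED
	do_facet $SINGLEMDS lctl set_param fail_loc=0
	copytool_cleanup
}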
| Comment by Bruno Faccini (Inactive) [ 26/Feb/15 ] |
|
Ok, cool and thanks for the clarification! |
| Comment by Andreas Dilger [ 25/Aug/15 ] |
|
Hi Bruno, any progress with this ticket? As a starting point, could you please submit a patch with the above test_17 to see what is failing and what needs to be fixed. |
| Comment by Gerrit Updater [ 28/Aug/15 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/16125 |
| Comment by Mikhail Pershin [ 22/Jul/18 ] |
|
This ticket has stayed open for quite a long time, but I don't see that this problem was resolved in any other way. Could someone who is working on HSM now review it and decide what to do? |