Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
Description
The recent ticket LU-5939 revealed that HSM requests were not participating in recovery at all, and that this had been hidden all that time. This means there is a lack of tests for HSM recovery cases.
A simple test that simulates a server failure reveals the HSM recovery issues.
Test:
test_17() { # test needs a running copytool
	copytool_setup

	mkdir -p $DIR/$tdir
	local f=$DIR/$tdir/$tfile
	local fid=$(copy_file /etc/hosts $f)

	replay_barrier $SINGLEMDS
	$LFS hsm_archive $f || error "archive of $f failed"
	fail $SINGLEMDS
	wait_request_state $fid ARCHIVE SUCCEED

	$LFS hsm_release $f || error "release of $f failed"

	replay_barrier $SINGLEMDS
	$LFS hsm_restore $f || error "restore of $f failed"
	fail $SINGLEMDS
	wait_request_state $fid RESTORE SUCCEED

	echo -n "Verifying file state: "
	check_hsm_flags $f "0x00000009"

	diff -q /etc/hosts $f
	[[ $? -eq 0 ]] || error "Restored file differs"

	copytool_cleanup
}
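For reference, a test like this would normally be run through the standard Lustre test framework. A minimal sketch, assuming the usual lustre/tests layout and a configuration that allows MDS failover (the exact environment variables depend on the local setup; the copytool itself is started by copytool_setup inside the test):

	# run only this test from sanity-hsm.sh
	cd lustre/tests
	ONLY=17 sh sanity-hsm.sh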
The test failed on the first server failure:
LustreError: 3248:0:(mdt_coordinator.c:985:mdt_hsm_cdt_start()) lustre-MDT0000: cannot take the layout locks needed for registered restore: -2
Logs:
00000004:00000040:0.0:1423319087.106764:0:26715:0:(mdd_object.c:1599:mdd_object_make_hint()) [0x200002b10:0x6:0x0] eadata (null) len 0
00000004:00001000:0.0:1423319087.106773:0:26715:0:(lod_object.c:3229:lod_ah_init()) final striping: # 1 stripes, sz 1048576 from
00000001:00000002:0.0:1423319087.106779:0:26715:0:(linkea.c:136:linkea_add_buf()) New link_ea name '.:VOLATILE:0000:6B8B4567' is added
00020000:00020000:0.0:1423319087.106788:0:26715:0:(lod_qos.c:1715:lod_qos_parse_config()) lustre-MDT0000-mdtlov: unrecognized magic 0
That is the reason for the test failure at this stage (the -2 returned by mdt_hsm_cdt_start() is -ENOENT).
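For anyone picking this up, the coordinator and per-file HSM state can be inspected after the failover, roughly as below. This is only a sketch; the MDT device name and exact parameter paths are assumptions and may differ between Lustre versions and setups:

	# coordinator status on the MDS (enabled/disabled/stopped)
	lctl get_param mdt.lustre-MDT0000.hsm_control
	# HSM actions still registered with the coordinator
	lctl get_param mdt.lustre-MDT0000.hsm.actions
	# HSM flags of the test file as seen by the client
	lfs hsm_state $DIR/$tdir/$tfile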
This ticket has stayed open for quite a long time, and I don't see that this problem was resolved in any other way. Could someone who is working on HSM now review it and decide what to do?