  1. Lustre
  2. LU-6223

HSM recovery needs more tests and fixes

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Severity:
      3
    • Rank (Obsolete):
      17410

      Description

      The recent ticket LU-5939 revealed that HSM requests were not participating in recovery at all, and that this went unnoticed the whole time. It also means that test coverage for HSM recovery cases is lacking.

      A simple test that simulates a server failure reveals HSM recovery issues.
      Test:

      test_17() {
      	# test needs a running copytool
      	copytool_setup
      
      	mkdir -p $DIR/$tdir
      	local f=$DIR/$tdir/$tfile
      	local fid=$(copy_file /etc/hosts $f)
      
      	replay_barrier $SINGLEMDS
      	$LFS hsm_archive $f || error "archive of $f failed"
      	fail $SINGLEMDS
      	wait_request_state $fid ARCHIVE SUCCEED
      
      	$LFS hsm_release $f || error "release of $f failed"
      
      	replay_barrier $SINGLEMDS
      	$LFS hsm_restore $f || error "restore of $f failed"
      	fail $SINGLEMDS
      	wait_request_state $fid RESTORE SUCCEED
      
      	echo -n "Verifying file state: "
      	check_hsm_flags $f "0x00000009"
      
      	diff -q /etc/hosts $f || error "Restored file differs"
      
      	copytool_cleanup
      }
      

      The test fails on the first server failure:
      LustreError: 3248:0:(mdt_coordinator.c:985:mdt_hsm_cdt_start()) lustre-MDT0000: cannot take the layout locks needed for registered restore: -2

      Logs:

      00000004:00000040:0.0:1423319087.106764:0:26715:0:(mdd_object.c:1599:mdd_object_make_hint()) [0x200002b10:0x6:0x0] eadata (null) len 0
      00000004:00001000:0.0:1423319087.106773:0:26715:0:(lod_object.c:3229:lod_ah_init()) final striping: # 1 stripes, sz 1048576 from 
      00000001:00000002:0.0:1423319087.106779:0:26715:0:(linkea.c:136:linkea_add_buf()) New link_ea name '.:VOLATILE:0000:6B8B4567' is added
      00020000:00020000:0.0:1423319087.106788:0:26715:0:(lod_qos.c:1715:lod_qos_parse_config()) lustre-MDT0000-mdtlov: unrecognized magic 0
      

      That is the reason for the test failure at this stage.
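
For readers unfamiliar with the value passed to check_hsm_flags in the test, the following is a minimal bash sketch that decodes an HSM flags mask into names. The helper name decode_hsm_flags is hypothetical and not part of the Lustre test framework. The bit values are assumed to match enum hsm_states in lustre_user.h, and the printed names are just the HS_* constants in lower case, not necessarily what lfs hsm_state prints.

```shell
#!/bin/bash
# Hypothetical helper, not part of test-framework.sh.
# Bit values are assumed to match enum hsm_states in lustre_user.h.
decode_hsm_flags() {
    local flags=$(( $1 ))    # accepts hex such as 0x00000009
    local names=()

    (( flags & 0x01 )) && names+=("exists")      # HS_EXISTS
    (( flags & 0x02 )) && names+=("dirty")       # HS_DIRTY
    (( flags & 0x04 )) && names+=("released")    # HS_RELEASED
    (( flags & 0x08 )) && names+=("archived")    # HS_ARCHIVED
    (( flags & 0x10 )) && names+=("norelease")   # HS_NORELEASE
    (( flags & 0x20 )) && names+=("noarchive")   # HS_NOARCHIVE
    (( flags & 0x40 )) && names+=("lost")        # HS_LOST

    # Print the set bits as a space-separated list.
    echo "${names[*]}"
}
```

Under these assumptions, decode_hsm_flags 0x00000009 prints "exists archived": after a successful restore the file should have an archived copy but no longer be released. A released file would carry 0x0000000d instead.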

              People

              • Assignee:
                WC Triage (wc-triage)
              • Reporter:
                Mikhail Pershin (tappro)
              • Votes:
                0
              • Watchers:
                17
