LU-6223: HSM recovery needs more tests and fixes

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • None
    • 3
    • 17410

    Description

      The recent ticket LU-5939 revealed that HSM requests were not participating in recovery at all, and that this went unnoticed the whole time. That means there is a lack of tests for HSM recovery cases.

      A simple test that simulates a server failure reveals HSM recovery issues.
      Test:

      test_17() {
      	# test needs a running copytool
      	copytool_setup
      
      	mkdir -p $DIR/$tdir
      	local f=$DIR/$tdir/$tfile
      	local fid=$(copy_file /etc/hosts $f)
      
      	# issue the archive request after the barrier, then fail the MDS
      	# so the request has to go through recovery
      	replay_barrier $SINGLEMDS
      	$LFS hsm_archive $f || error "archive of $f failed"
      	fail $SINGLEMDS
      	wait_request_state $fid ARCHIVE SUCCEED
      
      	$LFS hsm_release $f || error "release of $f failed"
      
      	# same scenario for the restore request
      	replay_barrier $SINGLEMDS
      	$LFS hsm_restore $f || error "restore of $f failed"
      	fail $SINGLEMDS
      	wait_request_state $fid RESTORE SUCCEED
      
      	echo -n "Verifying file state: "
      	check_hsm_flags $f "0x00000009"
      
      	# restored content must match the original source file
      	diff -q /etc/hosts $f || error "Restored file differs"
      
      	copytool_cleanup
      }
      

      The test failed on the first server failure:
      LustreError: 3248:0:(mdt_coordinator.c:985:mdt_hsm_cdt_start()) lustre-MDT0000: cannot take the layout locks needed for registered restore: -2

      Logs:

      00000004:00000040:0.0:1423319087.106764:0:26715:0:(mdd_object.c:1599:mdd_object_make_hint()) [0x200002b10:0x6:0x0] eadata (null) len 0
      00000004:00001000:0.0:1423319087.106773:0:26715:0:(lod_object.c:3229:lod_ah_init()) final striping: # 1 stripes, sz 1048576 from 
      00000001:00000002:0.0:1423319087.106779:0:26715:0:(linkea.c:136:linkea_add_buf()) New link_ea name '.:VOLATILE:0000:6B8B4567' is added
      00020000:00020000:0.0:1423319087.106788:0:26715:0:(lod_qos.c:1715:lod_qos_parse_config()) lustre-MDT0000-mdtlov: unrecognized magic 0
      

      That is the reason for the test failure at this stage.
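
For reference, the value 0x00000009 that test_17 passes to check_hsm_flags is a bitmask of HSM state flags. The following is a minimal sketch of decoding such a mask, assuming the bit values of enum hsm_states in lustre_user.h and flag names similar to those printed by lfs hsm_state. decode_hsm_flags is a hypothetical helper written for illustration; it is not part of the Lustre test framework.

```shell
#!/bin/bash
# Sketch only: decode an HSM state bitmask into flag names.
# Bit values are assumed from enum hsm_states (lustre_user.h).
# decode_hsm_flags is a hypothetical helper, not a test-framework function.

decode_hsm_flags() {
	local mask=$(( $1 ))	# accepts hex input such as 0x00000009
	local names=()
	local entry

	for entry in exists:0x01 dirty:0x02 released:0x04 archived:0x08 \
		     never_release:0x10 never_archive:0x20 lost:0x40; do
		# ${entry#*:} is the bit value, ${entry%%:*} is the flag name
		(( mask & ${entry#*:} )) && names+=("${entry%%:*}")
	done

	echo "${names[*]:-none}"
}

decode_hsm_flags 0x00000009	# prints: exists archived
```

Under these assumptions, the final check in test_17 expects the file to exist in the archive and to be archived, with the released bit cleared by the restore.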

          Activity

            tappro Mikhail Pershin added a comment - This ticket has stayed open for quite a long time, but I don't see that this problem was resolved in any other way. Could someone who is working on HSM now review it and decide what to do?

            gerrit Gerrit Updater added a comment - Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/16125
            Subject: LU-6223 tests: recovery of HSM requests
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6ba94a1f7ccaf2841daee19cfffd3150104f1d02

            adilger Andreas Dilger added a comment - Hi Bruno, any progress with this ticket? As a starting point, could you please submit a patch with the above test_17 to see what is failing and what needs to be fixed.

            bfaccini Bruno Faccini (Inactive) added a comment - Ok, cool and thanks for the clarification!

            tappro Mikhail Pershin added a comment - Hi Bruno, yes, this is about specific HSM recovery tests; see the example above, test_17. It shows that HSM archive/restore can't survive a server failover, though they are expected to. I expect making this test pass to be the first task in the context of this ticket; then we have to think about other recovery tests for HSM. I think we have to check all HSM modification requests for both the replay and resend cases.
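
One way to keep track of that coverage is to enumerate the request/recovery-mode combinations up front. The sketch below only prints such a matrix; it does not drive Lustre. The operation list (archive, release, restore, remove) is an assumption made for illustration and may not match the set of requests that actually need coverage. hsm_recovery_matrix is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: list the recovery cases to cover, one per line.
# The operation list below is an assumption, not taken from the ticket.
# hsm_recovery_matrix is a hypothetical helper, not a test-framework function.

hsm_recovery_matrix() {
	local ops="archive release restore remove"	# assumed HSM requests
	local modes="replay resend"			# cases named in the comment
	local op mode

	for op in $ops; do
		for mode in $modes; do
			echo "hsm_${op}:${mode}"
		done
	done
}

hsm_recovery_matrix
```

Each printed pair would map to one test case. test_17 above covers the replay side for archive and restore by using replay_barrier followed by fail; how a resend is induced for each request is left open here.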

            bfaccini Bruno Faccini (Inactive) added a comment - Hello Mike, I am starting to work on this ticket, but it is unclear to me what specifically needs to be addressed here, given the work being done as part of LU-5939/LU-6244. Is it only to add specific HSM recovery testing (including its particular use of multiple transactions in a single request)?

            tappro Mikhail Pershin added a comment - Another problem to resolve in the context of this ticket is multiple transactions inside a single request. This should be handled properly.

            People

              wc-triage WC Triage
              tappro Mikhail Pershin