Details

    • Type: Technical task
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.0
    • Patches submitted to autotest
    • 9548

    Description

      from https://maloo.whamcloud.com/test_sets/0afc2c56-fc86-11e2-8ce2-52540035b04c

      sanity-hsm test_21 seems to be failing a lot right now.
      "Wrong block number" is one of the errors seen.

      test_21

          Error: 'wrong block number'
          Failure Rate: 33.00% of last 100 executions [all branches]
      
      == sanity-hsm test 21: Simple release tests == 23:18:20 (1375510700)
      2+0 records in
      2+0 records out
      2097152 bytes (2.1 MB) copied, 0.353933 s, 5.9 MB/s
       sanity-hsm test_21: @@@@@@ FAIL: wrong block number 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4202:error_noexit()
      

    Activity

    [LU-3700] sanity-hsm test_21 Error: 'wrong block number'

            adilger Andreas Dilger added a comment -

            Nathaniel, can you please submit a separate patch to disable this test for ZFS only? That can be tested and possibly landed in parallel with 8467, if that does not give us any relief.
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Patch is at http://review.whamcloud.com/8467.
            But I wonder how, even for ZFS, a released file's st_blocks can be anything other than 1, after my patch #7776 forced it as part of LU-3864.

            utopiabound Nathaniel Clark added a comment -

            I'll work up and test a patch per Andreas's comment.

            adilger Andreas Dilger added a comment -

            It would make sense to me that "lfs hsm_release" would cause all of the DLM locks to be revoked from the client, so any stat from the client would return st_blocks == 1. This should be visible in the debug logs, if this test is run with at least +dlmtrace enabled.

            I think in the short term it makes sense to fix sanity-hsm.sh test_21 to enable full debug for this test (using debugsave() and debugrestore()), and print the actual block number that is returned. Maybe it is as simple as ZFS returning 2 with an external xattr or something, which might even happen with ldiskfs? Probably it makes sense to also allow some small margin, like 5 blocks or so. This test is also bad because there are two places that print "wrong block number", and it isn't even clear which one is failing.
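            As a sketch of that short-term change, the debug handling inside test_21() could look like the following, using the existing debugsave()/debugrestore() helpers from test-framework.sh (the full-debug mask and the echo line are assumptions for illustration):

                    # save the current debug mask, then enable full debugging
                    # (which includes +dlmtrace) for just this test
                    debugsave
                    $LCTL set_param debug=-1

                    # print the actual block count instead of only pass/fail
                    blocks=$(stat -c "%b" $f)
                    echo "$f: st_blocks after release is $blocks"

                    # restore the saved debug mask when the test is done
                    debugrestore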

            I think it makes sense to submit a patch to change this immediately to the following:

                    local fid=$(make_small $f)
                    local orig_size=$(stat -c "%s" $f)
                    local orig_blocks=$(stat -c "%b" $f)

                    check_hsm_flags $f "0x00000000"
                    $LFS hsm_archive $f || error "could not archive file"
                    wait_request_state $fid ARCHIVE SUCCEED

                    # archiving must not change what the client sees
                    local blocks=$(stat -c "%b" $f)
                    [ $blocks -eq $orig_blocks ] || error "$f: wrong blocks after archive: $blocks != $orig_blocks"
                    local size=$(stat -c "%s" $f)
                    [ $size -eq $orig_size ] || error "$f: wrong size after archive: $size != $orig_size"

                    # Release and check states
                    # (0x0000000d = exists + archived + released)
                    $LFS hsm_release $f || error "$f: could not release file"
                    check_hsm_flags $f "0x0000000d"

                    # after release the blocks are freed (small margin allowed),
                    # but the apparent size must be preserved
                    blocks=$(stat -c "%b" $f)
                    [ $blocks -le 5 ] || error "$f: too many blocks after release: $blocks > 5"
                    size=$(stat -c "%s" $f)
                    [ $size -eq $orig_size ] || error "$f: wrong size after release: $size != $orig_size"

            Maybe this will allow ZFS to pass, but even if it doesn't, we will have more information to debug the problem.
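            For iterating on such a change, the test framework's ONLY variable restricts a run to a single subtest; a usage sketch, assuming the installed test path shown in the trace above and an already configured test environment:

                    # run only test_21 of the suite
                    ONLY=21 bash /usr/lib64/lustre/tests/sanity-hsm.sh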


            bfaccini Bruno Faccini (Inactive) added a comment -

            Could it be that there is some timing delay, more likely with ZFS, between hsm_release and st_blocks becoming 1?
            I will set up a ZFS platform and run sanity-hsm/test_21 in a loop to reproduce.
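            If such a delay exists, a polling check along the following lines would tolerate it; this is only a sketch, and the 10-second bound is an arbitrary assumption:

                    # poll st_blocks after release instead of checking it once,
                    # to tolerate a possibly delayed update on the client
                    for i in $(seq 1 10); do
                            [ "$(stat -c '%b' $f)" -le 5 ] && break
                            sleep 1
                    done
                    blocks=$(stat -c "%b" $f)
                    [ $blocks -le 5 ] || error "$f: st_blocks still $blocks after release"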
            adilger Andreas Dilger added a comment -

            Failed again on ZFS: https://maloo.whamcloud.com/test_sessions/58a20a92-5759-11e3-8d5c-52540035b04c

            utopiabound Nathaniel Clark added a comment -

            This has been happening on ZFS pretty regularly:
            https://maloo.whamcloud.com/test_sets/5af74a0a-575e-11e3-8d5c-52540035b04c
            https://maloo.whamcloud.com/test_sets/6b9c8f22-5741-11e3-a296-52540035b04c

            bfaccini Bruno Faccini (Inactive) added a comment -

            To be re-opened in case of recurrence.

            bfaccini Bruno Faccini (Inactive) added a comment -

            I am not able to reproduce the problem with current master, even by running sanity-hsm/test_21 in a loop.

            BTW, according to the Maloo reports, test_21 failures for 'wrong block number' stopped around Aug. 13th. This seems to match the landing of the patch for LU-3561, which brings "real" HSM features (a copytool, and use of the lfs hsm commands instead of setting HSM flags) into the tests and the corresponding tools testing.

            So my strong assumption is that this ticket can be closed, as it is no longer relevant.
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            "stat -c %b <file>" failed to return 0 after "lfs hsm_set --archived --exist <file>".

            I am currently investigating the test's Lustre debug logs, but they are missing the HSM debug traces ...

            Maybe the sanity-hsm tests, currently modified to mimic future copytool behavior, also need to use commands that wait for HSM actions to complete, like "wait_request_state $fid ARCHIVE SUCCEED", when "lfs hsm_set --archived --exist <file>" is used as a replacement for "lfs hsm_archive <file>". See the sketch after this comment.

            I will try to set up a platform to reproduce the problem, with HSM debug traces enabled on the client and MDS VMs, and run the sanity-hsm tests in a loop.
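            A minimal sketch of such a wait, polling "lfs hsm_state" output rather than relying on "lfs hsm_set" taking effect synchronously; wait_archived is a hypothetical name, and error() is the existing helper from test-framework.sh:

                    wait_archived() {
                            local file=$1
                            local i
                            for i in $(seq 1 30); do
                                    # hsm_state prints flag names, e.g. "exists archived"
                                    $LFS hsm_state $file | grep -q archived && return 0
                                    sleep 1
                            done
                            error "$file: never reached the archived state"
                    }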

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: keith Keith Mannthey (Inactive)
              Votes: 0
              Watchers: 9