Test failure on test suite replay-vbr, subtest test_7c

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • Lustre 2.2.0
    • Lustre 2.2.0, Lustre 2.1.1
    • None
    • 3
    • 6473

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c95fc5c2-4c42-11e1-bd50-5254004bbbd3.

      The sub-test test_7c failed with the following error:

      Test 7c.2 failed

      Info required for matching: replay-vbr 7c

          Activity

            [LU-1060] Test failure on test suite replay-vbr, subtest test_7c
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.2.0 [ 10082 ]
            Resolution New: Duplicate [ 3 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Duplicate of LU-966.

            pjones Peter Jones made changes -
            Summary Original: 2.1<->2.1.55 Test failure on test suite replay-vbr, subtest test_7c New: Test failure on test suite replay-vbr, subtest test_7c
            bobijam Zhenyu Xu added a comment -

            The patch that lets the VBR version check handle replay of a non-existent object is posted at http://review.whamcloud.com/2149

            Description: For replay cases, mdt_version_get_check will check for a non-existent MDT object and evict clients accordingly, but mdt_object_find will not set exp_vbr_failed and will not evict the faulty client.
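
Editorial illustration: the difference described in the comment above can be modelled with a small toy program. This is only a sketch under stated assumptions, not Lustre code: every `toy_*` name, the struct layouts, and the error codes are hypothetical stand-ins chosen for illustration. It shows how a lookup that returns early for a missing object never reaches the version check, so the flag modelling exp_vbr_failed stays clear, while a lookup that always runs the version check marks the export as failed.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; NOT the real Lustre structures. */
struct toy_export {
    bool vbr_failed;            /* models exp_vbr_failed */
};

struct toy_object {
    bool exists;
    unsigned long version;
};

/* Models a replay-time version check: a missing object or a version
 * mismatch marks the export as failed. The error code is illustrative. */
static int toy_version_check(struct toy_export *exp,
                             const struct toy_object *obj,
                             unsigned long expected, bool replay)
{
    if (!replay)
        return 0;
    if (!obj->exists || obj->version != expected) {
        exp->vbr_failed = true;
        return -EOVERFLOW;
    }
    return 0;
}

/* Early-exit lookup: a missing object returns an error before any
 * version check runs, so vbr_failed is never set. */
static int toy_lookup_early_exit(struct toy_export *exp,
                                 const struct toy_object *obj,
                                 unsigned long expected, bool replay)
{
    if (!obj->exists)
        return -ENOENT;
    return toy_version_check(exp, obj, expected, replay);
}

/* Lookup that always runs the version check first. */
static int toy_lookup_checked(struct toy_export *exp,
                              const struct toy_object *obj,
                              unsigned long expected, bool replay)
{
    int rc = toy_version_check(exp, obj, expected, replay);

    if (rc)
        return rc;
    return obj->exists ? 0 : -ENOENT;
}

/* Returns 1 when only the checked path flags the export. */
static int toy_demo(void)
{
    struct toy_export early = { false }, checked = { false };
    struct toy_object missing = { false, 0 };

    toy_lookup_early_exit(&early, &missing, 5, true);
    toy_lookup_checked(&checked, &missing, 5, true);
    return !early.vbr_failed && checked.vbr_failed;
}
```

In this toy model, replaying against a missing object through `toy_lookup_early_exit` leaves `vbr_failed` clear, which corresponds to the faulty client not being evicted.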

            yong.fan nasf (Inactive) added a comment -

            Removing the LU-966 patch outright may not be the best solution. If you do not like my patch for LU-1060, we can fix the problem case by case. I think Bobijam is working on such a patch to remove the side effects of his LU-966 patch.

            tappro Mikhail Pershin added a comment -

            Right, your patch will cover some cases, but it is just a quick fix that hides the bad effects of the previous wrong patch, and that is not the way we should go. It hides the side effects but does not fix the root cause; moreover, it does not fix the broken VBR, which can cause unneeded evictions after LU-966. The right way is to step back to the point where all the issues appeared: LU-966.

            Basically, we need to revert that patch and apply its first version, which just replaces assertions with error checks; it is straightforward and easy to follow. The idea of doing that early in MDT was wrong and we missed it; any further attempts to fix it in MDT will add more complexity and more checks there. I have already made a patch:
            http://review.whamcloud.com/#change,2148

            My other worry is the test set used for master review testing. I do not understand why it omits replay-vbr and runtests, which are pretty good tests. LU-1060 appeared right after LU-966 landed and nobody noticed. That is out of scope for this bug, though.

            tappro Mikhail Pershin made changes -
            Link New: This issue is related to LU-966 [ LU-966 ]

            tappro Mikhail Pershin added a comment -

            LU-1060 is caused by the improper fix for LU-966.

            yong.fan nasf (Inactive) added a comment -

            Right, so my patch works well to remove the side effects of the LU-966 patch. Otherwise, we have to fix all the related points one by one, and anyone who adds new points in the future has to consider this again.

            So, what is your idea?

            tappro Mikhail Pershin added a comment -

            This is a result of the LU-966 patch. Unfortunately, nobody noticed that it ruins VBR recovery, because mdt_object_find() may exit early without any VBR checking.

            People

              yong.fan nasf (Inactive)
              maloo Maloo
              Votes: 0
              Watchers: 4
