Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1060

Test failure on test suite replay-vbr, subtest test_7c

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • Lustre 2.2.0
    • Lustre 2.2.0, Lustre 2.1.1
    • None
    • 3
    • 6473

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c95fc5c2-4c42-11e1-bd50-5254004bbbd3.

      The sub-test test_7c failed with the following error:

      Test 7c.2 failed

      Info required for matching: replay-vbr 7c

      Attachments

        Issue Links

          Activity

            [LU-1060] Test failure on test suite replay-vbr, subtest test_7c
            pjones Peter Jones added a comment -

            duplicate of lu-966

            pjones Peter Jones added a comment - duplicate of lu-966
            bobijam Zhenyu Xu added a comment -

            the patch to let vbr version check replay non-exist object is posted at http://review.whamcloud.com/2149

            description: For replay cases, mdt_version_get_check will check non-exist mdt object and evict clients accordingly, but mdt_object_find will not set exp_vbr_failed and will not evict the faulty client.

            bobijam Zhenyu Xu added a comment - the patch to let vbr version check replay non-exist object is posted at http://review.whamcloud.com/2149 description: For replay cases, mdt_version_get_check will check non-exist mdt object and evict clients accordingly, but mdt_object_find will not set exp_vbr_failed and will not evict the faulty client.

            Removing the patch for LU-966 directly may be not the best solution. If you do not like my patch for LU-1060, we can fix it case by case. I think Bobijam is working on such patch to erase the side-effect of his LU-966 patch.

            yong.fan nasf (Inactive) added a comment - Removing the patch for LU-966 directly may be not the best solution. If you do not like my patch for LU-1060 , we can fix it case by case. I think Bobijam is working on such patch to erase the side-effect of his LU-966 patch.

            Right, your patch will work to cover some case but it is just fast fix to hide bad effects of previous wrong patch, that is the way we shouldn't go for sure. It hides side-effects but doesn't fix the root cause, moreover it doesn't fix broken VBR which can cause unneeded evictions after LU-966. The right way will be step back to the point where all issues appeared - LU-966.

            Basically we need to revert that patch and apply its first version - just replace assertions with error checks, it straight-forward and easy to follow. The idea to make that early in MDT was wrong and we missed that, any further attempts to fix that in MDT will cause more complexity there and more checks. I've made patch already:
            http://review.whamcloud.com/#change,2148

            Another my worry is about test set for master review testing, I don't get why it misses replay-vbr and runtests which are pretty good tests. LU-1060 appeared right after LU-966 landing and nobody noticed that. This is out of scope this bug though.

            tappro Mikhail Pershin added a comment - Right, your patch will work to cover some case but it is just fast fix to hide bad effects of previous wrong patch, that is the way we shouldn't go for sure. It hides side-effects but doesn't fix the root cause, moreover it doesn't fix broken VBR which can cause unneeded evictions after LU-966 . The right way will be step back to the point where all issues appeared - LU-966 . Basically we need to revert that patch and apply its first version - just replace assertions with error checks, it straight-forward and easy to follow. The idea to make that early in MDT was wrong and we missed that, any further attempts to fix that in MDT will cause more complexity there and more checks. I've made patch already: http://review.whamcloud.com/#change,2148 Another my worry is about test set for master review testing, I don't get why it misses replay-vbr and runtests which are pretty good tests. LU-1060 appeared right after LU-966 landing and nobody noticed that. This is out of scope this bug though.

            LU-1060 is caused by LU-966 improper fix.

            tappro Mikhail Pershin added a comment - LU-1060 is caused by LU-966 improper fix.

            Right, so to erase the side-affect of LU-966 patch, my patch works well. Otherwise, we have to fix up all related points one by one, and if someone add new points in the future, he/she has to consider again.

            So, what's your idea?

            yong.fan nasf (Inactive) added a comment - Right, so to erase the side-affect of LU-966 patch, my patch works well. Otherwise, we have to fix up all related points one by one, and if someone add new points in the future, he/she has to consider again. So, what's your idea?

            this is result of LU-966 patch, unfortunately nobody noticed it ruins VBR recovery because mdt_object_find() may exits early without any VBR checking.

            tappro Mikhail Pershin added a comment - this is result of LU-966 patch, unfortunately nobody noticed it ruins VBR recovery because mdt_object_find() may exits early without any VBR checking.

            It is not interoperability issue, it can be reproduced against latest master branch by replay-vbr test_7c.

            yong.fan nasf (Inactive) added a comment - It is not interoperability issue, it can be reproduced against latest master branch by replay-vbr test_7c.

            Looking through server syslog:

            Lustre: 31370:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
              req@ffff88031a8b5000 x1392520277794082/t657129996304(657129996304) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 472/424 e 0 to 0 dl 1328013402 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
            Lustre: 31370:0:(client.c:2530:ptlrpc_replay_interpret()) Skipped 3 previous similar messages
            Lustre: 31370:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1392520277794320 sent from lustre-MDT0000-mdc-ffff880307c27800 to NID 192.168.4.134@o2ib has timed out for slow reply: [sent 1328013368] [real_sent 1328013368] [current 1328013402] [deadline 34s] [delay 0s]  req@ffff880316c8f800 x1392520277794320/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 192/192 e 0 to 1 dl 1328013402 ref 1 fl Rpc:X/ffffffff/ffffffff rc 0/-1
            Lustre: 31370:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 37 previous similar messages
            Lustre: 31370:0:(import.c:1160:completed_replay_interpret()) lustre-MDT0000-mdc-ffff880307c27800: version recovery fails, reconnecting
            LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            

            That shows the VBR works fine, mismatch was detected and recovery fails.

            but a bit later:

            Lustre: lustre-MDT0000-mdc-ffff88024b596400: Connection to service lustre-MDT0000 via nid 192.168.4.134@o2ib was lost; in progress operations using this service will wait for recovery to complete.
            Lustre: Skipped 9 previous similar messages
            Lustre: 31370:0:(import.c:852:ptlrpc_connect_interpret()) MGS@192.168.4.134@o2ib changed server handle from 0xa0636bc0947b83ea to 0xa0636bc0947b8a57
            LustreError: 31370:0:(client.c:2573:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff8802c7375800 x1392520277794470/t661424963601(661424963601) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 472/424 e 0 to 0 dl 1328013526 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Lustre: lustre-MDT0000-mdc-ffff88024b596400: Connection restored to service lustre-MDT0000 using nid 192.168.4.134@o2ib
            

            I have no idea what is that so far

            tappro Mikhail Pershin added a comment - Looking through server syslog: Lustre: 31370:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay req@ffff88031a8b5000 x1392520277794082/t657129996304(657129996304) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 472/424 e 0 to 0 dl 1328013402 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1 Lustre: 31370:0:(client.c:2530:ptlrpc_replay_interpret()) Skipped 3 previous similar messages Lustre: 31370:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1392520277794320 sent from lustre-MDT0000-mdc-ffff880307c27800 to NID 192.168.4.134@o2ib has timed out for slow reply: [sent 1328013368] [real_sent 1328013368] [current 1328013402] [deadline 34s] [delay 0s] req@ffff880316c8f800 x1392520277794320/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 192/192 e 0 to 1 dl 1328013402 ref 1 fl Rpc:X/ffffffff/ffffffff rc 0/-1 Lustre: 31370:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 37 previous similar messages Lustre: 31370:0:( import .c:1160:completed_replay_interpret()) lustre-MDT0000-mdc-ffff880307c27800: version recovery fails, reconnecting LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail. That shows the VBR works fine, mismatch was detected and recovery fails. but a bit later: Lustre: lustre-MDT0000-mdc-ffff88024b596400: Connection to service lustre-MDT0000 via nid 192.168.4.134@o2ib was lost; in progress operations using this service will wait for recovery to complete. Lustre: Skipped 9 previous similar messages Lustre: 31370:0:( import .c:852:ptlrpc_connect_interpret()) MGS@192.168.4.134@o2ib changed server handle from 0xa0636bc0947b83ea to 0xa0636bc0947b8a57 LustreError: 31370:0:(client.c:2573:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff8802c7375800 x1392520277794470/t661424963601(661424963601) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 472/424 e 0 to 0 dl 1328013526 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1 Lustre: lustre-MDT0000-mdc-ffff88024b596400: Connection restored to service lustre-MDT0000 using nid 192.168.4.134@o2ib I have no idea what is that so far

            VBR detects ENOENT cases already and fails if some object is missing. For such object VBR count its version as ENOENT_VERSION and compare it with version in replay, FID_002 in your example. So there must be version mismatch. If that doesn't work for some reason, we need to find that reason. Keep in mind that this worked well for a long time but failed in 2.1<->2.1.55 case, so probably this is compatibility issue between lustre versions. First of all I'd check where that -ENOENT came from and why VBR checks missed it.

            tappro Mikhail Pershin added a comment - VBR detects ENOENT cases already and fails if some object is missing. For such object VBR count its version as ENOENT_VERSION and compare it with version in replay, FID_002 in your example. So there must be version mismatch. If that doesn't work for some reason, we need to find that reason. Keep in mind that this worked well for a long time but failed in 2.1<->2.1.55 case, so probably this is compatibility issue between lustre versions. First of all I'd check where that -ENOENT came from and why VBR checks missed it.

            People

              yong.fan nasf (Inactive)
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: