  Lustre / LU-1060

Test failure on test suite replay-vbr, subtest test_7c

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: Lustre 2.2.0
    • Affects Version/s: Lustre 2.2.0, Lustre 2.1.1
    • None
    • 3
    • 6473

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c95fc5c2-4c42-11e1-bd50-5254004bbbd3.

      The sub-test test_7c failed with the following error:

      Test 7c.2 failed

      Info required for matching: replay-vbr 7c

      Attachments

        Issue Links

          Activity

            [LU-1060] Test failure on test suite replay-vbr, subtest test_7c

            Removing the patch for LU-966 directly may not be the best solution. If you do not like my patch for LU-1060, we can fix it case by case. I think Bobijam is working on such a patch to erase the side effects of his LU-966 patch.

            yong.fan nasf (Inactive) added a comment

            Right, your patch will cover some cases, but it is just a quick fix to hide the bad effects of the previous wrong patch, and that is not a path we should take. It hides the side effects but doesn't fix the root cause; moreover, it doesn't fix the broken VBR, which can cause unneeded evictions after LU-966. The right way is to step back to the point where all the issues appeared: LU-966.

            Basically we need to revert that patch and apply its first version, which just replaces the assertions with error checks; that is straightforward and easy to follow. The idea of doing that check early in the MDT was wrong and we missed it; any further attempts to fix it in the MDT will only add more complexity and more checks there. I've made a patch already:
            http://review.whamcloud.com/#change,2148
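
            As a minimal standalone sketch of what "replace assertions with error checks" looks like (made-up names, not the actual change 2148):

            #include <errno.h>
            #include <stdio.h>

            struct obj { int id; };

            /* Stand-in for an object lookup that can legitimately fail during replay. */
            static struct obj *obj_lookup(int id)
            {
                    static struct obj existing = { .id = 1 };
                    return id == existing.id ? &existing : NULL;
            }

            /* Old style: LASSERT(obj_lookup(id) != NULL) takes the whole server down
             * when a replayed object is missing.  New style: turn the assertion into
             * an error check and let the caller decide what a missing object means
             * for the replayed request. */
            static int handle_replay(int id)
            {
                    struct obj *o = obj_lookup(id);

                    if (o == NULL)
                            return -ENOENT;
                    return 0;
            }

            int main(void)
            {
                    printf("existing object: %d\n", handle_replay(1)); /* 0 */
                    printf("missing object:  %d\n", handle_replay(2)); /* -2 (-ENOENT) */
                    return 0;
            }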

            My other worry is about the test set used for master review testing; I don't see why it omits replay-vbr and runtests, which are pretty good tests. LU-1060 appeared right after LU-966 landed and nobody noticed. That is out of scope for this bug, though.

            tappro Mikhail Pershin added a comment

            LU-1060 is caused by the improper fix in LU-966.

            tappro Mikhail Pershin added a comment

            Right, so to erase the side effects of the LU-966 patch, my patch works well. Otherwise, we have to fix up all the related points one by one, and if someone adds new points in the future, he/she has to consider it all over again.

            So, what's your idea?

            yong.fan nasf (Inactive) added a comment

            This is a result of the LU-966 patch; unfortunately, nobody noticed that it ruins VBR recovery, because mdt_object_find() may exit early without doing any VBR checking.
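
            A simplified sketch of the failure shape being described here (hypothetical helpers, not the real mdt code): when the lookup path returns early for a missing object, the version check that VBR relies on never runs for that request.

            #include <errno.h>
            #include <stdint.h>
            #include <stdio.h>

            struct obj { int exists; uint64_t version; };

            static int version_check(const struct obj *o, uint64_t replay_version)
            {
                    return o->version == replay_version ? 0 : -EOVERFLOW;
            }

            /* Broken shape: bail out as soon as the object is missing, so the VBR
             * version check below is silently skipped and the replay is treated as
             * if nothing conflicted. */
            static int find_and_check(const struct obj *o, uint64_t replay_version)
            {
                    if (!o->exists)
                            return 0;               /* early exit, no VBR check */
                    return version_check(o, replay_version);
            }

            int main(void)
            {
                    struct obj missing = { .exists = 0, .version = 0 };

                    /* Prints 0 even though the replayed version can no longer match. */
                    printf("rc = %d\n", find_and_check(&missing, 7));
                    return 0;
            }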

            tappro Mikhail Pershin added a comment

            It is not an interoperability issue; it can be reproduced against the latest master branch by replay-vbr test_7c.

            yong.fan nasf (Inactive) added a comment

            Looking through the server syslog:

            Lustre: 31370:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
              req@ffff88031a8b5000 x1392520277794082/t657129996304(657129996304) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 472/424 e 0 to 0 dl 1328013402 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
            Lustre: 31370:0:(client.c:2530:ptlrpc_replay_interpret()) Skipped 3 previous similar messages
            Lustre: 31370:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1392520277794320 sent from lustre-MDT0000-mdc-ffff880307c27800 to NID 192.168.4.134@o2ib has timed out for slow reply: [sent 1328013368] [real_sent 1328013368] [current 1328013402] [deadline 34s] [delay 0s]  req@ffff880316c8f800 x1392520277794320/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 192/192 e 0 to 1 dl 1328013402 ref 1 fl Rpc:X/ffffffff/ffffffff rc 0/-1
            Lustre: 31370:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 37 previous similar messages
            Lustre: 31370:0:(import.c:1160:completed_replay_interpret()) lustre-MDT0000-mdc-ffff880307c27800: version recovery fails, reconnecting
            LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            

            That shows VBR working fine: the mismatch was detected and recovery failed.

            But a bit later:

            Lustre: lustre-MDT0000-mdc-ffff88024b596400: Connection to service lustre-MDT0000 via nid 192.168.4.134@o2ib was lost; in progress operations using this service will wait for recovery to complete.
            Lustre: Skipped 9 previous similar messages
            Lustre: 31370:0:(import.c:852:ptlrpc_connect_interpret()) MGS@192.168.4.134@o2ib changed server handle from 0xa0636bc0947b83ea to 0xa0636bc0947b8a57
            LustreError: 31370:0:(client.c:2573:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff8802c7375800 x1392520277794470/t661424963601(661424963601) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 472/424 e 0 to 0 dl 1328013526 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
            Lustre: lustre-MDT0000-mdc-ffff88024b596400: Connection restored to service lustre-MDT0000 using nid 192.168.4.134@o2ib
            

            I have no idea what that is so far.

            tappro Mikhail Pershin added a comment

            VBR already detects ENOENT cases and fails if some object is missing. For such an object VBR counts its version as ENOENT_VERSION and compares it with the version in the replay, FID_002 in your example. So there must be a version mismatch. If that doesn't work for some reason, we need to find that reason. Keep in mind that this worked well for a long time but failed in the 2.1<->2.1.55 case, so this is probably a compatibility issue between Lustre versions. First of all I'd check where that -ENOENT came from and why the VBR checks missed it.
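
            A self-contained sketch of the comparison described above (the helper names and the sentinel value are illustrative, not the exact Lustre symbols):

            #include <errno.h>
            #include <stdint.h>
            #include <stdio.h>

            /* Sentinel version assigned to objects that do not exist; the actual
             * value used by Lustre is not important for the illustration. */
            #define ENOENT_VERSION ((uint64_t)-1)

            struct obj { int exists; uint64_t version; };

            /* The version the server computes for an object at replay time. */
            static uint64_t current_version(const struct obj *o)
            {
                    return o->exists ? o->version : ENOENT_VERSION;
            }

            /* VBR check: the version recorded in the replayed request must match
             * the version the server sees now; otherwise the replay is rejected. */
            static int vbr_check(const struct obj *o, uint64_t replay_version)
            {
                    return current_version(o) == replay_version ? 0 : -EOVERFLOW;
            }

            int main(void)
            {
                    struct obj gone = { .exists = 0, .version = 0 };

                    /* The replay recorded a real version (e.g. for FID_002), so against
                     * a missing object the check must report a mismatch. */
                    printf("rc = %d\n", vbr_check(&gone, 42)); /* -EOVERFLOW */
                    return 0;
            }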

            tappro Mikhail Pershin added a comment
            yong.fan nasf (Inactive) added a comment - The patch for the above issue: http://review.whamcloud.com/#change,2144

            There are some defects in the current VBR implementation. For example, in replay-vbr.sh test_7c:

            test_7c() {
            ...
                first="createmany -o $DIR/$tdir/$tfile- 2"
                lost="rm $MOUNT2/$tdir/$tfile-0; mkdir $MOUNT2/$tdir/$tfile-0"
                last="mv $DIR/$tdir/$tfile-1 $DIR/$tdir/$tfile-0"
                test_7_cycle "$first" "$lost" "$last" || error "Test 7c.2 failed"
            ...
            }
            

            The operation sequence in test_7_cycle() is as follows:

            (step 0) replay barrier
            (step 1) client1 "first"
            (step 2) client2 "lost"
            (step 3) client1 "last"
            (step 4) umount client2
            (step 5) MDS failover
            (step 6) client1 try to recover

            Since client2 is unmounted before the MDS failover, and client1's "last" operation depends on client2's "lost" operation, client1 is expected to fail to replay the "last" operation.

            But we found that client1 was not evicted after the MDS failover. The reason is as follows:

            The original $tfile-0 had FID "FID_001" when client1 created it in step 1. Then client2 unlinked $tfile-0 and created a directory with the same name, so $tfile-0 got the new FID "FID_002" in step 2. When client1 performed step 3 it used $tfile-0's new "FID_002", and that FID was also used when client1 replayed "last" during the MDS failover. But client2 was absent during the failover, so nobody re-created $tfile-0 with the new "FID_002", and when client1 tried to replay "last" it could not find the target $tfile-0 with "FID_002". In such a case client1's recovery should be regarded as a failure and client1 should be evicted. But in the current implementation the VBR phase only evicts a client on a VBR failure (the object version mismatch case); test_7c failed with "-ENOENT" before there was any chance to compare the versions, so client1 was not evicted.

            The simplest way to resolve the issue is to regard all missing-target cases during recovery as VBR failures and then evict the related client.
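
            A rough sketch of that idea (hypothetical names, not the actual patch under review): if the replayed target cannot be found, record it as a VBR failure so the client is evicted after recovery instead of the error being silently swallowed.

            #include <errno.h>
            #include <stdbool.h>
            #include <stdio.h>

            /* Hypothetical per-client recovery state. */
            struct client_export { bool vbr_failed; };

            /* Treat a missing replay target the same way as a version mismatch:
             * mark the client so it is evicted once recovery finishes. */
            static int replay_target_check(struct client_export *exp, int lookup_rc)
            {
                    if (lookup_rc == -ENOENT) {
                            exp->vbr_failed = true;
                            return -EOVERFLOW;      /* report as a VBR failure */
                    }
                    return lookup_rc;
            }

            int main(void)
            {
                    struct client_export exp = { .vbr_failed = false };
                    int rc = replay_target_check(&exp, -ENOENT);

                    printf("rc = %d, evict client = %s\n", rc, exp.vbr_failed ? "yes" : "no");
                    return 0;
            }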

            yong.fan nasf (Inactive) added a comment
            yujian Jian Yu added a comment -

            Lustre Clients:
            Tag: 2.1.0
            Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-238.19.1.el5)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes

            Lustre Servers:
            Tag: v2_1_1_0_RC1
            Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-274.12.1.el5_lustre)
            Build: http://build.whamcloud.com/job/lustre-b2_1/37/arch=x86_64,build_type=server,distro=el5,ib_stack=inkernel/
            Network: IB (in-kernel OFED)

            The same issue occurred:
            https://maloo.whamcloud.com/test_sets/68802da0-54bc-11e1-9fd4-5254004bbbd3


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 4
