[LU-1060] Test failure on test suite replay-vbr, subtest test_7c Created: 31/Jan/12  Updated: 16/Feb/12  Resolved: 16/Feb/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 2.1.1
Fix Version/s: Lustre 2.2.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File 1060.tar.gz    
Issue Links:
Related
is related to LU-966 post-fsck MDS LBUG during recovery du... Resolved
Severity: 3
Rank (Obsolete): 6473

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c95fc5c2-4c42-11e1-bd50-5254004bbbd3.

The sub-test test_7c failed with the following error:

Test 7c.2 failed

Info required for matching: replay-vbr 7c



 Comments   
Comment by Sarah Liu [ 31/Jan/12 ]

dmesg and debug log from servers

Comment by Peter Jones [ 08/Feb/12 ]

Fanyong

Could you please look into this 2.2 blocker?

Thanks

Peter

Comment by Jian Yu [ 10/Feb/12 ]

Lustre Tag: v2_1_1_0_RC1
Distro/Arch: RHEL5.7 (server), SLES11SP1 (client)
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/37/
Network: TCP

The same issue occurred:
https://maloo.whamcloud.com/test_sets/e2334988-5300-11e1-81a2-5254004bbbd3

Comment by Jian Yu [ 12/Feb/12 ]

Lustre Clients:
Tag: 2.1.0
Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-238.19.1.el5)
Network: IB (in-kernel OFED)
ENABLE_QUOTA=yes

Lustre Servers:
Tag: v2_1_1_0_RC1
Distro/Arch: RHEL5/x86_64 (kernel version: 2.6.18-274.12.1.el5_lustre)
Build: http://build.whamcloud.com/job/lustre-b2_1/37/arch=x86_64,build_type=server,distro=el5,ib_stack=inkernel/
Network: IB (in-kernel OFED)

The same issue occurred:
https://maloo.whamcloud.com/test_sets/68802da0-54bc-11e1-9fd4-5254004bbbd3

Comment by nasf (Inactive) [ 14/Feb/12 ]

There are some defects in the current VBR implementation. For example, in replay-vbr.sh test_7c:

test_7c() {
...
    first="createmany -o $DIR/$tdir/$tfile- 2"
    lost="rm $MOUNT2/$tdir/$tfile-0; mkdir $MOUNT2/$tdir/$tfile-0"
    last="mv $DIR/$tdir/$tfile-1 $DIR/$tdir/$tfile-0"
    test_7_cycle "$first" "$lost" "$last" || error "Test 7c.2 failed"
...
}

The operation sequence in test_7_cycle() is as follows:

(step 0) replay barrier
(step 1) client1 "first"
(step 2) client2 "lost"
(step 3) client1 "last"
(step 4) umount client2
(step 5) MDS failover
(step 6) client1 try to recover

Since client2 unmounts before the MDS failover, and client1's "last" operation depends on client2's "lost" operation, client1 is expected to fail to replay the "last" operation.

But we found that client1 was not evicted after the MDS failover. The reason is as follows:

The original $tfile-0 had FID "FID_001" when client1 created it in step 1. In step 2, client2 unlinked $tfile-0 and then mkdir'ed the same name, so the new FID for $tfile-0 became "FID_002". When client1 performed step 3, it used $tfile-0's new "FID_002", and that FID was also used when client1 replayed "last" during the MDS failover. But client2 was absent during the failover, so nobody re-created $tfile-0 with the new "FID_002", and when client1 tried to replay "last" it could not find the target $tfile-0 by "FID_002". In such a case, client1's recovery should be regarded as a failure and client1 should be evicted. But the current implementation only evicts a client during the VBR phase for a VBR failure (an object version mismatch). test_7c failed with "-ENOENT" before there was any chance to compare versions, so client1 was not evicted.

The simplest way to resolve the issue is to regard every case where the replay target is missing during recovery as a VBR failure, and then evict the related client.
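
Roughly, a minimal C sketch of that idea (not the actual MDT code: the struct and function below are simplified stand-ins; only the exp_vbr_failed flag and the -EOVERFLOW/-75 "version mismatch" error that appears in the logs later in this ticket come from the discussion itself):

#include <errno.h>

struct obd_export {
        unsigned int exp_vbr_failed:1;  /* marks the export for eviction on VBR failure */
};

/*
 * Illustrative placeholder, not an MDT entry point: post-process the result
 * of an object lookup done on behalf of a replayed request. If the target is
 * gone (-ENOENT), treat it exactly like a version mismatch so that recovery
 * for this client fails and it gets evicted, instead of the replay failing
 * silently.
 */
static int replay_lookup_rc(struct obd_export *exp, int is_replay, int rc)
{
        if (is_replay && rc == -ENOENT) {
                exp->exp_vbr_failed = 1;
                return -EOVERFLOW;      /* reported like a version mismatch (rc -75) */
        }
        return rc;
}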

Comment by nasf (Inactive) [ 14/Feb/12 ]

The patch for the above issue:

http://review.whamcloud.com/#change,2144

Comment by Mikhail Pershin [ 15/Feb/12 ]

VBR already detects ENOENT cases and fails if some object is missing. For such an object, VBR counts its version as ENOENT_VERSION and compares it with the version in the replay, "FID_002" in your example, so there must be a version mismatch. If that doesn't work for some reason, we need to find that reason. Keep in mind that this worked well for a long time but failed in the 2.1 <-> 2.1.55 case, so this is probably a compatibility issue between Lustre versions. First of all, I'd check where that -ENOENT came from and why the VBR checks missed it.
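
For illustration, a self-contained sketch of the comparison described above (ENOENT_VERSION is named in this comment, but its value here and the helper itself are assumptions, not the MDT implementation):

#include <errno.h>
#include <stdint.h>

/* Assumed sentinel for "object does not exist on the MDT"; the real value
 * is defined in the Lustre headers. */
#define ENOENT_VERSION 1ULL

/*
 * Compare the version the MDT currently holds for an object against the
 * pre-operation version carried in the replayed request. A missing object
 * gets ENOENT_VERSION, so a replay that expected FID_002's real version
 * should mismatch and be reported as a VBR failure (-EOVERFLOW, i.e. the
 * rc -75 in the log below).
 */
static int vbr_version_check(uint64_t current_version, uint64_t replay_pre_version)
{
        if (current_version != replay_pre_version)
                return -EOVERFLOW;
        return 0;
}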

Comment by Mikhail Pershin [ 15/Feb/12 ]

Looking through server syslog:

Lustre: 31370:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
  req@ffff88031a8b5000 x1392520277794082/t657129996304(657129996304) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 472/424 e 0 to 0 dl 1328013402 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
Lustre: 31370:0:(client.c:2530:ptlrpc_replay_interpret()) Skipped 3 previous similar messages
Lustre: 31370:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1392520277794320 sent from lustre-MDT0000-mdc-ffff880307c27800 to NID 192.168.4.134@o2ib has timed out for slow reply: [sent 1328013368] [real_sent 1328013368] [current 1328013402] [deadline 34s] [delay 0s]  req@ffff880316c8f800 x1392520277794320/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 192/192 e 0 to 1 dl 1328013402 ref 1 fl Rpc:X/ffffffff/ffffffff rc 0/-1
Lustre: 31370:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 37 previous similar messages
Lustre: 31370:0:(import.c:1160:completed_replay_interpret()) lustre-MDT0000-mdc-ffff880307c27800: version recovery fails, reconnecting
LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.

That shows that VBR works fine: the mismatch was detected and recovery failed.

But a bit later:

Lustre: lustre-MDT0000-mdc-ffff88024b596400: Connection to service lustre-MDT0000 via nid 192.168.4.134@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Skipped 9 previous similar messages
Lustre: 31370:0:(import.c:852:ptlrpc_connect_interpret()) MGS@192.168.4.134@o2ib changed server handle from 0xa0636bc0947b83ea to 0xa0636bc0947b8a57
LustreError: 31370:0:(client.c:2573:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff8802c7375800 x1392520277794470/t661424963601(661424963601) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 472/424 e 0 to 0 dl 1328013526 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
Lustre: lustre-MDT0000-mdc-ffff88024b596400: Connection restored to service lustre-MDT0000 using nid 192.168.4.134@o2ib

I have no idea what that is so far.

Comment by nasf (Inactive) [ 15/Feb/12 ]

It is not an interoperability issue; it can be reproduced against the latest master branch with replay-vbr test_7c.

Comment by Mikhail Pershin [ 15/Feb/12 ]

This is a result of the LU-966 patch; unfortunately, nobody noticed that it ruins VBR recovery, because mdt_object_find() may exit early without any VBR checking.

Comment by nasf (Inactive) [ 15/Feb/12 ]

Right. To erase the side effect of the LU-966 patch, my patch works well. Otherwise, we have to fix up all the related points one by one, and if someone adds new points in the future, he/she has to consider this again.

So, what's your idea?

Comment by Mikhail Pershin [ 15/Feb/12 ]

LU-1060 is caused by the improper LU-966 fix.

Comment by Mikhail Pershin [ 15/Feb/12 ]

Right, your patch will work to cover some cases, but it is just a quick fix to hide the bad effects of the previous wrong patch, and that is a way we definitely shouldn't go. It hides the side effects but doesn't fix the root cause; moreover, it doesn't fix the broken VBR, which can cause unneeded evictions after LU-966. The right way is to step back to the point where all the issues appeared - LU-966.

Basically, we need to revert that patch and apply its first version - just replace the assertions with error checks, which is straightforward and easy to follow. The idea of doing that check early in the MDT was wrong and we missed it; any further attempts to fix it in the MDT will cause more complexity and more checks there. I've already made a patch:
http://review.whamcloud.com/#change,2148
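
As a rough illustration of the "replace assertions with error checks" idea (hypothetical names only; this is not the LU-966 patch itself):

#include <errno.h>
#include <stddef.h>

struct mdt_object;                      /* opaque stand-in */

/*
 * Hypothetical lookup wrapper: the LU-966 problem was an LBUG (assertion
 * failure) when an expected object turned out to be missing after fsck.
 * The "first version" approach mentioned above simply turns such assertions
 * into error returns, so the caller and the VBR machinery can handle the
 * failure normally.
 */
static int lookup_checked(struct mdt_object *obj, struct mdt_object **out)
{
        /* Before: LASSERT(obj != NULL);  -- takes down the whole MDS. */
        if (obj == NULL)
                return -ENOENT;         /* After: report the error instead. */

        *out = obj;
        return 0;
}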

Another worry of mine is the test set used for master review testing: I don't get why it misses replay-vbr and runtests, which are pretty good tests. LU-1060 appeared right after LU-966 landed and nobody noticed. This is out of the scope of this bug, though.

Comment by nasf (Inactive) [ 15/Feb/12 ]

Removing the patch for LU-966 directly may not be the best solution. If you do not like my patch for LU-1060, we can fix it case by case. I think Bobijam is working on such a patch to erase the side effect of his LU-966 patch.

Comment by Zhenyu Xu [ 15/Feb/12 ]

The patch to let the VBR version check handle replays of non-existent objects is posted at http://review.whamcloud.com/2149

Description: for replay cases, mdt_version_get_check() will check for a non-existent MDT object and evict clients accordingly, but mdt_object_find() will not set exp_vbr_failed and will not evict the faulty client.

Comment by Peter Jones [ 16/Feb/12 ]

Duplicate of LU-966.
