[LU-1060] Test failure on test suite replay-vbr, subtest test_7c Created: 31/Jan/12 Updated: 16/Feb/12 Resolved: 16/Feb/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.1.1 |
| Fix Version/s: | Lustre 2.2.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 6473 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c95fc5c2-4c42-11e1-bd50-5254004bbbd3. The sub-test test_7c failed with the following error:
Info required for matching: replay-vbr 7c |
| Comments |
| Comment by Sarah Liu [ 31/Jan/12 ] |
|
dmesg and debug log from servers |
| Comment by Peter Jones [ 08/Feb/12 ] |
|
Fanyong, could you please look into this 2.2 blocker? Thanks, Peter |
| Comment by Jian Yu [ 10/Feb/12 ] |
|
Lustre Tag: v2_1_1_0_RC1
The same issue occurred: |
| Comment by Jian Yu [ 12/Feb/12 ] |
|
Lustre Clients:
Lustre Servers:
The same issue occurred: |
| Comment by nasf (Inactive) [ 14/Feb/12 ] |
|
There are some defects in the current VBR implementation. For example, in replay-vbr.sh test_7c:

test_7c() {
...
first="createmany -o $DIR/$tdir/$tfile- 2"
lost="rm $MOUNT2/$tdir/$tfile-0; mkdir $MOUNT2/$tdir/$tfile-0"
last="mv $DIR/$tdir/$tfile-1 $DIR/$tdir/$tfile-0"
test_7_cycle "$first" "$lost" "$last" || error "Test 7c.2 failed"
...
}
The operation sequence in test_7_cycle() is as follows:
(step 0) replay barrier
(step 1) client1 runs "first": createmany -o $DIR/$tdir/$tfile- 2
(step 2) client2 runs "lost": rm $MOUNT2/$tdir/$tfile-0; mkdir $MOUNT2/$tdir/$tfile-0
(step 3) client1 runs "last": mv $DIR/$tdir/$tfile-1 $DIR/$tdir/$tfile-0
then client2 is unmounted and the MDS fails over.

Since client2 is unmounted before the MDS failover, and client1's "last" operation depends on client2's "lost" operation, client1 is expected to fail to replay the "last" operation. But we found that client1 was not evicted after the MDS failover. The reason is as follows: the original $tfile-0 FID was "FID_001" when it was created by client1 in step 1; then client2 unlinked $tfile-0 and created a directory with the same name, so the new FID for $tfile-0 became "FID_002" after step 2. When client1 performed step 3 it used $tfile-0's new "FID_002", and that FID was also used when client1 replayed "last" during the MDS failover. But client2 was absent during the failover, so nobody re-created $tfile-0 with the new "FID_002", and when client1 tried to replay "last" it could not find the target $tfile-0 with "FID_002". In such a case client1's recovery should be regarded as a failure and client1 should be evicted. In the current implementation, however, the VBR phase only evicts clients with a VBR failure (object version mismatch cases). test_7c failed with "-ENOENT" before the versions could be compared, so client1 was not evicted. The simplest way to resolve the issue is to regard all missing-target cases during recovery as a VBR failure and evict the related client. |
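A minimal, self-contained sketch of the idea in that last sentence, under invented simplified names (struct export, note_missing_replay_target, finish_recovery are illustrative stand-ins, not the actual Lustre code; the real flag is exp_vbr_failed on struct obd_export): a missing replay target marks the client's export as VBR-failed, and every flagged export is evicted once recovery completes.

#include <stdbool.h>
#include <stddef.h>

struct export {
	bool vbr_failed;   /* stand-in for obd_export's exp_vbr_failed flag */
	bool evicted;
};

/* Called for each replayed request whose target object cannot be found. */
static void note_missing_replay_target(struct export *exp)
{
	/* Treat "target not found" the same way as a version mismatch. */
	exp->vbr_failed = true;
}

/* Called once recovery completes: evict every client that failed VBR. */
static void finish_recovery(struct export *exps, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (exps[i].vbr_failed)
			exps[i].evicted = true;
}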
| Comment by nasf (Inactive) [ 14/Feb/12 ] |
|
The patch for the above issue: |
| Comment by Mikhail Pershin [ 15/Feb/12 ] |
|
VBR already detects ENOENT cases and fails if some object is missing. For such an object VBR counts its version as ENOENT_VERSION and compares it with the version in the replay (FID_002 in your example), so there must be a version mismatch. If that doesn't work for some reason, we need to find that reason. Keep in mind that this worked well for a long time but failed in the 2.1<->2.1.55 case, so this is probably a compatibility issue between Lustre versions. First of all I'd check where that -ENOENT came from and why the VBR checks missed it. |
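A rough sketch of the version check being described, assuming a missing object's version is represented by an ENOENT_VERSION sentinel and a mismatch is reported as -EOVERFLOW (the "rc -75" visible in the log quoted in the next comment); the helper names and the sentinel's value below are illustrative, not the actual Lustre definitions.

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define ENOENT_VERSION 1ULL   /* sentinel for "object does not exist" */

/* Server-side version of an object: a missing object gets the sentinel. */
static uint64_t current_version(bool object_exists, uint64_t disk_version)
{
	return object_exists ? disk_version : ENOENT_VERSION;
}

/* Compare against the pre-operation version recorded in the replay request. */
static int check_replay_version(uint64_t replay_version, uint64_t cur_version)
{
	if (replay_version != cur_version)
		return -EOVERFLOW;   /* version mismatch: VBR failure */
	return 0;
}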
| Comment by Mikhail Pershin [ 15/Feb/12 ] |
|
Looking through server syslog:

Lustre: 31370:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay req@ffff88031a8b5000 x1392520277794082/t657129996304(657129996304) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 472/424 e 0 to 0 dl 1328013402 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
Lustre: 31370:0:(client.c:2530:ptlrpc_replay_interpret()) Skipped 3 previous similar messages
Lustre: 31370:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1392520277794320 sent from lustre-MDT0000-mdc-ffff880307c27800 to NID 192.168.4.134@o2ib has timed out for slow reply: [sent 1328013368] [real_sent 1328013368] [current 1328013402] [deadline 34s] [delay 0s] req@ffff880316c8f800 x1392520277794320/t0(0) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 192/192 e 0 to 1 dl 1328013402 ref 1 fl Rpc:X/ffffffff/ffffffff rc 0/-1
Lustre: 31370:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 37 previous similar messages
Lustre: 31370:0:(import.c:1160:completed_replay_interpret()) lustre-MDT0000-mdc-ffff880307c27800: version recovery fails, reconnecting
LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.

That shows the VBR works fine: the mismatch was detected and recovery failed. But a bit later:

Lustre: lustre-MDT0000-mdc-ffff88024b596400: Connection to service lustre-MDT0000 via nid 192.168.4.134@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Skipped 9 previous similar messages
Lustre: 31370:0:(import.c:852:ptlrpc_connect_interpret()) MGS@192.168.4.134@o2ib changed server handle from 0xa0636bc0947b83ea to 0xa0636bc0947b8a57
LustreError: 31370:0:(client.c:2573:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff8802c7375800 x1392520277794470/t661424963601(661424963601) o-1->lustre-MDT0000_UUID@192.168.4.134@o2ib:12/10 lens 472/424 e 0 to 0 dl 1328013526 ref 2 fl Interpret:R/ffffffff/ffffffff rc -2/-1
Lustre: lustre-MDT0000-mdc-ffff88024b596400: Connection restored to service lustre-MDT0000 using nid 192.168.4.134@o2ib

I have no idea what that is so far. |
| Comment by nasf (Inactive) [ 15/Feb/12 ] |
|
It is not an interoperability issue; it can be reproduced against the latest master branch by replay-vbr test_7c. |
| Comment by Mikhail Pershin [ 15/Feb/12 ] |
|
This is the result of |
| Comment by nasf (Inactive) [ 15/Feb/12 ] |
|
Right, so to erase the side-effect of

So, what's your idea? |
| Comment by Mikhail Pershin [ 15/Feb/12 ] |
|
Right, your patch will work to cover some cases, but it is just a fast fix to hide the bad effects of the previous wrong patch, and that is a way we shouldn't go, for sure. It hides the side-effects but doesn't fix the root cause; moreover it doesn't fix broken VBR, which can cause unneeded evictions after

Basically we need to revert that patch and apply its first version - just replace the assertions with error checks; it is straightforward and easy to follow. The idea to do that early in MDT was wrong and we missed that; any further attempts to fix it in MDT will cause more complexity and more checks there. I've made the patch already:

Another worry of mine is the test set for master review testing; I don't get why it misses replay-vbr and runtests, which are pretty good tests. |
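A generic illustration of the "replace assertions with error checks" approach, with invented names rather than the real Lustre functions: the point is that a missing replay target should produce an error the recovery path can act on, not a server-side assertion failure.

#include <errno.h>
#include <stddef.h>

struct mdt_obj;   /* opaque placeholder for the replay target */

static int apply_replayed_op(struct mdt_obj *obj)
{
	/* Before: LASSERT(obj != NULL); would crash the MDS thread when the
	 * replayed operation's target turned out to be missing. */
	if (obj == NULL)
		return -ENOENT;   /* after: let the recovery path handle it */

	/* ... apply the replayed operation to obj ... */
	return 0;
}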
| Comment by nasf (Inactive) [ 15/Feb/12 ] |
|
Removing the patch for |
| Comment by Zhenyu Xu [ 15/Feb/12 ] |
|
The patch to let the VBR version check handle replay of a non-existent object is posted at http://review.whamcloud.com/2149. Description: for replay cases, mdt_version_get_check will check for a non-existent MDT object and evict clients accordingly, but mdt_object_find will not set exp_vbr_failed and will not evict the faulty client. |
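A simplified sketch of the two paths described above (exp_vbr_failed is the real export flag; version_get_check and object_find below are only illustrative stand-ins for the mdt_* functions named in the comment, not the code under review): the version-check path already flags the export when the object is missing, while the plain object-find path only returns -ENOENT, so the idea is to flag the export there too for replayed requests.

#include <errno.h>
#include <stdbool.h>

struct export {
	bool exp_vbr_failed;   /* set -> client is evicted after recovery */
};

/* Path that already handles a missing object during replay: flag and fail. */
static int version_get_check(struct export *exp, bool obj_exists)
{
	if (!obj_exists) {
		exp->exp_vbr_failed = true;   /* client will be evicted */
		return -ENOENT;
	}
	return 0;
}

/* Path that misses the case today: add the same flagging for replays. */
static int object_find(struct export *exp, bool obj_exists, bool is_replay)
{
	if (!obj_exists) {
		if (is_replay)
			exp->exp_vbr_failed = true;   /* proposed addition */
		return -ENOENT;
	}
	return 0;
}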
| Comment by Peter Jones [ 16/Feb/12 ] |
|
Duplicate of LU-966 |