[LU-661] replay-dual test_0b: @@@@@@ FAIL: test_0b failed with 1 Created: 05/Sep/11 Updated: 16/Apr/13 Resolved: 28/Jan/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.4 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | nasf (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5213 |
| Description |
|
The test is as follows: test_0b() {
1) replay_barrier $SINGLEMDS
2) touch $MOUNT2/$tfile
3) touch $MOUNT1/$tfile-2
4) umount $MOUNT2
5) facet_failover $SINGLEMDS
6) umount -f $MOUNT1
7) zconf_mount `hostname` $MOUNT1 || error "mount1 fais"
8) zconf_mount `hostname` $MOUNT2 || error "mount2 fais"
9) checkstat $MOUNT1/$tfile-2 && return 1
10) checkstat $MOUNT2/$tfile && return 2
11) return 0
}
run_test 0b "lost client during waiting for next transno"
Currently, with VBR enabled, it is uncertain whether client1's requests have been replayed before step 6). If they are replayed, "$MOUNT1/$tfile-2" will exist and check 9) will fail. So check 9) is incorrect and unnecessary.

Failure log:
...
Starting client: client-22vm1.lab.whamcloud.com: -o user_xattr,acl client-22vm3@tcp:/lustre /mnt/lustre
debug=-1 subsystem_debug=0xffb7e3ff debug_mb=32
Starting client: client-22vm1.lab.whamcloud.com: -o user_xattr,acl client-22vm3@tcp:/lustre /mnt/lustre2
debug=-1 subsystem_debug=0xffb7e3ff debug_mb=32
replay-dual test_0b: @@@@@@ FAIL: test_0b failed with 1
...

As shown on the MDS side:
...
Lustre: 23777:0:(ldlm_lib.c:2029:target_queue_recovery_request()) Next recovery transno: 8589934645, current: 8589934653, replaying
...

Client1's open_create request was replayed, so check 9) above failed.

Info required for matching: replay-dual 0b |
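The argument above hinges on whether the MDS actually advertises VBR to the client. As an illustration only (not part of the original test), a minimal sketch of such a check, assuming the standard mdc connect_flags parameter and the test-framework $LCTL variable:

# Succeeds if the MDS advertises version-based recovery (VBR) to this client.
vbr_enabled() {
    $LCTL get_param -n mdc.*.connect_flags 2>/dev/null | grep -q version_recovery
}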
| Comments |
| Comment by Andreas Dilger [ 05/Sep/11 ] |
|
It would be enough to change the test to allow $tfile-2 to exist if VBR is in the connect flags for the MDS. |
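Following that suggestion, check 9) could be made conditional. A rough sketch, assuming a vbr_enabled helper such as the one above; this is not the patch that eventually landed:

# With VBR the replayed open/create may legitimately recreate $tfile-2,
# so only treat its presence as a failure when VBR is not advertised.
if vbr_enabled; then
    checkstat $MOUNT1/$tfile-2 || true
else
    checkstat $MOUNT1/$tfile-2 && return 1
fi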
| Comment by James A Simmons [ 07/Sep/11 ] |
|
Is this related to |
| Comment by Mikhail Pershin [ 07/Sep/11 ] |
|
James, in your report the failure is different: "replay-dual test_0b: @@@@@@ FAIL: mount1 fais". It is not the same issue and doesn't look related, at least for now. |
| Comment by Mikhail Pershin [ 07/Sep/11 ] |
|
The question is why that file doesn't exist every time. I need to check this closely before changing the test. Maybe the client hasn't started recovery yet, so no replays were done, but that must be checked. |
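One way to check that during a run, sketched here as an assumption rather than anything taken from this ticket, is to look at the client import state and the MDT recovery status, which are standard lctl parameters:

# On the client: the import state (REPLAY/RECOVER vs. FULL) shows whether recovery
# is still in progress; connect_flags shows whether version_recovery was negotiated.
lctl get_param mdc.*.import | grep -E 'state|connect_flags'
# On the MDS: reports whether recovery completed and how many requests were replayed.
lctl get_param mdt.*.recovery_status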
| Comment by Mikhail Pershin [ 22/Nov/11 ] |
| Comment by James A Simmons [ 17/Feb/12 ] |
|
Is this fix needed any more? |
| Comment by James A Simmons [ 08/Mar/12 ] |
|
I see the patch was abandoned. I assume this ticket can be closed now? |
| Comment by Build Master (Inactive) [ 02/May/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Andreas Dilger [ 31/May/12 ] |
|
Should this bug be closed? |
| Comment by Jian Yu [ 15/Oct/12 ] |
|
Lustre Tag: v2_3_0_RC3
The same issue occurred again:
== replay-dual test 0b: lost client during waiting for next transno ================================== 15:39:29 (1350254369)
Filesystem 1K-blocks Used Available Use% Mounted on
10.10.4.133@tcp:/lustre
13779696 740436 12339196 6% /mnt/lustre
Failing mds1 on node fat-amd-2
Stopping /mnt/mds1 (opts:) on fat-amd-2
affected facets: mds1
Failover mds1 to fat-amd-2
15:39:49 (1350254389) waiting for fat-amd-2 network 900 secs ...
15:39:49 (1350254389) network interface is UP
Starting mds1: /dev/sdc5 /mnt/mds1
Started lustre-MDT0000
Starting client: client-5: -o user_xattr,flock fat-amd-2@tcp:/lustre /mnt/lustre
Starting client: client-5: -o user_xattr,flock fat-amd-2@tcp:/lustre /mnt/lustre2
replay-dual test_0b: @@@@@@ FAIL: test_0b failed with 1
|
| Comment by Sarah Liu [ 06/Nov/12 ] |
|
Not sure if this is another instance: https://maloo.whamcloud.com/test_sets/bc3141ae-2708-11e2-b04c-52540035b04c (lustre master build #1017, SLES11 SP2 client) |
| Comment by Sarah Liu [ 04/Dec/12 ] |
|
Another failure instance on SLES11 SP2: https://maloo.whamcloud.com/test_sets/5baf3a52-3d56-11e2-9127-52540035b04c |
| Comment by Jian Yu [ 10/Dec/12 ] |
|
Lustre Branch: b2_1
The same issue occurred: |
| Comment by Sarah Liu [ 03/Jan/13 ] |
|
Another instance found with tag 2.3.58, RHEL6 server and client with IB. |
| Comment by Mikhail Pershin [ 10/Jan/13 ] |
|
The patch was landed only on the orion branch, not on master; I'll refresh it. |
| Comment by Mikhail Pershin [ 11/Jan/13 ] |
| Comment by Sarah Liu [ 21/Jan/13 ] |
|
lustre-master build #1176 hit this error in ofd build testing: https://maloo.whamcloud.com/test_sets/570da3a8-62c7-11e2-982f-52540035b04c |
| Comment by Mikhail Pershin [ 28/Jan/13 ] |
|
The patch was landed. |