[LU-661] replay-dual test_0b: @@@@@@ FAIL: test_0b failed with 1 Created: 05/Sep/11  Updated: 16/Apr/13  Resolved: 28/Jan/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.4
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: nasf (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: MB

Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 5213

 Description   

The test is as follows:

test_0b() {
1)   replay_barrier $SINGLEMDS
2)   touch $MOUNT2/$tfile
3)   touch $MOUNT1/$tfile-2
4)   umount $MOUNT2
5)   facet_failover $SINGLEMDS
6)   umount -f $MOUNT1
7)   zconf_mount `hostname` $MOUNT1 || error "mount1 fais"
8)   zconf_mount `hostname` $MOUNT2 || error "mount2 fais"
9)   checkstat $MOUNT1/$tfile-2 && return 1
10)  checkstat $MOUNT2/$tfile && return 2
11)  return 0
}
run_test 0b "lost client during waiting for next transno"

Currently, with VBR enabled, it is uncertain whether client1's requests have been replayed before step 6). If they have been replayed, "$MOUNT1/$tfile-2" exists, so check 9) returns 1 and the test fails. Check 9) is therefore incorrect and unnecessary.
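The relaxed final checks could look roughly like the sketch below. It is only an illustration, not the landed patch: the function name check_0b_result is hypothetical, and plain `[ -e ]` stands in for the test framework's checkstat so that the snippet runs without test-framework.sh.

```shell
# Hypothetical helper: final checks of test_0b with check 9) dropped.
# ASSUMPTION: `[ -e ]` replaces checkstat so this runs standalone.
check_0b_result() {
	local mount1=$1 mount2=$2 tfile=$3

	# $mount1/$tfile-2 may or may not exist, depending on whether
	# client1's requests were replayed; report it but do not fail.
	if [ -e "$mount1/$tfile-2" ]; then
		echo "note: $tfile-2 exists (client1's create was apparently replayed)"
	fi

	# Same condition as check 10): client2 was unmounted before the
	# failover, so its file is not expected to exist.
	if [ -e "$mount2/$tfile" ]; then
		return 2
	fi
	return 0
}
```

With this shape the function returns 0 whether or not $tfile-2 is present, and still returns 2 when $tfile unexpectedly survives.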

Failure log:
https://maloo.whamcloud.com/test_sets/1e5580d0-d0ba-11e0-8d02-52540025f9af

...
Starting client: client-22vm1.lab.whamcloud.com: -o user_xattr,acl client-22vm3@tcp:/lustre /mnt/lustre
debug=-1
subsystem_debug=0xffb7e3ff
debug_mb=32
Starting client: client-22vm1.lab.whamcloud.com: -o user_xattr,acl client-22vm3@tcp:/lustre /mnt/lustre2
debug=-1
subsystem_debug=0xffb7e3ff
debug_mb=32
 replay-dual test_0b: @@@@@@ FAIL: test_0b failed with 1 
...

As shown on MDS side:
https://maloo.whamcloud.com/test_logs/2675b6b8-d0ba-11e0-8d02-52540025f9af

...
Lustre: 23777:0:(ldlm_lib.c:2029:target_queue_recovery_request()) Next recovery transno: 8589934645, current: 8589934653, replaying
...

Client1's open_create request was replayed, so check 9) above failed.
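When scanning MDS logs for this message, the two transaction numbers can be pulled out with a small filter. This is a sketch only: the helper name recovery_transnos is hypothetical, and it assumes the message keeps the exact wording shown in the log excerpt above.

```shell
# Hypothetical helper: print "<next> <current>" for each
# "Next recovery transno" console message read on stdin.
recovery_transnos() {
	sed -n 's/.*Next recovery transno: \([0-9]*\), current: \([0-9]*\).*/\1 \2/p'
}

# The line quoted in this ticket.
line='Lustre: 23777:0:(ldlm_lib.c:2029:target_queue_recovery_request()) Next recovery transno: 8589934645, current: 8589934653, replaying'

# Split the two numbers into positional parameters.
set -- $(printf '%s\n' "$line" | recovery_transnos)
next=$1
current=$2
echo "next=$next current=$current gap=$((current - next))"
# prints: next=8589934645 current=8589934653 gap=8
```

For the quoted line, the current transno is 8 ahead of the next expected recovery transno.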

Info required for matching: replay-dual 0b



 Comments   
Comment by Andreas Dilger [ 05/Sep/11 ]

It would be enough to change the test to allow $tfile-2 to exist if VBR is in the connect flags for the MDS.
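A rough sketch of that suggestion follows. The helper names has_vbr and check_tfile2 are hypothetical. It also assumes that the VBR connect flag is printed as "version_recovery" and that the flags text comes from something like `lctl get_param -n mdc.*.connect_flags`; both should be verified against the Lustre version under test.

```shell
# Hypothetical helper: succeed if the connect flags text on stdin lists VBR.
# ASSUMPTION: the flag name is printed as "version_recovery".
has_vbr() {
	grep -qw 'version_recovery'
}

# Hypothetical variant of check 9): tolerate $tfile-2 when VBR is negotiated.
# $1 = connect flags text, $2 = path to $tfile-2
check_tfile2() {
	local flags=$1 path=$2

	if printf '%s\n' "$flags" | has_vbr; then
		return 0	# with VBR the file may legitimately exist
	fi
	# Without VBR keep the original expectation: the file must be absent.
	[ ! -e "$path" ] || return 1
}
```

In the test, the result of check_tfile2 would replace the unconditional "checkstat $MOUNT1/$tfile-2 && return 1".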

Comment by James A Simmons [ 07/Sep/11 ]

Is this related to LU-639 I reported?

Comment by Mikhail Pershin [ 07/Sep/11 ]

James, in your report the failure is different:

replay-dual test_0b: @@@@@@ FAIL: mount1 fais

It is not the same issue and does not look related, at least for now.

Comment by Mikhail Pershin [ 07/Sep/11 ]

The question is why that file doesn't exist all the time. I need to check this closely before changing the test. Maybe the client hasn't started recovery yet, so no replay was done, but that must be checked.

Comment by Mikhail Pershin [ 22/Nov/11 ]

http://review.whamcloud.com/1724

Comment by James A Simmons [ 17/Feb/12 ]

Is this fix needed any more?

Comment by James A Simmons [ 08/Mar/12 ]

I see the patch was abandoned. I assume this ticket can be closed now?

Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,client,el5,inkernel #340
LU-661 tests: ORI-422 don't check file-2 in replay-dual 0b (Revision ca0173956dd99e4af5e1320e45a258c41d53e998)

Result = SUCCESS
Mikhail Pershin : ca0173956dd99e4af5e1320e45a258c41d53e998
Files :

  • lustre/tests/replay-dual.sh
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » i686,client,el6,inkernel #340
LU-661 tests: ORI-422 don't check file-2 in replay-dual 0b (Revision ca0173956dd99e4af5e1320e45a258c41d53e998)

Result = SUCCESS
Mikhail Pershin : ca0173956dd99e4af5e1320e45a258c41d53e998
Files :

  • lustre/tests/replay-dual.sh
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » i686,server,el5,inkernel #340
LU-661 tests: ORI-422 don't check file-2 in replay-dual 0b (Revision ca0173956dd99e4af5e1320e45a258c41d53e998)

Result = SUCCESS
Mikhail Pershin : ca0173956dd99e4af5e1320e45a258c41d53e998
Files :

  • lustre/tests/replay-dual.sh
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,server,el6,inkernel #340
LU-661 tests: ORI-422 don't check file-2 in replay-dual 0b (Revision ca0173956dd99e4af5e1320e45a258c41d53e998)

Result = SUCCESS
Mikhail Pershin : ca0173956dd99e4af5e1320e45a258c41d53e998
Files :

  • lustre/tests/replay-dual.sh
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » i686,client,el5,inkernel #340
LU-661 tests: ORI-422 don't check file-2 in replay-dual 0b (Revision ca0173956dd99e4af5e1320e45a258c41d53e998)

Result = SUCCESS
Mikhail Pershin : ca0173956dd99e4af5e1320e45a258c41d53e998
Files :

  • lustre/tests/replay-dual.sh
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,server,el5,inkernel #340
LU-661 tests: ORI-422 don't check file-2 in replay-dual 0b (Revision ca0173956dd99e4af5e1320e45a258c41d53e998)

Result = SUCCESS
Mikhail Pershin : ca0173956dd99e4af5e1320e45a258c41d53e998
Files :

  • lustre/tests/replay-dual.sh
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,client,el6,inkernel #340
LU-661 tests: ORI-422 don't check file-2 in replay-dual 0b (Revision ca0173956dd99e4af5e1320e45a258c41d53e998)

Result = SUCCESS
Mikhail Pershin : ca0173956dd99e4af5e1320e45a258c41d53e998
Files :

  • lustre/tests/replay-dual.sh
Comment by Andreas Dilger [ 31/May/12 ]

Should this bug be closed?

Comment by Jian Yu [ 15/Oct/12 ]

Lustre Tag: v2_3_0_RC3
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/36
Distro/Arch: RHEL6.3/x86_64(server), FC15/x86_64(client)
Network: TCP
ENABLE_QUOTA=yes

The same issue occurred again:
https://maloo.whamcloud.com/test_sets/1bb4a3ca-167c-11e2-80d0-52540035b04c

== replay-dual test 0b: lost client during waiting for next transno ================================== 15:39:29 (1350254369)
Filesystem           1K-blocks      Used Available Use% Mounted on
10.10.4.133@tcp:/lustre
                      13779696    740436  12339196   6% /mnt/lustre
Failing mds1 on node fat-amd-2
Stopping /mnt/mds1 (opts:) on fat-amd-2
affected facets: mds1
Failover mds1 to fat-amd-2
15:39:49 (1350254389) waiting for fat-amd-2 network 900 secs ...
15:39:49 (1350254389) network interface is UP
Starting mds1:   /dev/sdc5 /mnt/mds1
Started lustre-MDT0000
Starting client: client-5: -o user_xattr,flock fat-amd-2@tcp:/lustre /mnt/lustre
Starting client: client-5: -o user_xattr,flock fat-amd-2@tcp:/lustre /mnt/lustre2
 replay-dual test_0b: @@@@@@ FAIL: test_0b failed with 1 
Comment by Sarah Liu [ 06/Nov/12 ]

not sure if this is another instance: https://maloo.whamcloud.com/test_sets/bc3141ae-2708-11e2-b04c-52540035b04c

lustre master build #1017 SLES11 SP2 client

Comment by Sarah Liu [ 04/Dec/12 ]

another failure instance on SLES11 SP2 https://maloo.whamcloud.com/test_sets/5baf3a52-3d56-11e2-9127-52540035b04c

Comment by Jian Yu [ 10/Dec/12 ]

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/148
Distro/Arch: RHEL5.8/x86_64 (kernel version: 2.6.18-308.20.1.el5)
Network: TCP (1GigE)

The same issue occurred:
https://maloo.whamcloud.com/test_sets/d453c39a-41bd-11e2-a653-52540035b04c

Comment by Sarah Liu [ 03/Jan/13 ]

another instance found in tag 2.3.58 RHEL6 server and client with IB
https://maloo.whamcloud.com/test_sets/4dddf3dc-55a7-11e2-88af-52540035b04c

Comment by Mikhail Pershin [ 10/Jan/13 ]

The patch landed only on the orion branch, not on master. I'll refresh it.

Comment by Mikhail Pershin [ 11/Jan/13 ]

http://review.whamcloud.com/4999

Comment by Sarah Liu [ 21/Jan/13 ]

lustre-master build #1176 hit this error in ofd build testing:

https://maloo.whamcloud.com/test_sets/570da3a8-62c7-11e2-982f-52540035b04c

Comment by Mikhail Pershin [ 28/Jan/13 ]

The patch has landed.
