[LU-1458] lustre-rsync-test test_2b: old lustre_rsync does not work with new llog_changelog_ext_rec remove changelog
Created: 31/May/12  Updated: 18/Mar/14  Resolved: 18/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.2, Lustre 2.4.1, Lustre 2.5.0, Lustre 2.5.1
Fix Version/s: Lustre 2.4.0, Lustre 2.6.0

Type: Bug
Priority: Major
Reporter: Maloo
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: yuc2

Attachments: File 1458.tar.gz     File lustre-rsync-test.test_1.changelog     File lustre-rsync-test.test_2b.changelog    
Issue Links:
Related
is related to LU-1331 changelogs: RNMTO record not always a... Resolved
is related to LU-4781 lustre-rsync-test test_2b: Replicatio... Resolved
is related to LU-1442 File corrupt with 1MiB-aligned 4k reg... Closed
is related to LU-1237 2.1.1<->2.2 interop Test failure on t... Resolved
Severity: 3
Rank (Obsolete): 4107

 Description   

This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/eb3d7ed4-ab13-11e1-8e7f-52540035b04c.

The sub-test test_2b failed with the following error:

Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client1/~dmtmp/PM: PMD394.TMP
lustre-rsync-test test_2b: @@@@@@ FAIL: Failure in replication; differences found.
test failed to respond and timed out

Info required for matching: lustre-rsync-test 2b



 Comments   
Comment by Peter Jones [ 31/May/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 01/Jun/12 ]

I tried this on my VM and could not reproduce the issue.

P.S. The test process is:

1. Run dbench on a Lustre directory.
2. Use lustre_rsync to replicate the Lustre directory contents to another destination directory.
3. Check whether the source Lustre directory differs from the destination directory.

This auto test case shows that /mnt/lustre/d0.lustre-rsync-test/d2/clients/client1/~dmtmp/PM/PMD394.TMP differs from its lustre_rsync-ed destination directory /tmp/target.
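
The final check in the steps above can be approximated with a small directory comparison. This is only a sketch in Python, not the test suite's own check; the function names only_in and demo and the sample file A.TMP are made up for illustration. It reports entries that exist on one side only, which is the kind of "Only in ..." difference quoted in the description.

```python
import filecmp
import os
import tempfile


def only_in(source, target):
    """Return sorted (directory, name) pairs that exist on one side only."""
    missing = []
    stack = [filecmp.dircmp(source, target)]
    while stack:
        node = stack.pop()
        missing += [(node.left, name) for name in node.left_only]
        missing += [(node.right, name) for name in node.right_only]
        # Descend into directories that exist on both sides.
        stack.extend(node.subdirs.values())
    return sorted(missing)


def demo():
    """Build a tiny source/target pair where one file was not replicated."""
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src", "PM")
        dst = os.path.join(root, "dst", "PM")
        os.makedirs(src)
        os.makedirs(dst)
        for name in ("A.TMP", "PMD394.TMP"):
            open(os.path.join(src, name), "w").close()
        # Only one of the two files reaches the target.
        open(os.path.join(dst, "A.TMP"), "w").close()
        found = only_in(os.path.join(root, "src"), os.path.join(root, "dst"))
        return [(os.path.relpath(d, root), n) for d, n in found]
```

Calling demo() lists PMD394.TMP as present only under the source tree, which mirrors the shape of the reported failure.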

Comment by Sarah Liu [ 05/Jun/12 ]

I am not sure if this is the same issue: https://maloo.whamcloud.com/test_sets/81757bd0-ad72-11e1-8152-52540035b04c

client: 2.1.1-rhel6
server: lustre-master-tag-2.2.54-rhel6

Comment by Zhenyu Xu [ 05/Jun/12 ]

Sarah,

it's a different issue, and I've created a ticket for it (LU-1483)

Comment by Li Wei (Inactive) [ 23/Jun/12 ]

https://maloo.whamcloud.com/test_sets/70d74642-bc21-11e1-8a1f-52540035b04c

Comment by Zhenyu Xu [ 23/Jul/12 ]

This could possibly be related.

Comment by Oleg Drokin [ 02/Aug/12 ]

Bobi, the problem is not a difference in file content.

The problem is that this file exists only in the source dir, so it was never copied to the target dir at all.
We need to find out why that happened.

Comment by Zhenyu Xu [ 06/Aug/12 ]

I'll upload a debug patch to dump the changelog in plain text.

When the failure is hit again, would it be possible to upload the changelog file as well? It contains all the changelog records that lustre_rsync uses to replicate the Lustre source dir.

Comment by Zhenyu Xu [ 07/Aug/12 ]

debug improvement patch tracking at http://review.whamcloud.com/3551

patch description
LU-1458 test: dump changelog for lustre-rsync-test

Dump plain text format changelog records for failed lustre-rsync-test
test case to help debugging.
Comment by Peter Jones [ 08/Aug/12 ]

Diagnostic patch landed to master so should be in the next tag.

Comment by Peter Jones [ 22/Aug/12 ]

Sarah

Have you been able to test whether this issue still occurs since the diagnostic patch was landed or have you been blocked in doing so by another issue?

Peter

Comment by Sarah Liu [ 22/Aug/12 ]

Hi Peter, due to TT-832, I cannot provision a 2.1.1 client to verify this.

Comment by Sarah Liu [ 24/Aug/12 ]

Hit a similar issue with the client running 2.1.1 and the server running lustre-master-tag2.2.93. The debug patch is for master only, while this error was seen in interop testing, which actually uses the script on the client. I am trying to port the changes to 2.1.1 and rerun the test.

Unfortunately, this is the report without the debug patch:
https://maloo.whamcloud.com/test_sets/da526672-ee0f-11e1-b95b-52540035b04c

Comment by Sarah Liu [ 24/Aug/12 ]

https://maloo.whamcloud.com/test_sets/0d1e405e-ee19-11e1-8649-52540035b04c
Attached are the logs, including the changelog.

Comment by Zhenyu Xu [ 27/Aug/12 ]

Sorry Sarah, please try http://review.whamcloud.com/3795 and reproduce it.

patch description
    LU-1458 test: enable lustre_rsync debug log dump

    Make lustre_rsync dump its debug log to help debugging.
Comment by Sarah Liu [ 27/Aug/12 ]

https://maloo.whamcloud.com/test_sets/fdfec3da-f08b-11e1-8816-52540035b04c

Comment by Zhenyu Xu [ 27/Aug/12 ]

Sarah,

The new patch changes lustre/utils/lustre_rsync.c, so we need to deploy new images so that lustre_rsync supports the new -D option.

Comment by Sarah Liu [ 28/Aug/12 ]

Bobi, the new build failed: http://build.whamcloud.com/job/lustre-reviews/8680/

Comment by Zhenyu Xu [ 29/Aug/12 ]

Rebuild done.

Comment by Sarah Liu [ 29/Aug/12 ]

Sarah,

The new patch changes lustre/utils/lustre_rsync.c, so we need deploy new images so that lustre_rsync can support this new -D option

The patch is for master, while this error occurs during interop testing between master and 2.1.x. I can manually port the script changes to 2.1.x, but not the lustre_rsync.c changes. Could you please make that change on 2.1.x so I can have a review build to test?

Comment by Zhenyu Xu [ 29/Aug/12 ]

b2_1 patch tracking at http://review.whamcloud.com/3822

Comment by Sarah Liu [ 30/Aug/12 ]

https://maloo.whamcloud.com/test_sets/5f1d7e18-f2e4-11e1-b39f-52540035b04c

Comment by Sarah Liu [ 30/Aug/12 ]

changelog

Comment by Zhenyu Xu [ 30/Aug/12 ]

from test_1.changelog

8 08RNMFM 17:53:31.834990148 2012.08.30 0x0 t=[0:0x0:0x0] p=[0x200000400:0x4:0x0] 
9 01CREAT 17:53:31.838991641 2012.08.30 0x0 t=[0x200000400:0x9:0x0] p=[0x200000400:0x3:0x0] file4

and from lrsync_log.client_1.log

***** Start 8 RNMFM (8) [0:0x0:0x0] [0x200000400:0x4:0x0]  *****
move: /tmp/target/d0.lustre-rsync-test/d1/d2/ [to] /tmp/target/d0.lustre-rsync-test/d1/d1/file4 rc1=0, errno=95
move: /tmp/target2/d0.lustre-rsync-test/d1/d2/ [to] /tmp/target2/d0.lustre-rsync-test/d1/d1/file4 rc1=0, errno=95
##### End 8 RNMFM (8) [0:0x0:0x0] [0x200000400:0x4:0x0]  rc=0 #####

and the test_log error shows

Only in /tmp/target/d0.lustre-rsync-test/d1/d1: file4
Only in /mnt/lustre/d0.lustre-rsync-test/d1: d2

The error must happen in lr_move(); lustre_rsync does not handle the rename properly. Still investigating.

Comment by Zhenyu Xu [ 31/Aug/12 ]

I performed a rename operation on the master branch; the changelog shows:

19 08RENME 06:30:12.444665393 2012.08.31 0x0 t=[0:0x0:0x0] p=[0x200000400:0xc:0x0] file4 s=[0x200000400:0xd:0x0] sp=[0x200000400:0xb:0x0] d2

Sarah, what server version does your test use? I think the client is b2_1.

Comment by Zhenyu Xu [ 31/Aug/12 ]

I guess it's related to http://review.whamcloud.com/2577: the old lustre_rsync does not work with a newer MDS server with regard to the rename operation.
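
To make the difference between the two rename records quoted in this ticket easier to see, here is a rough Python sketch that splits a plain-text changelog line into its visible parts. The parser, its name parse_record, and the dictionary layout are illustrative assumptions and are not taken from lustre_rsync.c; the two sample lines are copied from the comments in this ticket. It splits on whitespace, so file names containing spaces would not survive.

```python
import re

# Matches tokens such as t=[0:0x0:0x0] or sp=[0x200000400:0xb:0x0].
FID_FIELD = re.compile(r"^(\w+)=\[([^\]]+)\]$")

OLD_RECORD = ("8 08RNMFM 17:53:31.834990148 2012.08.30 0x0 "
              "t=[0:0x0:0x0] p=[0x200000400:0x4:0x0] ")
NEW_RECORD = ("19 08RENME 06:30:12.444665393 2012.08.31 0x0 "
              "t=[0:0x0:0x0] p=[0x200000400:0xc:0x0] file4 "
              "s=[0x200000400:0xd:0x0] sp=[0x200000400:0xb:0x0] d2")


def parse_record(line):
    """Split one dumped changelog line into index, type, FID fields, names."""
    tokens = line.split()
    index, rec_type, clock, date, flags = tokens[:5]
    record = {
        "index": int(index),
        "code": int(rec_type[:2]),   # numeric prefix, e.g. 08
        "type": rec_type[2:],        # mnemonic, e.g. RNMFM or RENME
        "time": clock,
        "date": date,
        "flags": flags,
        "fids": {},
        "names": [],
    }
    for token in tokens[5:]:
        match = FID_FIELD.match(token)
        if match:
            record["fids"][match.group(1)] = match.group(2)
        else:
            record["names"].append(token)
    return record
```

Run on the two samples, the older-style record yields only the t and p fields and no name, while the record from master carries the additional s and sp fields and two names in a single line.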

Comment by Zhenyu Xu [ 31/Aug/12 ]

b2_2 port of review#2577 tracking at http://review.whamcloud.com/3834
b2_1 port of review#2577 tracking at http://review.whamcloud.com/3835
b2_3 already has this change.

Comment by Sarah Liu [ 31/Aug/12 ]

server uses build 8694 from this review http://review.whamcloud.com/#change,3795
client uses build 8726 from this review http://review.whamcloud.com/#change,3822

Comment by Jian Yu [ 17/Sep/12 ]

Lustre client build: http://build.whamcloud.com/job/lustre-b2_1/121
Lustre server build: http://build.whamcloud.com/job/lustre-b2_3/19
Distro/Arch: RHEL6.3/x86_64

lustre-rsync-test failed: https://maloo.whamcloud.com/test_sets/7075cfac-008c-11e2-860a-52540035b04c

Comment by Zhenyu Xu [ 17/Sep/12 ]

yujian,

b2_1 patch http://review.whamcloud.com/#change,3835 hasn't landed yet. So lustre_rsync still does not work with b2_3 server.

Comment by Jian Yu [ 10/Oct/12 ]

Lustre Tag: v2_3_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/32
Distro/Arch: RHEL6.3/x86_64(server), RHEL5.8/x86_64(client)

This issue occurred again: https://maloo.whamcloud.com/test_sets/fa89fd64-12b4-11e2-a23c-52540035b04c

Bobi, could you please check the above report? The failure occurred on a non-interop environment.

Comment by Zhenyu Xu [ 10/Oct/12 ]

can you upload the $LOGDIR/${TESTSUITE}.test_2b.changelog? (it should be generated on checkdiff error)

Comment by Jian Yu [ 11/Oct/12 ]

can you upload the $LOGDIR/${TESTSUITE}.test_2b.changelog? (it should be generated on checkdiff error)

Attached. Please check. Thanks.

Comment by Zhenyu Xu [ 11/Oct/12 ]

hmm. there are 1511 records in the lustre-rsync-test.test_2b.changelog, and the test log shows that lustre_rsync consumes 1510 records

Changelog records consumed: 1510
Only in /mnt/lustre/d0.lustre-rsync-test/d2/clients/client0/~dmtmp/ACCESS: INV.PRN
lustre-rsync-test test_2b: @@@@@@ FAIL: Failure in replication; differences found.

and the 1511th in the lustre-rsync-test.test_2b.changelog is just the creation of the INV.PRN file

7426 01CREAT 20:46:40.187891001 2012.10.09 0x0 t=[0x200000400:0xd91:0x0] p=[0x200000400:0xaf7:0x0] INV.PRN

There might be some changelog read/write competition here.
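
The count comparison above can be repeated mechanically on future failures. A minimal Python sketch follows; the helper name unconsumed_records is made up, and it assumes that the dump holds one record per non-empty line and that lustre_rsync consumed records in dump order starting from the first line.

```python
import re

CONSUMED = re.compile(r"Changelog records consumed: (\d+)")


def unconsumed_records(changelog_lines, test_log_text):
    """Return the dumped records beyond the count lustre_rsync reported."""
    records = [line for line in changelog_lines if line.strip()]
    match = CONSUMED.search(test_log_text)
    if match is None:
        raise ValueError("no 'Changelog records consumed' line in test log")
    consumed = int(match.group(1))
    # Assumption: records were consumed in order from the start of the dump.
    return records[consumed:]
```

With 1511 dumped records and a reported count of 1510, this would return the single trailing record, which in the case above is the CREAT of INV.PRN.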

Comment by Zhenyu Xu [ 14/Oct/12 ]

WangDi, is there a way to make sure all changelog records are synced on its dt object?

Comment by Jian Yu [ 18/Dec/12 ]

RHEL6.3/x86_64 (2.3.0 Server + 2.1.4 RC1 Client):
https://maloo.whamcloud.com/test_sets/398724be-4871-11e2-8cdc-52540035b04c

Comment by Keith Mannthey (Inactive) [ 15/Jun/13 ]

Fresh Master error with logs: https://maloo.whamcloud.com/test_sets/04386912-d54d-11e2-bcd8-52540035b04c

test_2b 	

    Error: 'test failed to respond and timed out'
    Failure Rate: 22.00% of last 100 executions [all branches] 

There is plenty of this:

Replication of operation failed(-17): 4123 CREAT (1) [0x200000bd0:0x767:0x0] [0x200000bd0:0x766:0x0] client.txt
Replication of operation failed(-17): 4124 CREAT (1) [0x200000bd0:0x768:0x0] [0x200000bd0:0x766:0x0] dbench
Replication of operation failed(-17): 4125 MKDIR (2) [0x200000bd0:0x769:0x0] [0x200000bd0:0x766:0x0] lib64
Replication of operation failed(-17): 4126 CREAT (1) [0x200000bd0:0x76a:0x0] [0x200000bd0:0x769:0x0] libpopt.so.0
Replication of operation failed(-17): 4129 CREAT (1) [0x200000bd0:0x76b:0x0] [0x200000bd0:0x769:0x0] libc.so.6

17 is EEXIST.

It is not clear if this is the same exact issue but it fails with the same errors.
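
For triage, the failure lines above can be grouped by error name and operation. The following Python sketch is an illustration only; the function name tally is made up, and the pattern is based solely on the five log lines quoted in this comment.

```python
import errno
import re
from collections import Counter

FAILED = re.compile(r"Replication of operation failed\((-?\d+)\): \d+ (\w+)")

SAMPLE = """\
Replication of operation failed(-17): 4123 CREAT (1) [0x200000bd0:0x767:0x0] [0x200000bd0:0x766:0x0] client.txt
Replication of operation failed(-17): 4124 CREAT (1) [0x200000bd0:0x768:0x0] [0x200000bd0:0x766:0x0] dbench
Replication of operation failed(-17): 4125 MKDIR (2) [0x200000bd0:0x769:0x0] [0x200000bd0:0x766:0x0] lib64
Replication of operation failed(-17): 4126 CREAT (1) [0x200000bd0:0x76a:0x0] [0x200000bd0:0x769:0x0] libpopt.so.0
Replication of operation failed(-17): 4129 CREAT (1) [0x200000bd0:0x76b:0x0] [0x200000bd0:0x769:0x0] libc.so.6
"""


def tally(log_text):
    """Count failed operations by (errno name, operation type)."""
    counts = Counter()
    for code, operation in FAILED.findall(log_text):
        # The log prints negated errno values, so take the absolute value.
        name = errno.errorcode.get(abs(int(code)), "UNKNOWN")
        counts[(name, operation)] += 1
    return counts
```

On the sample, every failure maps to EEXIST, split between CREAT and MKDIR operations.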

Comment by Keith Mannthey (Inactive) [ 17/Jun/13 ]

Another one: https://maloo.whamcloud.com/sub_tests/0530d228-d54d-11e2-bcd8-52540035b04c

Comment by Keith Mannthey (Inactive) [ 26/Jun/13 ]

https://maloo.whamcloud.com/test_sets/652535cc-ddfc-11e2-a20c-52540035b04c

 lustre-rsync-test test_2b: @@@@@@ FAIL: Failure in replication; differences found. 
Comment by Bruno Faccini (Inactive) [ 04/Jul/13 ]

https://maloo.whamcloud.com/test_sets/d9ef75f0-e416-11e2-8f78-52540035b04c :

lustre-rsync-test test_2b: @@@@@@ FAIL: Failure in replication; differences found.

Comment by Bob Glossman (Inactive) [ 12/Aug/13 ]

another: https://maloo.whamcloud.com/test_sets/bcadb2aa-035a-11e3-9f24-52540035b04c

lustre-rsync-test test_2b: @@@@@@ FAIL: Failure in replication; differences found.

Comment by Bob Glossman (Inactive) [ 13/Aug/13 ]

another: https://maloo.whamcloud.com/test_sets/2207a0c8-0439-11e3-a8e9-52540035b04c

Comment by Bruno Faccini (Inactive) [ 28/Aug/13 ]

+1 at https://maloo.whamcloud.com/test_sets/35af817e-0f54-11e3-9bce-52540035b04c
It shows the same one-entry difference between the number of Changelog entries reported during the test and the number in the content gathered for the test, and the extra entry is the CREATE action for the file finally reported as missing during the check.
I wonder if it is a problem with the Changelog sync itself, or perhaps a timing issue where the background dbench has not really stopped before lustre_rsync starts?

Comment by Jian Yu [ 04/Sep/13 ]

Lustre client: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Lustre server: http://build.whamcloud.com/job/lustre-b2_3/41/ (2.3.0)

lustre-rsync-test test 2b failed:
https://maloo.whamcloud.com/test_sets/59afba4a-1502-11e3-ba63-52540035b04c

Comment by Bob Glossman (Inactive) [ 16/Sep/13 ]

another
https://maloo.whamcloud.com/test_sets/7c10d8f4-1e02-11e3-b42b-52540035b04c

Comment by Nathaniel Clark [ 10/Oct/13 ]

The spate of ZFS failures seems to be related to dbench not having started within the given 20s at the beginning of test 2b. Here is a patch to wait longer if necessary:

http://review.whamcloud.com/7914
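
The general idea of waiting longer if necessary can be sketched as a polling loop with a configurable timeout. This Python sketch is not the content of the patch above, which is not shown in this ticket; wait_until and simulate are hypothetical names, and the simulated clock exists only so the behaviour can be checked without real delays.

```python
import time


def wait_until(predicate, timeout, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate until it is true or timeout seconds have passed."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def simulate(start_delay, timeout):
    """Pretend the workload starts after start_delay seconds of fake time."""
    now = [0.0]

    def fake_clock():
        return now[0]

    def fake_sleep(seconds):
        now[0] += seconds

    return wait_until(lambda: now[0] >= start_delay, timeout,
                      clock=fake_clock, sleep=fake_sleep)
```

In the simulation, a workload that needs 25 seconds to start is missed with a 20 second limit and caught with a 60 second limit.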

Comment by Andreas Dilger [ 30/Oct/13 ]

It seems this bug has been subverted from its original purpose of tracking a 2.1/2.4 interop problem into something unrelated that also causes test_2b to fail (dbench not starting quickly enough). It would be better to fix that problem in a separate bug, so that when the patch lands that bug can be closed, and this one is not closed.

Comment by Jian Yu [ 03/Dec/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/59/
Distro/Arch: RHEL6.4/x86_64

The same failure occurred:
https://maloo.whamcloud.com/test_sets/2375f856-5817-11e3-b8c3-52540035b04c

Comment by Jian Yu [ 13/Dec/13 ]

Another instance on the Lustre b2_4 branch:
https://maloo.whamcloud.com/test_sets/ed719c42-6356-11e3-8c76-52540035b04c

Comment by Jian Yu [ 08/Jan/14 ]

An instance on Lustre b2_5 branch:
https://maloo.whamcloud.com/test_sets/40cbb89e-7696-11e3-8c14-52540035b04c

Comment by Bob Glossman (Inactive) [ 15/Jan/14 ]

an instance in master:
https://maloo.whamcloud.com/test_sets/736be8cc-7dc2-11e3-bfda-52540035b04c

Comment by Jian Yu [ 07/Feb/14 ]

More instances on Lustre b2_5 branch:
https://maloo.whamcloud.com/test_sets/2d4b5c08-89b9-11e3-ae0e-52540035b04c
https://maloo.whamcloud.com/test_sets/eec3c6d4-96c2-11e3-b941-52540035b04c

Comment by Bruno Faccini (Inactive) [ 28/Feb/14 ]

+1 on b2_5 branch : https://maloo.whamcloud.com/test_sessions/26ab637c-9b91-11e3-95f0-52540035b04c

Comment by Andreas Dilger [ 18/Mar/14 ]

I'm using LU-4781 as a replacement for this bug, which I think was fixed long ago.
