[LU-4781] lustre-rsync-test test_2b: Replication of operation failed(-17) Created: 18/Mar/14  Updated: 28/Oct/16  Resolved: 14/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.3, Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-2121 Test failure on test suite lustre-rsy... Resolved
is related to LU-4256 lustre-rsync-test test_2b: Failure in... Open
is related to LU-1458 lustre-rsync-test test_2b: old lustre... Resolved
is related to LU-6644 lustre-rsync-test test_2b:fid2path er... Resolved
Severity: 3
Rank (Obsolete): 13146

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/7edd3618-90d0-11e3-91ee-52540035b04c.

The sub-test test_2b failed with the following error in the test output:

Replication of operation failed(-17): 4274 CREAT (1) [0x400000400:0x7f4:0x0] [0x400000400:0x7f3:0x0] SMALL.FIL
Replication of operation failed(-17): 4275 CREAT (1) [0x400000400:0x7f5:0x0] [0x400000400:0x7ef:0x0] LARGE.FIL
Replication of operation failed(-17): 4276 CREAT (1) [0x400000400:0x7f6:0x0] [0x400000400:0x7f3:0x0] MEDIUM.FIL
Replication of operation failed(-17): 4277 CREAT (1) [0x400000400:0x7f7:0x0] [0x400000400:0x7f3:0x0] LARGE.FIL

Info required for matching: lustre-rsync-test 2b



 Comments   
Comment by Jian Yu [ 20/Aug/14 ]

Lustre build: https://build.hpdd.intel.com/job/lustre-b2_5/80/
Distro/Arch: RHEL6.5/x86_64
FSTYPE=ldiskfs

lustre-rsync-test test 2b failed as follows:

Starting replication
Replication of operation failed(-17): 3808 CREAT (1) [0x200000400:0x6de:0x0] [0x200000400:0x6dd:0x0] client.txt
Replication of operation failed(-17): 3809 CREAT (1) [0x200000400:0x6df:0x0] [0x200000400:0x6dd:0x0] dbench
Replication of operation failed(-17): 3810 MKDIR (2) [0x200000400:0x6e0:0x0] [0x200000400:0x6dd:0x0] lib64
Replication of operation failed(-17): 3811 CREAT (1) [0x200000400:0x6e1:0x0] [0x200000400:0x6e0:0x0] libpopt.so.0
--------8<--------
Replication of operation failed(-17): 3954 CREAT (1) [0x200000400:0x76b:0x0] [0x200000400:0x766:0x0] LARGE.FIL
ioctl err -22: Invalid argument (22)
fid2path error: (/mnt/lustre, [0x200000400:0x903:0x0]) 22 Invalid argument
Replication of operation failed(-22): 4819 CREAT (1) [0x200000400:0x903:0x0] [0x200000400:0x70e:0x0] GRAPH1.CDR
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre
Target: /tmp/target
Target: /tmp/target2
Statuslog: /tmp/lustre_rsync.log
Changelog registration: cl4
Starting changelog record: 0
Clear changelog after use: no
Errors: 142
lustre_rsync took 8 seconds
Changelog records consumed: 1515
Only in /mnt/lustre/d2b.lustre-rsync-test/clients/client1/~dmtmp/WORDPRO: NEWS1_1B.PRN
 lustre-rsync-test test_2b: @@@@@@ FAIL: Failure in replication; differences found. 

Maloo report: https://testing.hpdd.intel.com/test_sets/de4f4760-26aa-11e4-9de1-5254006e85c2

Comment by James Nunez (Inactive) [ 20/Feb/15 ]

I've run into the same test failure on lustre-master tag 2.6.94 in review-dne-part-1. Logs are at https://testing.hpdd.intel.com/test_sessions/3d80b6cc-b898-11e4-84be-5254006e85c2 .

Other review-dne-part-1 failures:
2015-07-09 15:01:36 - https://testing.hpdd.intel.com/test_sets/a6bf7468-26cd-11e5-8cf5-5254006e85c2
2015-07-10 03:03:45 - https://testing.hpdd.intel.com/test_sets/4df16926-26c4-11e5-8cf5-5254006e85c2
2015-07-11 21:05:09 - https://testing.hpdd.intel.com/test_sets/8d39b85e-282c-11e5-b280-5254006e85c2

Comment by James Nunez (Inactive) [ 24/Aug/15 ]

Another instance on master (pre-2.8) during review-dne-part-1:
2015-08-21 17:55:32 - https://testing.hpdd.intel.com/test_sets/8973091c-483b-11e5-813b-5254006e85c2
2015-08-24 11:09:22 - https://testing.hpdd.intel.com/test_sets/a1d65e12-4a5e-11e5-88e8-5254006e85c2
2015-08-26 21:21:35 - https://testing.hpdd.intel.com/test_sets/f981d950-4c45-11e5-8a6a-5254006e85c2

Comment by James Nunez (Inactive) [ 11/Dec/15 ]

Another failure on master:
2015-12-10 14:51:22 - https://testing.hpdd.intel.com/test_sets/5c0fb6a8-9f72-11e5-8c89-5254006e85c2

Comment by Jian Yu [ 12/Apr/16 ]

I did the following experiment on the master branch and hit similar failures:
1) format and setup the Lustre filesystem
2) on MDS, register a changelog user for MDT:

# lctl --device lustre-MDT0000 changelog_register
lustre-MDT0000: Registered changelog userid 'cl1'

3) on Client, copy /etc to /mnt/lustre:

# cp -a /etc /mnt/lustre/

4) on Client, format an ext4 filesystem as the target backup filesystem:

# mkfs.ext4 /dev/vda3
# mount -t ext4 /dev/vda3 /mnt/backup/

5) on Client, perform rsync to make the source and target filesystems identical:

# rsync -avh  /mnt/lustre/ /mnt/backup/
sending incremental file list
--------8<--------
sent 27.47M bytes  received 23.02K bytes  4.23M bytes/sec
total size is 27.38M  speedup is 1.00

6) on Client, perform lustre_rsync:

# lustre_rsync --source=/mnt/lustre/ --target=/mnt/backup/ --mdt=lustre-MDT0000 --user=cl1 --statuslog lustre_rsync_statuslog --verbose
Replication of operation failed(-17): 1 MKDIR (2) [0x200000401:0x1:0x0] [0x200000007:0x1:0x0] etc
Replication of operation failed(-17): 2 CREAT (1) [0x200000401:0x2:0x0] [0x200000401:0x1:0x0] issue.net
Replication of operation failed(-17): 4 CREAT (1) [0x200000401:0x3:0x0] [0x200000401:0x1:0x0] yum.conf
Replication of operation failed(-17): 6 MKDIR (2) [0x200000401:0x4:0x0] [0x200000401:0x1:0x0] gnupg
Replication of operation failed(-17): 9 MKDIR (2) [0x200000401:0x5:0x0] [0x200000401:0x1:0x0] X11
Replication of operation failed(-17): 10 CREAT (1) [0x200000401:0x6:0x0] [0x200000401:0x5:0x0] prefdm
Replication of operation failed(-17): 12 MKDIR (2) [0x200000401:0x7:0x0] [0x200000401:0x5:0x0] fontpath.d
Replication of operation failed(-17): 13 SLINK (4) [0x200000401:0x8:0x0] [0x200000401:0x7:0x0] fonts-default
Replication of operation failed(-17): 15 SLINK (4) [0x200000401:0x9:0x0] [0x200000401:0x7:0x0] default-ghostscript
--------8<--------
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre/
Target: /mnt/backup/
Statuslog: lustre_rsync_statuslog
Changelog registration: cl1
Starting changelog record: 0
Errors: 1782
lustre_rsync took 14 seconds
Changelog records consumed: 3735

Comment by Jian Yu [ 12/Apr/16 ]

It turns out that registering the changelog user needs to be done after running rsync: otherwise the file creations from step 3 are recorded in the changelog, and lustre_rsync replays them onto a target that already contains those files, failing with -EEXIST. I did the experiment again in that order and succeeded in running lustre_rsync:

# lustre_rsync --source=/mnt/lustre/ --target=/mnt/backup/ --mdt=lustre-MDT0000 --user=cl1 --statuslog lustre_rsync_statuslog --verbose
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre/
Target: /mnt/backup/
Statuslog: lustre_rsync_statuslog
Changelog registration: cl1
Starting changelog record: 0
Errors: 0
lustre_rsync took 0 seconds
Changelog records consumed: 0
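
For reference, a minimal sketch of the working order of operations, reusing the devices and mount points from the experiment above (the key change is registering the changelog user only after the target already holds the baseline copy, so the baseline creates are never replayed):

# cp -a /etc /mnt/lustre/
# mkfs.ext4 /dev/vda3
# mount -t ext4 /dev/vda3 /mnt/backup/
# rsync -avh /mnt/lustre/ /mnt/backup/
# lctl --device lustre-MDT0000 changelog_register    (on the MDS)
# ... make further changes under /mnt/lustre/ ...
# lustre_rsync --source=/mnt/lustre/ --target=/mnt/backup/ --mdt=lustre-MDT0000 --user=cl1 --statuslog lustre_rsync_statuslog --verbose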

Comment by John Hammond [ 24/May/16 ]

I think that this is because we use SIGSTOP and SIGCONT to control dbench.

    echo Stopping dbench
    $KILL -SIGSTOP $child_pid

    echo Starting replication
    $LRSYNC -l $LREPL_LOG -D $LRSYNC_LOG
    check_diff $DIR/$tdir $TGT/$tdir

Even though kill() has returned, the SIGSTOP may still be pending (for example, if the dbench task is executing in the kernel, doing I/O). We should sleep for some time after sending SIGSTOP or, better, wait until all dbench tasks are in the stopped state.
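
A minimal sketch of such a wait, in the same style as the test script (wait_dbench_stopped is an invented helper here, not the actual patch; it polls field 3 of /proc/<pid>/stat until every dbench process reports state 'T', i.e. stopped):

    # Poll until all dbench processes have actually absorbed SIGSTOP.
    wait_dbench_stopped() {
        local pid state
        for pid in $(pgrep dbench); do
            # Field 3 of /proc/<pid>/stat is the task state; 'T' means
            # stopped. If the process exits, awk fails and we move on.
            while state=$(awk '{ print $3 }' "/proc/$pid/stat" 2>/dev/null); do
                [ "$state" = "T" ] && break
                sleep 0.1
            done
        done
    }

    echo Stopping dbench
    $KILL -SIGSTOP $child_pid
    wait_dbench_stopped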

Comment by Gerrit Updater [ 26/May/16 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/20471
Subject: LU-4781 test: wait for dbench to stop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d5fe8aeb0b6ea00165ddc98e13c5fb270fad8e7d

Comment by Gerrit Updater [ 30/May/16 ]

Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/20471/
Subject: LU-4781 test: wait for dbench to stop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f7452bafb79d94c88b5fb390e431b06e79325f99

Comment by James Nunez (Inactive) [ 06/Jun/16 ]

lustre-rsync-test test 2b is still failing even after patch 20471 landed to master. Here are some recent failures:
2016-06-05 - https://testing.hpdd.intel.com/test_sets/f3692b30-2b07-11e6-80b9-5254006e85c2
2016-06-05 - https://testing.hpdd.intel.com/test_sets/0485f5aa-2b5e-11e6-80b9-5254006e85c2
2016-06-05 - https://testing.hpdd.intel.com/test_sets/9874c1e0-2b60-11e6-80b9-5254006e85c2
2016-06-06 - https://testing.hpdd.intel.com/test_sets/36032394-2ba6-11e6-acf3-5254006e85c2

Comment by Andreas Dilger [ 06/Jun/16 ]

I think it is strange that the error -17 = -EEXIST is the only one being hit. Is the MDS not properly registering or processing unlink records? Is lustre_rsync returning an error for CREAT records (e.g. using O_EXCL) when they should be allowed to open(O_CREAT) files that already exist? Is lustre_rsync not applying the CREAT records in the right parent directory?

Is it worthwhile to submit a test patch to collect the full changelog from the dbench run, so that the complete lifetime of a parent directory and file can be traced and this problem debugged more effectively?
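
For reference, the raw records around a suspect range can already be dumped from a client with lfs changelog, given a registered changelog user; the record numbers here are simply the ones discussed below:

# lfs changelog lustre-MDT0000 5050 5060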

Looking at the lustre_rsync log, it appears that we are processing the ChangeLog entries twice?
https://testing.hpdd.intel.com/sub_tests/1a5aee48-2be6-11e6-a0ce-5254006e85c2
https://testing.hpdd.intel.com/test_logs/1a89dee2-2be6-11e6-a0ce-5254006e85c2/show_text
https://testing.hpdd.intel.com/test_logs/1aa09a56-2be6-11e6-a0ce-5254006e85c2/show_text

For example, the file "STUDENTS.TV" FID [0x200006990:0x414:0x0] is created in record 5056 but that record appears twice in the log and the second time produces the -EEXIST error:

***** Start 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV *****
dest = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV; savedpath = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
mkfile(1) /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV 
Syncing data and attributes [0x200006990:0x414:0x0]
llistxattr(/mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0],(nil)) returned 64, errno=0
llistxattr /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] returned 64, errno=0
	(trusted.version,10469904) rc=0x8
	lsetxattr(), rc=0, errno=0
	(trusted.link,10469904) rc=0x35
	lsetxattr(), rc=0, errno=0
	(trusted.lov,10469904) rc=0x38
	lsetxattr(), rc=0, errno=0
	(trusted.lma,10469904) rc=0x18
	lsetxattr(), rc=0, errno=0
	(lustre.lov,10469904) rc=0x38
	lsetxattr(), rc=-1, errno=95
setxattr: /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
mkfile(1) /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV 
Syncing data and attributes [0x200006990:0x414:0x0]
llistxattr(/mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0],0x9fb200) returned 64, errno=0
	(trusted.version,10469904) rc=0x8
	lsetxattr(), rc=0, errno=0
	(trusted.link,10469904) rc=0x35
	lsetxattr(), rc=0, errno=0
	(trusted.lov,10469904) rc=0x38
	lsetxattr(), rc=0, errno=0
	(trusted.lma,10469904) rc=0x18
	lsetxattr(), rc=0, errno=0
	(lustre.lov,10469904) rc=0x38
	lsetxattr(), rc=-1, errno=95
setxattr: /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
##### End 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV rc=0 #####
:
:
***** Start 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0]  *****
##### End 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0]  rc=0 #####
:
:
:
***** Start 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV *****
dest = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV; savedpath = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
mkfile(1) /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV 
mkfile(1) /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV 
##### End 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV rc=-17 #####
***** Start 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0]  *****
##### End 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0]  rc=0 #####

It appears that the log processing restarts at some point, but I don't know what the cause is. Could it be ChangeLog wrapping in the kernel?

:
:
***** Start 5519 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0]  *****
##### End 5519 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0]  rc=0 #####
***** Start 5520 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0]  *****
##### End 5520 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0]  rc=0 #####
***** Start 4823 CREAT (1) [0x200006990:0x397:0x0] [0x200006990:0x395:0x0] client.txt *****
dest = d2b.lustre-rsync-test/client.txt; savedpath = d2b.lustre-rsync-test/client.txt
mkfile(1) /tmp/target/d2b.lustre-rsync-test/client.txt 
mkfile(1) /tmp/target2/d2b.lustre-rsync-test/client.txt 
##### End 4823 CREAT (1) [0x200006990:0x397:0x0] [0x200006990:0x395:0x0] client.txt rc=-17 #####
***** Start 4824 CLOSE (11) [0x200006990:0x397:0x0] [0:0x0:0x0]  *****
##### End 4824 CLOSE (11) [0x200006990:0x397:0x0] [0:0x0:0x0]  rc=0 #####
:
:
Comment by Joseph Gmitter (Inactive) [ 06/Jun/16 ]

Hi John,

Can you have another look at this one? It seems that, even with the landing of http://review.whamcloud.com/#/c/20471/, the problem is still occurring.

Thanks.
Joe

Comment by John Hammond [ 06/Jun/16 ]

> I think it is strange that the error -17 = -EEXIST is the only one being hit. Is the MDS not properly registering or processing unlink records? Is lustre_rsync returning an error for CREAT records (e.g. using O_EXCL) when they should be allowed to open(O_CREAT) files that already exist? Is lustre_rsync not applying the CREAT records in the right parent directory?

lustre_rsync uses mknod() to create files, and mknod() fails with -EEXIST when the path already exists. The -EEXIST errors are also seen on successful runs, but it's conceivable that there is a race where the unhandled -EEXIST causes the files to end up with different contents. I'll push a small patch.

Comment by Andreas Dilger [ 06/Jun/16 ]

I think the more important problem is that lustre_rsync appears to be processing the same ChangeLog records twice. Is that because lustre_rsync is processing the logs incorrectly, because the "clear logs before record X" request isn't handled or isn't persistent across restarts (à la LU-7428), or because of an internal ChangeLog bug that returns bogus duplicate records due to ChangeLog index wrap, bad llog handling, etc.?

Comment by John Hammond [ 06/Jun/16 ]

> For example, the file "STUDENTS.TV" FID [0x200006990:0x414:0x0] is created in record 5056 but that record appears twice in the log and the second time produces the -EEXIST error:

This is because lustre_rsync is invoked with -c no, which tells it not to clear the changelog when done.
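
With -c no, the records consumed on the first pass remain pending for the registered user and are delivered again on the next pass, which is how records such as 5056 appear twice. Outside the test, a consumer would acknowledge processed records explicitly, e.g. (end record illustrative):

# lctl changelog_clear lustre-MDT0000 cl4 5058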

Comment by Gerrit Updater [ 06/Jun/16 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/20649
Subject: LU-4781 utils: handle EEXIST in lustre_rsync
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bb459c9631361cb28bd8fe367de99603e3d67ff0

Comment by nasf (Inactive) [ 13/Jun/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/f7240550-308a-11e6-a0ce-5254006e85c2

Comment by Gerrit Updater [ 14/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20649/
Subject: LU-4781 utils: handle EEXIST in lustre_rsync
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2a55f34bd5514e905f6049f4e293352bd09ffe91

Comment by Joseph Gmitter (Inactive) [ 14/Jun/16 ]

The second patch has landed to master for 2.9.

Comment by Bob Glossman (Inactive) [ 21/Sep/16 ]

more on b2_8_fe:
https://testing.hpdd.intel.com/test_sets/885f908c-7f8e-11e6-8a8c-5254006e85c2
https://testing.hpdd.intel.com/test_sets/25fa9bc8-7ec9-11e6-8afd-5254006e85c2
