[LU-4781] lustre-rsync-test test_2b: Replication of operation failed(-17) Created: 18/Mar/14 Updated: 28/Oct/16 Resolved: 14/Jun/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.3, Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13146 |
| Description |
|
This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>.

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/7edd3618-90d0-11e3-91ee-52540035b04c.

The sub-test test_2b failed with the following error in the test output:

Info required for matching: lustre-rsync-test 2b |
| Comments |
| Comment by Jian Yu [ 20/Aug/14 ] |
|
Lustre build: https://build.hpdd.intel.com/job/lustre-b2_5/80/

lustre-rsync-test test 2b failed as follows:

Starting replication
Replication of operation failed(-17): 3808 CREAT (1) [0x200000400:0x6de:0x0] [0x200000400:0x6dd:0x0] client.txt
Replication of operation failed(-17): 3809 CREAT (1) [0x200000400:0x6df:0x0] [0x200000400:0x6dd:0x0] dbench
Replication of operation failed(-17): 3810 MKDIR (2) [0x200000400:0x6e0:0x0] [0x200000400:0x6dd:0x0] lib64
Replication of operation failed(-17): 3811 CREAT (1) [0x200000400:0x6e1:0x0] [0x200000400:0x6e0:0x0] libpopt.so.0
--------8<--------
Replication of operation failed(-17): 3954 CREAT (1) [0x200000400:0x76b:0x0] [0x200000400:0x766:0x0] LARGE.FIL
ioctl err -22: Invalid argument (22)
fid2path error: (/mnt/lustre, [0x200000400:0x903:0x0]) 22 Invalid argument
Replication of operation failed(-22): 4819 CREAT (1) [0x200000400:0x903:0x0] [0x200000400:0x70e:0x0] GRAPH1.CDR
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre
Target: /tmp/target
Target: /tmp/target2
Statuslog: /tmp/lustre_rsync.log
Changelog registration: cl4
Starting changelog record: 0
Clear changelog after use: no
Errors: 142
lustre_rsync took 8 seconds
Changelog records consumed: 1515
Only in /mnt/lustre/d2b.lustre-rsync-test/clients/client1/~dmtmp/WORDPRO: NEWS1_1B.PRN
lustre-rsync-test test_2b: @@@@@@ FAIL: Failure in replication; differences found.

Maloo report: https://testing.hpdd.intel.com/test_sets/de4f4760-26aa-11e4-9de1-5254006e85c2 |
| Comment by James Nunez (Inactive) [ 20/Feb/15 ] |
|
I've run into the same test failure on lustre-master tag 2.6.94 in review-dne-part-1. Logs are at https://testing.hpdd.intel.com/test_sessions/3d80b6cc-b898-11e4-84be-5254006e85c2. Other review-dne-part-1 failures: |
| Comment by James Nunez (Inactive) [ 24/Aug/15 ] |
|
Another instance on master (pre-2.8) during review-dne-part-1: |
| Comment by James Nunez (Inactive) [ 11/Dec/15 ] |
|
Another failure on master: |
| Comment by Jian Yu [ 12/Apr/16 ] |
|
I did the following experiment on the master branch and hit similar failures:

# lctl --device lustre-MDT0000 changelog_register
lustre-MDT0000: Registered changelog userid 'cl1'

3) on Client, copy /etc to /mnt/lustre:
# cp -a /etc /mnt/lustre/

4) on Client, format an ext4 filesystem as the target backup filesystem:
# mkfs.ext4 /dev/vda3
# mount -t ext4 /dev/vda3 /mnt/backup/

5) on Client, perform rsync to make the source and target filesystems identical:
# rsync -avh /mnt/lustre/ /mnt/backup/
sending incremental file list
--------8<--------
sent 27.47M bytes received 23.02K bytes 4.23M bytes/sec
total size is 27.38M speedup is 1.00

6) on Client, perform lustre_rsync:
# lustre_rsync --source=/mnt/lustre/ --target=/mnt/backup/ --mdt=lustre-MDT0000 --user=cl1 --statuslog lustre_rsync_statuslog --verbose
Replication of operation failed(-17): 1 MKDIR (2) [0x200000401:0x1:0x0] [0x200000007:0x1:0x0] etc
Replication of operation failed(-17): 2 CREAT (1) [0x200000401:0x2:0x0] [0x200000401:0x1:0x0] issue.net
Replication of operation failed(-17): 4 CREAT (1) [0x200000401:0x3:0x0] [0x200000401:0x1:0x0] yum.conf
Replication of operation failed(-17): 6 MKDIR (2) [0x200000401:0x4:0x0] [0x200000401:0x1:0x0] gnupg
Replication of operation failed(-17): 9 MKDIR (2) [0x200000401:0x5:0x0] [0x200000401:0x1:0x0] X11
Replication of operation failed(-17): 10 CREAT (1) [0x200000401:0x6:0x0] [0x200000401:0x5:0x0] prefdm
Replication of operation failed(-17): 12 MKDIR (2) [0x200000401:0x7:0x0] [0x200000401:0x5:0x0] fontpath.d
Replication of operation failed(-17): 13 SLINK (4) [0x200000401:0x8:0x0] [0x200000401:0x7:0x0] fonts-default
Replication of operation failed(-17): 15 SLINK (4) [0x200000401:0x9:0x0] [0x200000401:0x7:0x0] default-ghostscript
--------8<--------
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre/
Target: /mnt/backup/
Statuslog: lustre_rsync_statuslog
Changelog registration: cl1
Starting changelog record: 0
Errors: 1782
lustre_rsync took 14 seconds
Changelog records consumed: 3735 |
| Comment by Jian Yu [ 12/Apr/16 ] |
|
It turns out registering the changelog user needs to be done after running rsync. I did the experiment again and succeeded in running lustre_rsync:

# lustre_rsync --source=/mnt/lustre/ --target=/mnt/backup/ --mdt=lustre-MDT0000 --user=cl1 --statuslog lustre_rsync_statuslog --verbose
Lustre filesystem: lustre
MDT device: lustre-MDT0000
Source: /mnt/lustre/
Target: /mnt/backup/
Statuslog: lustre_rsync_statuslog
Changelog registration: cl1
Starting changelog record: 0
Errors: 0
lustre_rsync took 0 seconds
Changelog records consumed: 0 |
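In other words, the working ordering looks roughly like the sketch below, reusing the commands from the comments above. This is only an illustration of the ordering, not test code; the user id cl1 assumes changelog_register hands back the same id as in the experiment.

# 1. Make the backup identical to the source first, with plain rsync.
rsync -avh /mnt/lustre/ /mnt/backup/
# 2. Only then register the changelog user, so the changelog does not
#    contain records for files that rsync has already copied.
lctl --device lustre-MDT0000 changelog_register
# 3. Subsequent changes under /mnt/lustre are replicated incrementally.
lustre_rsync --source=/mnt/lustre/ --target=/mnt/backup/ \
	--mdt=lustre-MDT0000 --user=cl1 --statuslog lustre_rsync_statuslog --verbose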
| Comment by John Hammond [ 24/May/16 ] |
|
I think that this is because we use SIGSTOP and SIGCONT to control dbench.

echo Stopping dbench
$KILL -SIGSTOP $child_pid
echo Starting replication
$LRSYNC -l $LREPL_LOG -D $LRSYNC_LOG
check_diff $DIR/$tdir $TGT/$tdir

Even though kill() has returned, the SIGSTOP may still be pending (for example, if the dbench task is executing in the kernel, doing I/O). We should sleep for some time after sending SIGSTOP, or better yet, wait until all dbench tasks are in the stopped state. |
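A minimal sketch of the suggested wait (not the landed patch): poll until every dbench process reports state 'T' (stopped) before starting replication. The helper name wait_dbench_stopped and the use of pgrep/ps are assumptions for illustration, and it assumes the dbench processes are the ones receiving SIGSTOP.

wait_dbench_stopped() {
	# Poll until every dbench process shows state 'T' (stopped) in ps,
	# so replication only starts once the SIGSTOP has actually taken effect.
	local pid state pending
	while true; do
		pending=false
		for pid in $(pgrep dbench); do
			state=$(ps -o stat= -p "$pid" 2>/dev/null | tr -d ' ' | cut -c1)
			# A process that has already exited no longer counts as running.
			[ -n "$state" ] && [ "$state" != "T" ] && pending=true
		done
		$pending || break
		sleep 1
	done
}

$KILL -SIGSTOP $child_pid
wait_dbench_stopped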
| Comment by Gerrit Updater [ 26/May/16 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/20471 |
| Comment by Gerrit Updater [ 30/May/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/20471/ |
| Comment by James Nunez (Inactive) [ 06/Jun/16 ] |
|
lustre-rsync-test test 2b is still failing even after patch 20471 landed to master. Here are some recent failures: |
| Comment by Andreas Dilger [ 06/Jun/16 ] |
|
I think it is strange that the error -17 = -EEXIST is the only one being hit. Is the MDS not properly registering or processing unlink records? Is lustre_rsync returning an error for CREAT records (e.g. using O_EXCL) when they should be allowed to open(O_CREAT) files that already exist? Is lustre_rsync not applying the CREAT records in the right parent directory? Is it worthwhile to submit a test patch to collect the full changelog from the dbench run, so that it is possible to trace the complete lifetime of a parent directory and file and debug this problem more effectively?

Looking at the lustre_rsync log, it appears that we are processing the ChangeLog entries twice. For example, the file "STUDENTS.TV" FID [0x200006990:0x414:0x0] is created in record 5056, but that record appears twice in the log and the second time produces the -EEXIST error:

***** Start 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV *****
dest = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV; savedpath = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
mkfile(1) /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
Syncing data and attributes [0x200006990:0x414:0x0]
llistxattr(/mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0],(nil)) returned 64, errno=0
llistxattr /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] returned 64, errno=0
(trusted.version,10469904) rc=0x8 lsetxattr(), rc=0, errno=0
(trusted.link,10469904) rc=0x35 lsetxattr(), rc=0, errno=0
(trusted.lov,10469904) rc=0x38 lsetxattr(), rc=0, errno=0
(trusted.lma,10469904) rc=0x18 lsetxattr(), rc=0, errno=0
(lustre.lov,10469904) rc=0x38 lsetxattr(), rc=-1, errno=95
setxattr: /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
mkfile(1) /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
Syncing data and attributes [0x200006990:0x414:0x0]
llistxattr(/mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0],0x9fb200) returned 64, errno=0
(trusted.version,10469904) rc=0x8 lsetxattr(), rc=0, errno=0
(trusted.link,10469904) rc=0x35 lsetxattr(), rc=0, errno=0
(trusted.lov,10469904) rc=0x38 lsetxattr(), rc=0, errno=0
(trusted.lma,10469904) rc=0x18 lsetxattr(), rc=0, errno=0
(lustre.lov,10469904) rc=0x38 lsetxattr(), rc=-1, errno=95
setxattr: /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
##### End 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV rc=0 #####
:
:
***** Start 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0] *****
##### End 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0] rc=0 #####
:
:
:
***** Start 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV *****
dest = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV; savedpath = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
mkfile(1) /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
mkfile(1) /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
##### End 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV rc=-17 #####
***** Start 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0] *****
##### End 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0] rc=0 #####

It appears that the log processing restarts at some point, but I don't know what the cause is. It might be ChangeLog wrapping in the kernel?

:
:
***** Start 5519 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0] *****
##### End 5519 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0] rc=0 #####
***** Start 5520 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0] *****
##### End 5520 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0] rc=0 #####
***** Start 4823 CREAT (1) [0x200006990:0x397:0x0] [0x200006990:0x395:0x0] client.txt *****
dest = d2b.lustre-rsync-test/client.txt; savedpath = d2b.lustre-rsync-test/client.txt
mkfile(1) /tmp/target/d2b.lustre-rsync-test/client.txt
mkfile(1) /tmp/target2/d2b.lustre-rsync-test/client.txt
##### End 4823 CREAT (1) [0x200006990:0x397:0x0] [0x200006990:0x395:0x0] client.txt rc=-17 #####
***** Start 4824 CLOSE (11) [0x200006990:0x397:0x0] [0:0x0:0x0] *****
##### End 4824 CLOSE (11) [0x200006990:0x397:0x0] [0:0x0:0x0] rc=0 #####
:
: |
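For the "collect the full changelog" idea, a minimal sketch of dumping the MDT changelog and tracing a single FID's history is shown below. The MDT name and FID are taken from the log excerpts above; the output path is arbitrary, and it assumes a changelog user is already registered so records are retained.

# Dump the whole changelog from lustre-MDT0000, then pull out every record
# that mentions the STUDENTS.TV FID so its create/unlink/rename history can
# be inspected offline.
lfs changelog lustre-MDT0000 > /tmp/lustre-MDT0000.changelog
grep '0x200006990:0x414:0x0' /tmp/lustre-MDT0000.changelog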
| Comment by Joseph Gmitter (Inactive) [ 06/Jun/16 ] |
|
Hi John,

Can you have another look at this one? It seems that, even with the landing of http://review.whamcloud.com/#/c/20471/, the problem is still occurring.

Thanks. |
| Comment by John Hammond [ 06/Jun/16 ] |
|
> I think it is strange that the error -17 = -EEXIST is the only one being hit. Is the MDS not properly registering or processing unlink records? Is lustre_rsync returning an error for CREAT records (e.g. using O_EXCL) when they should be allowed to open(O_CREAT) files that already exist? Is lustre_rsync not applying the CREAT records in the right parent directory?

It uses mknod(). The -EEXIST errors are also seen on successful runs. But it's conceivable that there is a race where the unhandled -EEXIST causes the files to end up with different contents. I'll push a small patch. |
| Comment by Andreas Dilger [ 06/Jun/16 ] |
|
I think the more important problem is that it appears lustre_rsync is processing the same ChangeLog records twice. Is that because lustre_rsync is processing the logs incorrectly, because the "clear logs before record X" request isn't handled or isn't persistent (ala |
| Comment by John Hammond [ 06/Jun/16 ] |
|
> For example, the file "STUDENTS.TV" FID [0x200006990:0x414:0x0] is created in record 5056 but that record appears twice in the log and the second time produces the -EEXIST error:

This is because lustre_rsync is invoked with -c no, which tells it not to clear the changelog when done. |
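For reference, when -c no is used the consumed records stay in the changelog and can be cleared by hand; a hedged sketch is below. The user id (cl4) and end record (1515) are taken from the earlier log excerpt and are illustrative only.

# Acknowledge (clear) records up to index 1515 for changelog user cl4 on
# lustre-MDT0000, roughly what lustre_rsync does itself when allowed to
# clear the changelog after use.
lctl changelog_clear lustre-MDT0000 cl4 1515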
| Comment by Gerrit Updater [ 06/Jun/16 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/20649 |
| Comment by nasf (Inactive) [ 13/Jun/16 ] |
|
Another failure instance on master: |
| Comment by Gerrit Updater [ 14/Jun/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20649/ |
| Comment by Joseph Gmitter (Inactive) [ 14/Jun/16 ] |
|
The second patch has landed to master for 2.9. |
| Comment by Bob Glossman (Inactive) [ 21/Sep/16 ] |
|
more on b2_8_fe: |