Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4781

lustre-rsync-test test_2b: Replication of operation failed(-17)

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • Lustre 2.6.0, Lustre 2.5.3, Lustre 2.9.0
    • None
    • 3
    • 13146

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/7edd3618-90d0-11e3-91ee-52540035b04c.

      The sub-test test_2b failed with the following error in the test output:

      Replication of operation failed(-17): 4274 CREAT (1) [0x400000400:0x7f4:0x0] [0x400000400:0x7f3:0x0] SMALL.FIL
      Replication of operation failed(-17): 4275 CREAT (1) [0x400000400:0x7f5:0x0] [0x400000400:0x7ef:0x0] LARGE.FIL
      Replication of operation failed(-17): 4276 CREAT (1) [0x400000400:0x7f6:0x0] [0x400000400:0x7f3:0x0] MEDIUM.FIL
      Replication of operation failed(-17): 4277 CREAT (1) [0x400000400:0x7f7:0x0] [0x400000400:0x7f3:0x0] LARGE.FIL

      Info required for matching: lustre-rsync-test 2b

      Attachments

        Issue Links

          Activity

            [LU-4781] lustre-rsync-test test_2b: Replication of operation failed(-17)
            bogl Bob Glossman (Inactive) added a comment - - edited more on b2_8_fe: https://testing.hpdd.intel.com/test_sets/885f908c-7f8e-11e6-8a8c-5254006e85c2 https://testing.hpdd.intel.com/test_sets/25fa9bc8-7ec9-11e6-8afd-5254006e85c2

            2nd patch has landed to master for 2.9

            jgmitter Joseph Gmitter (Inactive) added a comment - 2nd patch has landed to master for 2.9

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20649/
            Subject: LU-4781 utils: handle EEXIST in lustre_rsync
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2a55f34bd5514e905f6049f4e293352bd09ffe91

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20649/ Subject: LU-4781 utils: handle EEXIST in lustre_rsync Project: fs/lustre-release Branch: master Current Patch Set: Commit: 2a55f34bd5514e905f6049f4e293352bd09ffe91
            yong.fan nasf (Inactive) added a comment - Another failure instance on master: https://testing.hpdd.intel.com/test_sets/f7240550-308a-11e6-a0ce-5254006e85c2

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/20649
            Subject: LU-4781 utils: handle EEXIST in lustre_rsync
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bb459c9631361cb28bd8fe367de99603e3d67ff0

            gerrit Gerrit Updater added a comment - John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/20649 Subject: LU-4781 utils: handle EEXIST in lustre_rsync Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: bb459c9631361cb28bd8fe367de99603e3d67ff0
            jhammond John Hammond added a comment -

            > For example, the file "STUDENTS.TV" FID [0x200006990:0x414:0x0] is created in record 5056 but that record appears twice in the log and the second time produces the -EEXIST error:

            This is because lustre_rsync is invoked with -c no which tells it not to clear the clangelog when done.

            jhammond John Hammond added a comment - > For example, the file "STUDENTS.TV" FID [0x200006990:0x414:0x0] is created in record 5056 but that record appears twice in the log and the second time produces the -EEXIST error: This is because lustre_rsync is invoked with -c no which tells it not to clear the clangelog when done.

            I think the more important problem is that it appears lustre_rsync is processing the same ChangeLog records twice. Is that because lustre_rsync is processing the logs incorrectly, because the "clear logs before record X" request isn't handled or isn't persistent (ala LU-7428) across restarts, or because of an internal ChangeLog bug that is returning bogus duplicate records due to ChangeLog index wrap, bad llog handling, etc?

            adilger Andreas Dilger added a comment - I think the more important problem is that it appears lustre_rsync is processing the same ChangeLog records twice. Is that because lustre_rsync is processing the logs incorrectly, because the "clear logs before record X" request isn't handled or isn't persistent (ala LU-7428 ) across restarts, or because of an internal ChangeLog bug that is returning bogus duplicate records due to ChangeLog index wrap, bad llog handling, etc?
            jhammond John Hammond added a comment -

            > I think it is strange that the error -17 = -EEXIST is the only one being hit. Is the MDS not properly registering or processing unlink records? Is lustre_rsync returning an error for CREAT records (e.g. using O_EXCL) when they should be allowed to open(O_CREAT) files that already exist? Is lustre_rsync not applying the CREAT records in the right parent directory?

            It uses mknod(). The -EEXIST errors are also seen on successful runs. But it's conceivable that there is a race where the unhandled -EEXIST causes the files to end up with different contents. I'll push a small patch.

            jhammond John Hammond added a comment - > I think it is strange that the error -17 = -EEXIST is the only one being hit. Is the MDS not properly registering or processing unlink records? Is lustre_rsync returning an error for CREAT records (e.g. using O_EXCL) when they should be allowed to open(O_CREAT) files that already exist? Is lustre_rsync not applying the CREAT records in the right parent directory? It uses mknod() . The -EEXIST errors are also seen on successful runs. But it's conceivable that there is a race where the unhandled -EEXIST causes the files to end up with different contents. I'll push a small patch.

            Hi John,

            Can you have another look at this one? It seems even with the landing of http://review.whamcloud.com/#/c/20471/ that the problem is still occurring.

            Thanks.
            Joe

            jgmitter Joseph Gmitter (Inactive) added a comment - Hi John, Can you have another look at this one? It seems even with the landing of http://review.whamcloud.com/#/c/20471/ that the problem is still occurring. Thanks. Joe

            I think it is strange that the error -17 = -EEXIST is the only one being hit. Is the MDS not properly registering or processing unlink records? Is lustre_rsync returning an error for CREAT records (e.g. using O_EXCL) when they should be allowed to open(O_CREAT) files that already exist? Is lustre_rsync not applying the CREAT records in the right parent directory?

            Is it worthwhile to submit a test patch to collect the full changelog from the dbench run so that it is possible to trace the complete lifetime of a parent directory and file so that this problem can be debugged more effectively?

            Looking at the lustre_rsync log it appears that we are processing the ChangeLog entries twice?
            https://testing.hpdd.intel.com/sub_tests/1a5aee48-2be6-11e6-a0ce-5254006e85c2
            https://testing.hpdd.intel.com/test_logs/1a89dee2-2be6-11e6-a0ce-5254006e85c2/show_text
            https://testing.hpdd.intel.com/test_logs/1aa09a56-2be6-11e6-a0ce-5254006e85c2/show_text

            For example, the file "STUDENTS.TV" FID [0x200006990:0x414:0x0] is created in record 5056 but that record appears twice in the log and the second time produces the -EEXIST error:

            **** Start 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV *****
            dest = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV; savedpath = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
            mkfile(1) /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV 
            Syncing data and attributes [0x200006990:0x414:0x0]
            llistxattr(/mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0],(nil)) returned 64, errno=0
            llistxattr /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] returned 64, errno=0
            	(trusted.version,10469904) rc=0x8
            	lsetxattr(), rc=0, errno=0
            	(trusted.link,10469904) rc=0x35
            	lsetxattr(), rc=0, errno=0
            	(trusted.lov,10469904) rc=0x38
            	lsetxattr(), rc=0, errno=0
            	(trusted.lma,10469904) rc=0x18
            	lsetxattr(), rc=0, errno=0
            	(lustre.lov,10469904) rc=0x38
            	lsetxattr(), rc=-1, errno=95
            setxattr: /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
            mkfile(1) /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV 
            Syncing data and attributes [0x200006990:0x414:0x0]
            llistxattr(/mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0],0x9fb200) returned 64, errno=0
            	(trusted.version,10469904) rc=0x8
            	lsetxattr(), rc=0, errno=0
            	(trusted.link,10469904) rc=0x35
            	lsetxattr(), rc=0, errno=0
            	(trusted.lov,10469904) rc=0x38
            	lsetxattr(), rc=0, errno=0
            	(trusted.lma,10469904) rc=0x18
            	lsetxattr(), rc=0, errno=0
            	(lustre.lov,10469904) rc=0x38
            	lsetxattr(), rc=-1, errno=95
            setxattr: /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
            ##### End 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV rc=0 #####
            :
            :
            ***** Start 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0]  *****
            ##### End 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0]  rc=0 #####
            :
            :
            :
            ***** Start 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV *****
            dest = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV; savedpath = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV
            mkfile(1) /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV 
            mkfile(1) /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV 
            ##### End 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV rc=-17 #####
            ***** Start 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0]  *****
            ##### End 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0]  rc=0 #####
            

            It appears that the log processing restarts at some point, but I don't know what the cause is. It might be ChangeLog wrapping in the kernel?

            :
            :
            ***** Start 5519 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0]  *****
            ##### End 5519 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0]  rc=0 #####
            ***** Start 5520 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0]  *****
            ##### End 5520 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0]  rc=0 #####
            ***** Start 4823 CREAT (1) [0x200006990:0x397:0x0] [0x200006990:0x395:0x0] client.txt *****
            dest = d2b.lustre-rsync-test/client.txt; savedpath = d2b.lustre-rsync-test/client.txt
            mkfile(1) /tmp/target/d2b.lustre-rsync-test/client.txt 
            mkfile(1) /tmp/target2/d2b.lustre-rsync-test/client.txt 
            ##### End 4823 CREAT (1) [0x200006990:0x397:0x0] [0x200006990:0x395:0x0] client.txt rc=-17 #####
            ***** Start 4824 CLOSE (11) [0x200006990:0x397:0x0] [0:0x0:0x0]  *****
            ##### End 4824 CLOSE (11) [0x200006990:0x397:0x0] [0:0x0:0x0]  rc=0 #####
            :
            :
            
            adilger Andreas Dilger added a comment - I think it is strange that the error -17 = -EEXIST is the only one being hit. Is the MDS not properly registering or processing unlink records? Is lustre_rsync returning an error for CREAT records (e.g. using O_EXCL ) when they should be allowed to open(O_CREAT) files that already exist? Is lustre_rsync not applying the CREAT records in the right parent directory? Is it worthwhile to submit a test patch to collect the full changelog from the dbench run so that it is possible to trace the complete lifetime of a parent directory and file so that this problem can be debugged more effectively? Looking at the lustre_rsync log it appears that we are processing the ChangeLog entries twice? https://testing.hpdd.intel.com/sub_tests/1a5aee48-2be6-11e6-a0ce-5254006e85c2 https://testing.hpdd.intel.com/test_logs/1a89dee2-2be6-11e6-a0ce-5254006e85c2/show_text https://testing.hpdd.intel.com/test_logs/1aa09a56-2be6-11e6-a0ce-5254006e85c2/show_text For example, the file "STUDENTS.TV" FID [0x200006990:0x414:0x0] is created in record 5056 but that record appears twice in the log and the second time produces the -EEXIST error: **** Start 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV ***** dest = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV; savedpath = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV mkfile(1) /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV Syncing data and attributes [0x200006990:0x414:0x0] llistxattr(/mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0],(nil)) returned 64, errno=0 llistxattr /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] returned 64, errno=0 (trusted.version,10469904) rc=0x8 lsetxattr(), rc=0, errno=0 (trusted.link,10469904) rc=0x35 lsetxattr(), rc=0, errno=0 (trusted.lov,10469904) rc=0x38 lsetxattr(), rc=0, errno=0 (trusted.lma,10469904) rc=0x18 lsetxattr(), rc=0, errno=0 (lustre.lov,10469904) rc=0x38 lsetxattr(), rc=-1, errno=95 setxattr: /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV mkfile(1) /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV Syncing data and attributes [0x200006990:0x414:0x0] llistxattr(/mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0],0x9fb200) returned 64, errno=0 (trusted.version,10469904) rc=0x8 lsetxattr(), rc=0, errno=0 (trusted.link,10469904) rc=0x35 lsetxattr(), rc=0, errno=0 (trusted.lov,10469904) rc=0x38 lsetxattr(), rc=0, errno=0 (trusted.lma,10469904) rc=0x18 lsetxattr(), rc=0, errno=0 (lustre.lov,10469904) rc=0x38 lsetxattr(), rc=-1, errno=95 setxattr: /mnt/lustre/.lustre/fid/[0x200006990:0x414:0x0] /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV ##### End 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV rc=0 ##### : : ***** Start 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0] ***** ##### End 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0] rc=0 ##### : : : ***** Start 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV ***** dest = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV; savedpath = d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV mkfile(1) /tmp/target/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV mkfile(1) /tmp/target2/d2b.lustre-rsync-test/clients/client1/~dmtmp/PARADOX/STUDENTS.TV ##### End 5056 CREAT (1) [0x200006990:0x414:0x0] [0x200006990:0x3f0:0x0] STUDENTS.TV rc=-17 ##### ***** Start 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0] ***** ##### End 5058 CLOSE (11) [0x200006990:0x414:0x0] [0:0x0:0x0] rc=0 ##### It appears that the log processing restarts at some point, but I don't know what the cause is. It might be ChangeLog wrapping in the kernel? : : ***** Start 5519 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0] ***** ##### End 5519 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0] rc=0 ##### ***** Start 5520 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0] ***** ##### End 5520 CLOSE (11) [0x200006990:0x460:0x0] [0:0x0:0x0] rc=0 ##### ***** Start 4823 CREAT (1) [0x200006990:0x397:0x0] [0x200006990:0x395:0x0] client.txt ***** dest = d2b.lustre-rsync-test/client.txt; savedpath = d2b.lustre-rsync-test/client.txt mkfile(1) /tmp/target/d2b.lustre-rsync-test/client.txt mkfile(1) /tmp/target2/d2b.lustre-rsync-test/client.txt ##### End 4823 CREAT (1) [0x200006990:0x397:0x0] [0x200006990:0x395:0x0] client.txt rc=-17 ##### ***** Start 4824 CLOSE (11) [0x200006990:0x397:0x0] [0:0x0:0x0] ***** ##### End 4824 CLOSE (11) [0x200006990:0x397:0x0] [0:0x0:0x0] rc=0 ##### : :

            People

              jhammond John Hammond
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: