fiemap FIEMAP_FLAG_SYNC flag expects filemap_write_and_wait() or similar

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0

    Description

      fiemap with FIEMAP_FLAG_SYNC can race with the client while it is still writing data to disk.

      In such a case the fiemap() call reports that the data is not on disk (no data for the range), and cp then truncates (or sparse-fills) the destination based on the size of the file/extent.

      strace shows that FIEMAP_FLAG_SYNC was sent. Further, the user reports that 'sync; cp <blah>' does not fail, and that a newer cp that uses copy_file_range() also does not fail.

      Looking further, FIEMAP_FLAG_SYNC expects the data to be on disk, i.e. filemap_write_and_wait(), not just filemap_fdatawrite().
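For context, here is a minimal userspace sketch of the kind of extent query described above. It assumes a Linux system with <linux/fiemap.h>. The helper name count_extents_synced is hypothetical, and the explicit fsync() is an assumption that mirrors the 'sync; cp' workaround from the report; it is not what the Lustre patch does.

```c
#include <errno.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush fd, then ask the file system how many
 * extents back the file. Returns the extent count, or -1 with errno set.
 */
long count_extents_synced(int fd)
{
    struct fiemap fm;

    /* Workaround-style flush: do not rely on FIEMAP_FLAG_SYNC alone. */
    if (fsync(fd) != 0)
        return -1;

    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;  /* query the whole file */
    fm.fm_flags = FIEMAP_FLAG_SYNC;    /* the flag seen in the strace */
    fm.fm_extent_count = 0;            /* 0 = only count extents, return none */

    if (ioctl(fd, FS_IOC_FIEMAP, &fm) != 0)
        return -1;

    return (long)fm.fm_mapped_extents;
}
```

File systems without FIEMAP support fail the ioctl (typically with EOPNOTSUPP), so a caller should treat -1 as "unknown" rather than "no data".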

          Activity

            [LU-17124] fiemap FIEMAP_FLAG_SYNC flag expects filemap_write_and_wait() or similar

            simmonsja James A Simmons added a comment -

            A ticket exists for copy_file_range(). I just never got the cycles to implement it for Lustre. Also, RHEL7 doesn't support a proper hook for copy_file_range.
            lflis Lukasz Flis added a comment -

            Thank you for the patch. We tested the changes and unfortunately we still see corrupted destination files.

            stancheff Shaun Tancheff added a comment -

            Please try the patch and see if the sync() occurs early enough to resolve the corruption you are finding.

            If not, we may need to implement a file-system-specific copy_file_range().

            Thanks!

            gerrit Gerrit Updater added a comment -

            "Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53140
            Subject: LU-17124 llite: sync on splice write
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c6c5e7c6664fcaeb4b4df51f6e0562f35a91d985
            lflis Lukasz Flis added a comment - - edited

            Quick update.
            Forcing fsync(fd_out) before the copy_file_range() call fixes the problem with truncated output files.

            /*
             * Build and use:
             *   gcc -c -Wall -Werror -fpic ./this.c -o cfr.o
             *   gcc -shared -o ./cfr.so ./cfr.o -ldl
             *   LD_PRELOAD=/full_path_to/cfr.so cp A B
             */

            #define _GNU_SOURCE 1
            #include <dlfcn.h>
            #include <unistd.h>
            #include <stdio.h>
            #include <sys/types.h>
            #include <sys/stat.h>
            #include <fcntl.h>

            /* Signature of the real copy_file_range(); it returns ssize_t, not a pointer. */
            typedef ssize_t (*cfr_t)(int, loff_t *, int, loff_t *, size_t, unsigned int);
            static cfr_t p = NULL;

            /* Interposed wrapper: flush the destination, then call the real function. */
            ssize_t copy_file_range(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags)
            {
                fsync(fd_out);
                if (p == NULL) {
                    /* Look up the next (libc) definition on first use. */
                    p = (cfr_t)dlsym(RTLD_NEXT, "copy_file_range");
                    if (p == NULL) {
                        /* Error handling */
                        return -1;
                    }
                }
                return p(fd_in, off_in, fd_out, off_out, len, flags);
            }

            lflis Lukasz Flis added a comment -

            Two traces in the attachment, both captured for the same file

            #good
            ...
            uname({sysname="Linux", nodename="t0006", ...}) = 0
            copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 62924467
            copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 0
            ...
            #bad
            ...
            uname({sysname="Linux", nodename="t0006", ...}) = 0
            copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 19005440
            copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 1441792
            copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 0            <= short read?
            ...
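In the bad trace the two non-zero returns add up to 19005440 + 1441792 = 20447232 bytes, well short of the 62924467 bytes copied in the good trace, and the following 0 return is taken as end-of-file. A minimal sketch of a copy loop that would flag this case is below. It assumes Linux with glibc's copy_file_range() wrapper; the helper name copy_and_verify and the choice of EIO for an early end-of-file are hypothetical and not taken from cp or Lustre.

```c
#define _GNU_SOURCE 1
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy all of fd_in to fd_out with copy_file_range()
 * and check the total against the source size taken before the copy.
 * Returns the bytes copied, or -1 on error. errno is set to EIO when
 * copy_file_range() reports end-of-file before st_size bytes were copied.
 */
ssize_t copy_and_verify(int fd_in, int fd_out)
{
    struct stat st;
    off_t total = 0;

    if (fstat(fd_in, &st) != 0)
        return -1;

    while (total < st.st_size) {
        /* NULL offsets: use and advance the current file positions. */
        ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
                                    (size_t)(st.st_size - total), 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0) {
            /* Early end-of-file: the copy is shorter than the source. */
            errno = EIO;
            return -1;
        }
        total += n;
    }
    return (ssize_t)total;
}
```

This only detects the short copy; it does not explain why the source appeared shorter than its size at that moment.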
            
            

             


            stancheff Shaun Tancheff added a comment -

            If you are copying from Lustre to Lustre on the same file system, then I do not see that this patch would have any obvious effect on your use case. I do not see fiemap() being involved on el9.2 when copy_file_range() can be used.

            So I think your find | xargs -P32 cp <args> case may be exposing a separate bug.

            If you could capture strace(s) of the cp -v and attach a 'good' sample and a 'bad' sample, we may be able to see what is happening.
            lflis Lukasz Flis added a comment -

            This is how we copy the files:

            find ./source_dir -name *.exr | xargs -P32 -I'{}' cp -v {} ./dest_dir

            Parameter -P32 keeps 32 processes running in parallel

            lflis Lukasz Flis added a comment -

            Cyfronet here,

            Server: 2.15.3
            Client: 2.15.3 + LU-17124 llite: Write and wait on FIEMAP_FLAG_SYNC
            Client OS: Rocky 9, cp is using copy_file_range instead of read/write

            We observed the same issue prior to applying the patch. When copying 4096 files from one directory to another (on the same fs), in parallel using 32 processes, some copies were corrupted. Files were truncated by varying amounts, with around 10-30 destination files corrupted in total.

            Having applied the patch, we observe fewer corrupted files and fewer missing bytes.

            It seems that the solution doesn't fix the problem completely, or maybe we are missing some other patches between 2.15.3 and 2.16.0.
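Since the reported corruption shows up as truncated destination files, a size comparison after each copy is one cheap way to count affected files. The sketch below is a minimal illustration using stat(); the function name copy_is_truncated is hypothetical, and it only catches truncation, so a checksum comparison would be needed to rule out other kinds of corruption.

```c
#include <sys/stat.h>

/*
 * Hypothetical check: returns 1 if dst is shorter than src (truncated
 * copy), 0 if dst is at least as large, and -1 if either stat() fails.
 */
int copy_is_truncated(const char *src, const char *dst)
{
    struct stat s, d;

    if (stat(src, &s) != 0 || stat(dst, &d) != 0)
        return -1;

    return d.st_size < s.st_size ? 1 : 0;
}
```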

            pjones Peter Jones added a comment -

            Landed for 2.16


            People

              stancheff Shaun Tancheff
              stancheff Shaun Tancheff