[LU-17124] fiemap FIEMAP_FLAG_SYNC flag expects filemap_write_and_wait() or similar Created: 18/Sep/23  Updated: 15/Nov/23  Resolved: 28/Sep/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Shaun Tancheff Assignee: Shaun Tancheff
Resolution: Fixed Votes: 0
Labels: None

Attachments: File SHOT02_09010.exr.trace.bad     File SHOT02_09010.exr.trace.good    
Issue Links:
Related
is related to LU-11621 Add copy_file_range() API and use it ... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

fiemap FIEMAP_FLAG_SYNC can race while client is writing data to disk

In such a case fiemap() call returns that the data is not on disk (no data for the range) and cp can just truncates (or sparse fill) based on the size of the file/extent.

strace shows that FIEMAP_FLAG_SYNC was sent. Further and the user reports that a 'sync; cp <blah>' does not fail and a newer cp that uses copy_file_range() also does not fail.

Looking further FIEMAP_FLAG_SYNC expects the data to be on disk aka filemap_write_and_wait() not just filemap_fdatawrite()



 Comments   
Comment by Gerrit Updater [ 18/Sep/23 ]

"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52402
Subject: LU-17124 llite: Write and wait on FIEMAP_FLAG_SYNC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 39825b198f06234da2876637660663981d866bf1

Comment by Gerrit Updater [ 28/Sep/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52402/
Subject: LU-17124 llite: Write and wait on FIEMAP_FLAG_SYNC
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1af966fe6983f28e9d438b57944c4dccb8941c61

Comment by Peter Jones [ 28/Sep/23 ]

Landed for 2.16

Comment by Lukasz Flis [ 14/Nov/23 ]

Cyfronet here,

Server: 2.15.3
Client 2.15.3 +  LU-17124 llite: Write and wait on FIEMAP_FLAG_SYNC
Cllient OS: Rocky 9, cp is using copy_file_range instead of read/write

We have observed the same issue prior applying the patch. When copying 4096 files from one directory to another (on the same fs), in parallel using 32 processes, some copies where corrupted. Some files were truncated more or less, around 10-30 destination files corrupted in total.

Having applied the patch we observe less corrupted files and less bytes are missing. 

It seems that solution doesn't fix the problem completly, or maybe we miss some other patches between 2.15.3 and 2.16.0

Comment by Lukasz Flis [ 14/Nov/23 ]

This is how we copy the files:

find ./source_dir -name *.exr | xargs -P32 -I'{}' cp -v {} ./dest_dir

Parameter -P32 keeps 32 processes running in parallel

Comment by Shaun Tancheff [ 14/Nov/23 ]

If you are copying from Lustre to Lustre on the same file system then I do not see this patch would have any obvious effect on your use case. I do not see fiemap() being involved on el9.2 when copy_file_range() can be used.

So I think your find | xargs -P32 cp <args> case is may be exposing a separate bug.

If you could capture strace(s) of the cp -v and provide / attach a 'good' sample and a 'bad' sample we may be able to see what is happening.

Comment by Lukasz Flis [ 14/Nov/23 ]

Two traces in the attachment, both captured for the same file

#good
...
uname({sysname="Linux", nodename="t0006", ...}) = 0
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 62924467
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 0
...
#bad
...
uname({sysname="Linux", nodename="t0006", ...}) = 0
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 19005440
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 1441792
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 0            <= short read?
...

 

Comment by Lukasz Flis [ 14/Nov/23 ]

Quick update.
Forcing fsync(fd_out) before copy_file_range() call fixes the problem with truncated output files

 

 /*
 *   gcc -c -Wall -Werror -fpic ./this.c -o cfr.o
 *   gcc -ldl -shared -o ./cfr.so ./cfr.o
 *   LD_PRELOAD=/full_path_to/cfr.so cp A B
 *
 * */

#define _GNU_SOURCE 1
#include <dlfcn.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

typedef ssize_t *(*cfr_t)(int, loff_t*, int, loff_t*, size_t, unsigned int);
static cfr_t p = NULL;
ssize_t copy_file_range(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags)
{
    fsync(fd_out);
    ssize_t r=-1;
    if (p == NULL) {
        p =  dlsym(RTLD_NEXT, "copy_file_range");
        if (p == NULL) {
            /* Error handling */
            return r;
        }
    }    r = (ssize_t) p(fd_in, off_in, fd_out, off_out, len,flags);
    return r;
}

 

 

 

 

 

Comment by Gerrit Updater [ 15/Nov/23 ]

"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53140
Subject: LU-17124 llite: sync on splice write
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c6c5e7c6664fcaeb4b4df51f6e0562f35a91d985

Comment by Shaun Tancheff [ 15/Nov/23 ]

Please try the patch and see if the sync() occurs early enough to resolve the corruption you are finding.

If not we may need to implement a file-system specific copy_file_range()

Thanks!

Comment by Lukasz Flis [ 15/Nov/23 ]

Thank you for the patch. We tested the changes and unfortunately we still see corrupted destnation files. 

 

Comment by James A Simmons [ 15/Nov/23 ]

A ticket exist for copy_file_range(). I just never got the cycles to implement for Lustre. Also RHEL7 doesn't support a proper hook for copy_file_range.

Generated at Sat Feb 10 03:32:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.