[LU-4388] fsync on client does not cause OST_SYNCs to be issued Created: 17/Dec/13  Updated: 14/Jun/18  Resolved: 24/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Nathaniel Clark Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: MB

Issue Links:
Related
Severity: 3
Rank (Obsolete): 12048

 Description   

Issuing an fsync() on the client only causes MDS_SYNCs to be issued; no OST_SYNCs are sent.

Testing:
dd with conv=fsync vs. dd with oflag=sync



 Comments   
Comment by Andreas Dilger [ 19/Dec/13 ]

Looking at the ll_fsync() code path I see a number of gaps in the handling. This should be passing through the file range for fsync, since it is possible to sync only a range of the file. The Lustre OST protocol has always supported this, and it makes sense to implement it correctly.

On the OST side, it looks like ofd_sync() passes start/end correctly, but the new ofd_sync_hdl() does not. The whole call chain tgt_sync->dt_object_sync->osd_object_sync->do_sync() needs to take start and end arguments, and then pass them down to f_op->fsync() for those newer variants of fsync() that allow specifying a range of the file.

I pushed http://review.whamcloud.com/8626 to fix handling of {start, end} limits for fsync, but this will not fix the core problem of OST_SYNC not being sent at all. Looking at the twisty maze of CLIO that the fsync call passes through, it seems like the fsync() call should generate an OST_SYNC RPC, but I didn't look at actual debug logs to trace the callpath to see where it actually goes wrong.

It is probably worthwhile to create a test case that writes 900kB (less than one RPC, so it isn't sent immediately), fsyncs (via "multiop -y"), immediately fails the OST (marks it read-only) and aborts recovery when fsync() returns, then remounts the client and verifies that the data actually made it to disk. There are sanity.sh test_24{a,b}() tests that check fsync() behaviour in terms of returning an error to the caller on failure, but there isn't a test that verifies fsync() is actually working correctly. Since we've had problems in this area several times (it is not a problem that can be seen easily, since the data is normally written to disk within a few seconds by itself anyway), it probably makes sense to start by writing this test first, and then use it to debug the problem.

Comment by Andreas Dilger [ 20/Dec/13 ]

Note that my patch is not addressing the core issue here, just a problem that I found when I was looking at this bug originally. Someone with more CLIO skill or time to debug needs to look at this problem.

Comment by Peter Jones [ 24/Dec/13 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 24/Dec/13 ]

Nathaniel,

Can you describe the problem in more detail? What are the full commands and their respective outputs?

Comment by Zhenyu Xu [ 27/Dec/13 ]

dd ... conv=fsync issues several ll_writepages() calls with wbc->sync_mode == WB_SYNC_ALL, each of which calls
cl_sync_file_range(inode, start, end, CL_FSYNC_LOCAL, ignore_layout);

After the writes finish, it calls ll_fsync() with datasync==0:

00000080:00000001:0.0:1388123592.960635:0:20741:0:(rw.c:1112:ll_writepages()) Process entered
...
00000080:00200000:0.0:1388123592.960646:0:20741:0:(file.c:2742:cl_sync_file_range()) VFS Op:inode=[0x200000400:0x1:0x0](ffff8800006da138), sync_mode=1
...
00000080:00000001:1.0:1388123592.973789:0:20741:0:(file.c:2778:cl_sync_file_range()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:1.0:1388123592.973791:0:20741:0:(rw.c:1147:ll_writepages()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:1.0:1388123592.973794:0:20741:0:(file.c:2804:ll_fsync()) Process entered
00000080:00200000:1.0:1388123592.973794:0:20741:0:(file.c:2807:ll_fsync()) VFS Op:inode=[0x200000400:0x1:0x0](ffff8800006da138), datasync=0

Comment by Zhenyu Xu [ 27/Dec/13 ]

Nathaniel,

Using conv=fdatasync will issue the OST_SYNC.

dd conv=fsync does not set the datasync parameter for fsync(); conv=fdatasync sets it.

Comment by Andreas Dilger [ 27/Dec/13 ]

Does this mean there is a bug in our code, in that it doesn't send an OST_SYNC in the conv=fsync case?

Comment by Zhenyu Xu [ 27/Dec/13 ]

I'm a little confused about the datasync parameter of fsync(): does datasync=1 mean sync only the data, while datasync=0 means sync data+metadata?

Comment by Andreas Dilger [ 30/Dec/13 ]

Looking at the comments around vfs_fsync() and vfs_fsync_range()
http://lxr.free-electrons.com/source/fs/sync.c#L178

 * vfs_fsync - perform a fsync or fdatasync on a file
 * @file:               file to sync
 * @datasync:           only perform a fdatasync operation
 *
 * Write back data and metadata for @file to disk.  If @datasync is
 * set only metadata needed to access modified file data is written.

It does appear that datasync=1 means sync data and enough metadata to get the data back (e.g. i_size and block allocations) while datasync=0 means sync everything. See also http://man7.org/linux/man-pages/man2/fsync.2.html

fdatasync() is similar to fsync(), but does not flush modified
metadata unless that metadata is needed in order to allow a
subsequent data retrieval to be correctly handled. For example,
changes to st_atime or st_mtime (respectively, time of last access
and time of last modification; see stat(2)) do not require flushing
because they are not necessary for a subsequent data read to be
handled correctly. On the other hand, a change to the file size
(st_size, as made by say ftruncate(2)), would require a metadata
flush.

So if datasync=0 then we should always be sending OST_SYNC RPCs, but if datasync=1 then that is only needed if the size is changed or if blocks were allocated. That is a difficult/impossible decision for a client to make, so the OST_SYNC RPC should be sent to the OST with a DATASYNC flag and let the OSD decide. For older branches it probably makes sense to always send OST_SYNC.

Comment by Jinshan Xiong (Inactive) [ 31/Dec/13 ]

I see. So this is a misunderstanding of the @datasync parameter. This bug may exist in all Lustre client releases. Thanks for the explanation.

Comment by Zhenyu Xu [ 31/Dec/13 ]

patch tracking at http://review.whamcloud.com/8684

Comment by Jodi Levi (Inactive) [ 12/Mar/14 ]

Patch has landed to Master. Please reopen if more work is needed in this ticket.

Comment by Andreas Dilger [ 21/Mar/14 ]

The patch http://review.whamcloud.com/8626 has not landed yet.

Comment by Jodi Levi (Inactive) [ 24/Apr/14 ]

Patch landed to Master.

Comment by Hans Henrik Happe [ 15/May/15 ]

Should this have been fixed after 2.5.0, or only in 2.6.0? I see the same problem with b2_5 on ZFS (fdatasync works but is slow; fsync does not touch the disk at all). I think it is important that fsync works as expected.

Comment by Zhenyu Xu [ 16/May/15 ]

2.6.0 has the patches; 2.5.0 does not.

Comment by Hans Henrik Happe [ 17/May/15 ]

Given the severity (data loss, inconsistent databases, etc..), shouldn't it be included in the maintenance release?

Comment by Gerrit Updater [ 18/May/15 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/14840
Subject: LU-4388 llite: issue OST_SYNC for fsync()
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 3951a7c7aed155c206043838e4e800c94ac1c692

Comment by Gerrit Updater [ 25/Jun/15 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15390
Subject: LU-4388 test: add sanity test case
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3112c90215374573c987a800c192300817c62200

Generated at Sat Feb 10 01:42:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.