[LU-15431] read may return stale data from replicated file Created: 11/Jan/22  Updated: 27/Nov/23  Resolved: 14/Jul/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: flr-improvement

Issue Links:
Duplicate
is duplicated by LU-15434 evicted client can return stale data ... Resolved
Related
is related to LU-11695 disabling the xattr cache on client f... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807
Epic Link: FLR tech debt review

 Description   

FLR doesn't hold the layout lock (LL) during IO (OST_READ) operations, but verifies the layout at the beginning of the IO and at the end of the IO.
The problem is that pages get the Uptodate flag from the brw_interpret context, while the layout is only checked for changes later; the pages stay marked uptodate until that point.
Three processes are involved:

  • 1st is modifying the file
  • 2nd initiated OST_READ and is going to repeat OST_READ due to the layout change
  • 3rd finds the result of that OST_READ (stale data) in ll_do_fast_read()


 Comments   
Comment by Alex Zhuravlev [ 11/Jan/22 ]

Gentlemen, please have a look.

Comment by Andreas Dilger [ 21/Jan/22 ]

A second, more serious case can happen:

  • 1st client writes to file, MDS marks some replica(s) stale
  • 2nd client has cached OST extent lock of (now) stale objects
  • 2nd client only fast-reads from pages cached under the extent lock of the stale object (e.g. 4KB/2KB/1KB sized/aligned reads).

Because fast reads do not check DLM locks or the layout version (they assume that if a page is in cache it is valid), there is nothing on the second client that will trigger a layout refresh or reading from the primary OST object.
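To make the second client's behaviour concrete, here is a minimal userspace sketch (plain POSIX only; the file path and loop are made up for illustration) of the kind of small, page-cache-resident read loop that would keep returning stale data, because it never leaves the fast-read path:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	/* hypothetical replicated file; assume another client has just
	 * written it and the MDS marked this client's mirror stale */
	int fd = open("/mnt/testfs/flr", O_RDONLY);

	if (fd < 0)
		return 1;

	/* the first read populates the page cache from the (soon to be
	 * stale) mirror under a cached OST extent lock */
	pread(fd, buf, sizeof(buf), 0);

	for (int i = 0; i < 10; i++) {
		/* small, aligned reads of an already-cached page are served
		 * by the fast-read path: no DLM or layout version check, so
		 * the stale data keeps coming back */
		memset(buf, 0, sizeof(buf));
		pread(fd, buf, 4096, 0);
		printf("read #%d: %.8s\n", i, buf);
		sleep(1);
	}
	close(fd);
	return 0;
}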

At some point the 2nd client would run out of cached pages, or possibly have a read that crosses a page boundary and so is not "fast", and would then check the layout version so that reads go to the primary object, but this could happen seconds after the cached object became stale.

It also isn't clear whether there is a mechanism for the layout update on the client to flush all cached pages of the stale replica objects from cache, or whether only OST reads for uncached pages access the primary replica objects.

Comment by Patrick Farrell [ 21/Jan/22 ]

Yuck, that’s true.

We definitely need to add a layout version check to the fast read code, but we also need to add some way to flush the old layout and data.  I am not sure how best to do that - layout changes very carefully don't flush everything today, but perhaps they should.

It would be something like “I detect a layout version change, so I flush all local data”.

I’m also a bit concerned about pending writes in cache - the FLR state cycle for the layout (read only, pending, writable) covers ongoing writes nicely, but doesn’t seem to handle dirty data already in cache.  It seems like maybe losing the layout lock should cause a flush to disk of any dirty data for the file?

And maybe losing the layout lock should just cause a complete cache flush/destroy.  Yuck.

Comment by Patrick Farrell [ 21/Jan/22 ]

So today, losing the layout lock ensures any ongoing I/O is restarted.  I think that’s handled ok.  But it does nothing for cached data, dirty or clean.

The layout lock is already required for doing I/O to a file.  Perhaps losing it should just cause a complete cache flush for that file?  Honestly this seems like the only safe option.  Adding version checks doesn’t solve the “dirty data in cache” problem and it could leave data and locks linked to the old layout staying in cache in a weird state where they were inaccessible but still present.

I’m not quite sure how that flush would interact with ongoing I/Os, but I don’t think I see another safe way to do this.

Comment by Patrick Farrell [ 21/Jan/22 ]

And to answer the unasked question about the cost of doing that - yes, it would be expensive and slow if a file is open many places.  But if a file is open many places why are you changing the layout?  Ugh - I note this interacts poorly with SEL, which would change the layout of a file open many places and rely on not destroying the cache to keep it reasonable.

I don’t know if we can respect that (and I don’t know if SEL has ever come in to real use.  I suspect not.).

Comment by Alex Zhuravlev [ 21/Jan/22 ]

Probably cache invalidation should be driven indirectly by the layout version change on the OST, not by the layout lock itself.

I’m also a bit concerned about pending writes in cache

LU-15300

Comment by Andreas Dilger [ 21/Jan/22 ]

Hmm, I ran a manual test to see if I could create the "fast read from stale mirror" problem, but it seems to be working as it should. Steps were roughly as follows:

client1# cp /etc/hosts /mnt/testfs/flr
client1# lfs mirror extend -N -c1 /mnt/testfs/flr
client1# while sleep 1; do dd if=/mnt/testfs/flr bs=20 count=1; done &
[verify via ldlm.namespaces.*.lru_size lock is on mirror#1]
client2# lfs mirror write -N2 -i /etc/passwd /mnt/testfs/flr
[note "lfs mirror write" does not mark mirror#1 stale]
[client1 still reading /etc/hosts from mirror#1]
client2# lfs setstripe --comp-set -I 65537 --comp-flags stale /mnt/testfs/flr
[client1 immediately changed to /etc/passwd data]

So it appears this is working correctly, and client1 either had its DLM lock on the stale object cancelled, or its layout lock update caused it to dump the local page cache for the stale object. IMHO, having the OST extent locks on the stale objects cancelled would be more robust, as this avoids dependency on the layout lock being updated on the client and any interaction with fast reads.

Side notes:

  • "--comp-set" is not listed in the "lfs setstripe" usage message, so does not appear in tab completion (fixed in patch https://review.whamcloud.com/41423 "LU-14385 tests: verify lfs setstripe comp-flags and flags options" on master).
  • it would be useful to have an "lfs mirror" equivalent to mark a mirror stale
  • it would be useful to have an option to "lfs mirror write" to mark the other mirrors stale. Unlike normal writes, this would allow selecting a specific mirror for write.
  • this test case is the first one so far where I wanted to be able to split prefer-read from prefer-write, to verify that the read used mirror#1 and the write used mirror#2
  • there wasn't an easy way to verify which mirror the client1 reads were using and which object's pages were cached. I cleared the DLM cache and used lru_count to see which OSC had a lock, but that's not great. (LU-14858)
  • the llite.*.dump_page_cache output only showed the MDT FID and not the OST FID used.
  • it would be very useful if dump_page_cache had a "comment" line at the top to describe the fields
Comment by Andreas Dilger [ 21/Jan/22 ]

I don't think we need to cancel locks for every version update, since that happens frequently - only for updates that mark components stale, which is rare.

Comment by Patrick Farrell [ 21/Jan/22 ]

So what I've learned so far: the setstripe command to change the flags

lfs setstripe --comp-set -I 65537 --comp-flags stale /mnt/testfs/flr

appears to trigger a flush.

Specifically, it calls llapi_file_flush() (internally ioctl(LL_IOC_DATA_VERSION)), which sets LL_DV_WR_FLUSH in a command it sends to all the OSTs.

And that's turned into:

                if (dv->dv_flags & LL_DV_WR_FLUSH)
                        oa->o_flags |= OBD_FL_FLUSH; 

Which causes the OST to flush.

So changing the prefer flag manually doesn't work.
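For reference, here is a minimal userspace sketch of triggering that same flush directly via liblustreapi (assuming only llapi_file_flush() as named above; error handling kept minimal):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
	int fd, rc;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <lustre file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	/* llapi_file_flush() issues the LL_IOC_DATA_VERSION ioctl with
	 * LL_DV_WR_FLUSH, which (per the snippet above) sets OBD_FL_FLUSH
	 * and makes the OSTs flush the object */
	rc = llapi_file_flush(fd);
	if (rc < 0)
		fprintf(stderr, "flush failed: %d\n", rc);

	close(fd);
	return rc < 0 ? 1 : 0;
}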

Andreas, your suggestion to expose the mirror write and mirror read 'prefer' flags separately suddenly seems relevant, because I need to read data from one mirror then write data to the other without changing the layout.  I could also achieve this by taking one mirror temporarily offline but that's kind of challenging on my setup...  I'll think about it.

Comment by Andreas Dilger [ 21/Jan/22 ]

I see that llapi_file_flush() is also called in mirror_extend_file(), but not in llapi_mirror_resync_many(), which would be a simple change to make, but doesn't solve the whole problem. Alternatively, the resync should probably be bracketed with llapi_get_data_version() calls at the start/end so that it can verify that the data hasn't changed during resync? Otherwise, I don't see how we can be certain that the file hasn't been modified since the resync started.
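A rough sketch of that bracketing idea (assuming the usual llapi_get_data_version(fd, &dv, flags) interface; do_resync() is a hypothetical stand-in for llapi_mirror_resync_many()):

#include <stdio.h>
#include <linux/types.h>
#include <lustre/lustreapi.h>

/* hypothetical stand-in for the real resync call */
extern int do_resync(int fd);

int resync_checked(int fd)
{
	__u64 dv_before = 0, dv_after = 0;
	int rc;

	/* LL_DV_RD_FLUSH makes sure dirty client data is flushed first, so
	 * the data version actually reflects what is on the OSTs */
	rc = llapi_get_data_version(fd, &dv_before, LL_DV_RD_FLUSH);
	if (rc)
		return rc;

	rc = do_resync(fd);
	if (rc)
		return rc;

	rc = llapi_get_data_version(fd, &dv_after, LL_DV_RD_FLUSH);
	if (rc)
		return rc;

	/* if the file changed while the resync ran, the result cannot be
	 * trusted and the resync needs to be repeated */
	if (dv_before != dv_after) {
		fprintf(stderr, "file modified during resync, retry\n");
		return -1;
	}
	return 0;
}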

aside: the use of llapi_get_data_version() in lsom_update_one() for the "flush client data" side effect should be replaced with llapi_file_flush() to make this explicit, and avoid problems in the future.

Comment by Patrick Farrell [ 21/Jan/22 ]

I believe mirror resync is going to depend on the FLR state over on the MDS, that state machine with the cycle from readonly to stale, etc.  So it doesn't need to verify anything - it controls the state until the resync is complete, and once it's complete, well, things can change after that.

That's just what I believe, though.

Comment by Patrick Farrell [ 21/Jan/22 ]

OK, I (finally) found it.  We're fine for reads, though I still need to double check on writes.  The comment in ll_readpage needs to be corrected.

I noticed when a mirror became stale, the data stayed in cache.  But the stale data was being reliably destroyed at the start of the next I/O, which confused me.  I finally figured it out.

When the layout lock has been lost, when we next open the file (i.e., before we go to read it), we end up calling into ll_iget - it looks like we clear some part of the object metadata, I'm not quite sure what - and that calls into the layout conf code.

That eventually calls into lov_layout_change, which calls cl_object_prune and destroys all of the OSC objects associated with the file, which destroys the cached pages.  All before the 'read' code starts.  This should also work for a file which is held open, because I believe Lustre should force a dentry refresh, etc., which should have the same effect?

I'll talk about writes in a second.

Comment by Patrick Farrell [ 21/Jan/22 ]

As I reflect on pending writes, I realize I forgot the basic FLR lifecycle.

If there is a pending write (i.e., dirty data), the mirrors can't be in sync.  The I/O which created that dirty data brings the mirrors out of sync.  And bringing them back into sync does flush all the data - that's a requirement when implementing lfs mirror resync.

So writes are fine too.  I think this is all fine except that comment is misleading (and probably a leftover from an earlier design).

Comment by Alex Zhuravlev [ 22/Jan/22 ]

Please check the original description - it was about the early setting of PG_uptodate before the 2nd layout check. If the 2nd check is not required, then why is it there?

Comment by Andreas Dilger [ 22/Jan/22 ]

When the layout lock has been lost, when we next open the file (i.e., before we go to read it), we end up calling into ll_iget - it looks like we clear some part of the object metadata, I'm not quite sure what - and that calls into the layout conf code.

So this depends on the client being properly notified about the layout lock and canceling its own cache. It would be more robust if the OST extent lock was revoked from objects in the stale mirror(s) at the time they are marked stale, so that the client couldn't continue to read from the old objects.

This won't help in the case of a totally isolated client (i.e. not getting callbacks from either MDS or OSS), but it does help in the case of the client being evicted/incommunicado with the MDS, but still able to contact the OSS to read data.

Comment by Gerrit Updater [ 22/Jan/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46273
Subject: LU-15431 tests: a demo
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a6662c7ba1d1785e10bfacbc564a0d9d3ed4f44c

Comment by Alex Zhuravlev [ 22/Jan/22 ]

Please have a look at https://review.whamcloud.com/#/c/46273/
Basically the test reads in between llapi_file_flush() and the layout merge in lod_xattr_set(), then repeats the same read in the same process with the same old fd and gets old data, unless I missed something.
The output:

[   48.723121] Lustre: Mounted lustre-client
AAAA written:
osd-ldiskfs.lustre-OST0000.nonrotational=0
osd-ldiskfs.lustre-OST0001.nonrotational=0
osd-ldiskfs.lustre-OST0001.nonrotational=1
fail_loc=0x16f
fail_val=4
create replicate in background
[   49.049398] LustreError: 5192:0:(fail.c:138:__cfs_fail_timeout_set()) cfs_fail_timeout id 16f sleeping for 4ms
[   49.150019] LustreError: 5192:0:(fail.c:149:__cfs_fail_timeout_set()) cfs_fail_timeout id 16f awake
[   49.154344] lustre-OST0000: OST_READ from [0x100000000:0x3:0x0]
read right before layout merge, after data flush
[   49.549402] lustre-OST0001: OST_READ from [0x100010000:0x2:0x0]
wait for replication
replication done
BBBB written
fast read from pagecache
DATA: AAAA
DATA: AAAA
read via new open(2)
[   51.576003] lustre-OST0000: OST_READ from [0x100000000:0x3:0x0]
DATA: BBBB
[   51.586037] lustre-OST0000: OST_READ from [0x100000000:0x3:0x0]
DATA: BBBB
Comment by Alex Zhuravlev [ 23/Jan/22 ]

Here is another test, with no cfs_fail injections:

test_2000() {
	local tf=$DIR/$tfile
	local osts=$(comma_list $(osts_nodes))

	mkdir -p $MOUNT2 && mount_client $MOUNT2

	# to make replica on ost1 preferred for new writes
	do_nodes $osts \
		$LCTL set_param osd*.*OST*.nonrotational=0
	do_nodes $osts \
		$LCTL set_param osd*.*OST0001*.nonrotational=1

	$LFS setstripe -c1 -i0 $tf || error "can't create $tf"
	echo "AAAA" >$tf
	$LFS mirror extend -N -o1 $tf || error "can't make replica"
	log2 "replicated file created"

	cancel_lru_locks mdc
	cancel_lru_locks osc

	log2 "open(O_RDONLY) and first read from OST"
	$MULTIOP $tf oO_RDONLY:P4_z0P4_z0P4c &
	PID=$!
	sleep 1
	log2 "first read complete"

	echo "BBBB" >/tmp//$tfile
	dd if=/tmp/$tfile of=$DIR2/$tfile conv=notrunc >&/dev/null ||
		error "can't write BBBB"
	log2 "BBBB written which made replica on ost1 stale"

	log2 "fast read from pagecache in the original process"
	kill -USR1 $PID
	sleep 1

	log2 "read via $DIR2 new open(2)"
	$MULTIOP $DIR2/$tfile oO_RDONLY:P4c

	log2 "fast read from pagecache after 5s in the original process"
	sleep 5
	kill -USR1 $PID
	wait $PID

	log2 "read via $DIR new open(2)"
	$MULTIOP $tf oO_RDONLY:P4c
}
run_test 2000 "test2"

and the output:

== sanity-flr test 2000: test2 =========================== 20:26:32 (1642969592)
[   47.814975] Lustre: DEBUG MARKER: == sanity-flr test 2000: test2 =========================== 20:26:32 (1642969592)
Starting client: tmp.y4sbWzerEx:  -o user_xattr,flock tmp.y4sbWzerEx@tcp:/lustre /mnt/lustre2
[   47.982943] Lustre: Mounted lustre-client
osd-ldiskfs.lustre-OST0000.nonrotational=0
osd-ldiskfs.lustre-OST0001.nonrotational=0
osd-ldiskfs.lustre-OST0001.nonrotational=1
[   48.133925] Lustre: DEBUG MARKER: replicated file created
[   48.175340] Lustre: DEBUG MARKER: open(O_RDONLY) and first read from OST
DATA: AAAA in 6318
[   49.196269] Lustre: DEBUG MARKER: first read complete
[   49.222184] Lustre: DEBUG MARKER: BBBB written which made replica on ost1 stale
[   49.229754] Lustre: DEBUG MARKER: fast read from pagecache in the original process
DATA: AAAA in 6318
[   50.245148] Lustre: DEBUG MARKER: read via /mnt/lustre2 new open(2)
DATA: BBBB in 6326
[   50.261963] Lustre: DEBUG MARKER: fast read from pagecache after 5s in the original process
DATA: AAAA in 6318
[   55.279692] Lustre: DEBUG MARKER: read via new open(2)
DATA: BBBB in 6330

Notice that the read via the 2nd client (/mnt/lustre2) gets BBBB, and still after that the first client reads AAAA again (I checked the logs - it was the fast read path).

I used a trivial change to multiop: command P reads <num> bytes and prints as a string.
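The multiop change itself isn't attached here; a standalone approximation of what that 'P' command does (read <num> bytes from the already-open fd and print them as a string along with the pid, matching the "DATA: AAAA in 6318" lines above) would be something like:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* hypothetical approximation of the multiop 'P<num>' command described
 * above: read <num> bytes from the current offset of an already-open fd
 * and print them as a string together with the pid */
static void cmd_read_print(int fd, size_t num)
{
	char *buf = calloc(1, num + 1);

	if (buf == NULL)
		return;
	if (read(fd, buf, num) >= 0)
		printf("DATA: %s in %d\n", buf, (int)getpid());
	free(buf);
}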

Comment by Gerrit Updater [ 24/Jan/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46282
Subject: LU-15431 llite: skip fast reads if layout is invalid
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 371fa1b9153a0ea765f5aea73b35e00a25be0b97

Comment by Alex Zhuravlev [ 24/Jan/22 ]

So this depends on the client being properly notified about the layout lock and canceling its own cache. It would be more robust if the OST extent lock was revoked from objects in the stale mirror(s) at the time they are marked stale, so that the client couldn't continue to read from the old objects.

I think there are two options:

  • update LV (layout version) on now-stale OST objects, which in turn must cancel all OST locks
  • just stop trusting the pagecache upon LL cancellation (implemented in the patch above; see the sketch below)
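A rough sketch of the shape of option 2, not the actual patch text (it assumes ll_i2info(), ll_layout_version_get() and CL_LAYOUT_GEN_NONE behave as in current llite code, and the helper name is made up): the fast-read path refuses to serve pages once the layout generation is no longer valid, forcing the slow path, which revalidates the layout and prunes pages of stale mirror objects.

/* sketch only, not the landed patch: decide whether ll_do_fast_read() may
 * serve data straight from the page cache.  Once the layout lock has been
 * cancelled the layout generation is CL_LAYOUT_GEN_NONE, and returning
 * false here forces the caller back onto the normal read path. */
static bool ll_fast_read_allowed(struct inode *inode)
{
	struct ll_inode_info *lli = ll_i2info(inode);

	return ll_layout_version_get(lli) != CL_LAYOUT_GEN_NONE;
}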
Comment by Colin Faber [ 25/Jul/22 ]

Hi bzzz 

What are the next steps here? I thought we went with option 2, however with your patch above, shadow is saying -1. Can you please advise?

Comment by Alex Zhuravlev [ 26/Jul/22 ]

What are the next steps here? I thought we went with option 2, however with your patch above, shadow is saying -1. Can you please advise?

It'd be great if Patrick could step in and review the patch.

Comment by Patrick Farrell [ 05/Jul/23 ]

I'm not 100% sure we've closed all the races here, but I think the patch is good.  Just noting we might still have something that could happen, not 100% sure.

Comment by Alex Zhuravlev [ 10/Jul/23 ]

I'm 100% sure the patch closes at least one hole.

Comment by Gerrit Updater [ 14/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46282/
Subject: LU-15431 llite: skip fast reads if layout is invalid
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fe2fafa1af7edc251009e3fbd46665e05573bf4a

Comment by Peter Jones [ 14/Jul/23 ]

Landed for 2.16

Comment by Gerrit Updater [ 27/Nov/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53247
Subject: LU-15431 llite: skip fast reads if layout is invalid
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 2cfbd62a4bbcdb59c8a54ddc49b8617bbb363fc4
