[LU-15431] read may return stale data from replicated file Created: 11/Jan/22 Updated: 27/Nov/23 Resolved: 14/Jul/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alex Zhuravlev | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | flr-improvement |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Epic Link: | FLR tech debt review |
| Description |
|
FLR doesn't hold the layout lock (LL) during I/O (OST_READ) operations; instead it verifies the layout at the beginning and at the end of the I/O.
|
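The described behaviour amounts to a double-check pattern: sample the layout generation before the I/O, perform the read, then re-check the generation and restart if it changed. Below is a minimal userspace model of that pattern only - layout_generation(), do_read() and the retry loop are hypothetical stand-ins, not the actual Lustre client code. The hole discussed in the comments is a read path that marks a page up to date before the second check runs, so later fast reads never re-check at all.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

static uint64_t current_gen;    /* hypothetical: bumped when the layout changes */

static uint64_t layout_generation(void)
{
        return current_gen;
}

/* hypothetical stand-in for the actual OST_READ */
static ssize_t do_read(char *buf, size_t len)
{
        memset(buf, 'A', len);
        return (ssize_t)len;
}

static ssize_t read_with_layout_check(char *buf, size_t len)
{
        ssize_t rc;
        uint64_t gen;

        do {
                gen = layout_generation();              /* first check */
                rc = do_read(buf, len);
                /* the window discussed in this ticket: if the page is marked
                 * up to date (PG_uptodate) here, before the second check,
                 * later "fast reads" serve it from cache without re-running
                 * either check */
        } while (rc >= 0 && layout_generation() != gen); /* second check */

        return rc;
}

int main(void)
{
        char buf[8];

        printf("read %zd bytes\n", read_with_layout_check(buf, sizeof(buf)));
        return 0;
}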
| Comments |
| Comment by Alex Zhuravlev [ 11/Jan/22 ] |
|
Gentlemen, please have a look |
| Comment by Andreas Dilger [ 21/Jan/22 ] |
|
A second, more serious case can happen:
Because fast reads do not check DLM locks or the layout version (they assume that if a page is in cache it is valid), there is nothing on the second client that will trigger a layout refresh or reading from the primary OST object. At some point, the 2nd client would run out of cached pages, or possibly have a read that crosses a page boundary that is not "fast" and would check the layout version so that reads go to the primary object, but this could happen seconds after the cached object became stale. It also isn't clear whether there is a mechanism for the layout update on the client to cause all cached pages on the stale replica objects to be flushed from cache, or if only OST reads for uncached pages access the primary replica objects? |
| Comment by Patrick Farrell [ 21/Jan/22 ] |
|
Yuck, that’s true. We definitely need to add a layout version check to the fast read code, but we also need to add some way to flush the old layout and data. I am not sure how best to do that - layout changes today very carefully avoid flushing everything, but perhaps they should. It would be something like “I detect a layout version change, so I flush all local data”. I’m also a bit concerned about pending writes in cache - the FLR state cycle for the layout (read only, pending, writable) covers ongoing writes nicely, but doesn’t seem to handle dirty data already in cache. It seems like maybe losing the layout lock should cause a flush to disk of any dirty data for the file? And maybe losing the layout lock should just cause a complete cache flush/destroy. Yuck. |
| Comment by Patrick Farrell [ 21/Jan/22 ] |
|
So today, losing the layout lock ensures any ongoing I/O is restarted. I think that’s handled ok. But it does nothing for cached data, dirty or clean. The layout lock is already required for doing I/O to a file. Perhaps losing it should just cause a complete cache flush for that file? Honestly this seems like the only safe option. Adding version checks doesn’t solve the “dirty data in cache” problem, and it could leave data and locks linked to the old layout sitting in cache in a weird state where they are inaccessible but still present. I’m not quite sure how that flush would interact with ongoing I/Os, but I don’t think I see another safe way to do this. |
| Comment by Patrick Farrell [ 21/Jan/22 ] |
|
And to answer the unasked question about the cost of doing that - yes, it would be expensive and slow if a file is open in many places. But if a file is open in many places, why are you changing the layout? Ugh - I note this interacts poorly with SEL, which would change the layout of a file open in many places and rely on not destroying the cache to keep that reasonable. I don’t know if we can respect that (and I don’t know if SEL has ever come into real use. I suspect not.). |
| Comment by Alex Zhuravlev [ 21/Jan/22 ] |
|
Probably cache invalidation should be driven indirectly by a layout version change on the OST, not by the layout lock itself.
|
| Comment by Andreas Dilger [ 21/Jan/22 ] |
|
Hmm, I ran a manual test to see if I could create the "fast read from stale mirror" problem, but it seems to be working as it should. Steps were roughly as follows:
client1# cp /etc/hosts /mnt/testfs/flr
client1# lfs mirror extend -N -c1 /mnt/testfs/flr
client1# while sleep 1; do dd if=/mnt/testfs/flr bs=20 count=1; done &
[verify via ldlm.namespaces.*.lru_size lock is on mirror#1]
client2# lfs mirror write -N2 -i /etc/passwd /mnt/testfs/flr
[note "lfs mirror write" does not mark mirror#1 stale]
[client1 still reading /etc/hosts from mirror#1]
client2# lfs setstripe --comp-set -I 65537 --comp-flags stale /mnt/testfs/flr
[client1 immediately changed to /etc/passwd data]
So it appears this is working correctly, and client1 either had its DLM lock on the stale object cancelled, or its layout lock update caused it to dump the local page cache for the stale object. IMHO, having the OST extent locks on the stale objects cancelled would be more robust, as this avoids dependency on the layout lock being updated on the client and any interaction with fast reads. Side notes:
|
| Comment by Andreas Dilger [ 21/Jan/22 ] |
|
I don't think we need to cancel locks for every version update, since that happens frequently. Only for updates that mark components stale, which is rare. |
| Comment by Patrick Farrell [ 21/Jan/22 ] |
|
So what I've learned so far is that specifically the setstripe command to change the flags:
lfs setstripe --comp-set -I 65537 --comp-flags stale /mnt/testfs/flr
appears to trigger a flush. Specifically, it calls llapi_file_flush() (internally ioctl(LL_IOC_DATA_VERSION)), which sets LL_DV_WR_FLUSH in a command it sends to all the OSTs. And that's turned into:
if (dv->dv_flags & LL_DV_WR_FLUSH)
oa->o_flags |= OBD_FL_FLUSH;
Which causes the OST to flush. So changing the prefer flag manually doesn't work. Andreas, your suggestion to expose the mirror write and mirror read 'prefer' flags separately suddenly seems relevant, because I need to read data from one mirror then write data to the other without changing the layout. I could also achieve this by taking one mirror temporarily offline but that's kind of challenging on my setup... I'll think about it. |
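For reference, the userspace half of that flush can be exercised directly. A minimal sketch, assuming liblustreapi and its headers are installed and that the path points at a Lustre file; it issues the same LL_DV_WR_FLUSH request that llapi_file_flush() sends through ioctl(LL_IOC_DATA_VERSION):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/mnt/testfs/flr";
        __u64 dv = 0;
        int fd, rc;

        fd = open(path, O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* LL_DV_WR_FLUSH asks the servers to flush cached writes for this
         * file before returning the data version; this is what ends up as
         * OBD_FL_FLUSH on the OST side */
        rc = llapi_get_data_version(fd, &dv, LL_DV_WR_FLUSH);
        if (rc == 0)
                printf("%s: data version %llu\n", path, (unsigned long long)dv);
        else
                fprintf(stderr, "llapi_get_data_version failed: %d\n", rc);

        close(fd);
        return rc ? 1 : 0;
}

Built with something like gcc flush.c -llustreapi, this exercises the flush side effect without changing any component flags.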
| Comment by Andreas Dilger [ 21/Jan/22 ] |
|
I see that llapi_file_flush() is also called in mirror_extend_file(), but not in llapi_mirror_resync_many(), which would be a simple change to make, but doesn't solve the whole problem. Alternatively, the resync should probably be bracketed with an llapi_get_data_version() call at the start/end so that it can verify that the data hasn't changed during the resync? Otherwise, I don't see how we can be certain that the file hasn't been modified since the resync started. Aside: the use of llapi_get_data_version() in lsom_update_one() for the "flush client data" side effect should be replaced with llapi_file_flush() to make this explicit, and avoid problems in the future. |
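The bracketing suggested above could look roughly like the sketch below; it is not the actual llapi_mirror_resync_many() change, and do_resync() is a hypothetical placeholder. The point is only the pattern: sample the data version before the resync, sample it again afterwards, and treat a mismatch as "the file was modified during the resync".

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

/* hypothetical placeholder for the real resync work
 * (llapi_mirror_resync_many() and friends) */
static int do_resync(int fd)
{
        (void)fd;
        return 0;
}

static int resync_checked(int fd)
{
        __u64 dv_before = 0, dv_after = 0;
        int rc;

        /* sample the data version before starting */
        rc = llapi_get_data_version(fd, &dv_before, LL_DV_RD_FLUSH);
        if (rc)
                return rc;

        rc = do_resync(fd);
        if (rc)
                return rc;

        /* sample again; a mismatch means the file was written during resync */
        rc = llapi_get_data_version(fd, &dv_after, LL_DV_RD_FLUSH);
        if (rc)
                return rc;

        return dv_before == dv_after ? 0 : -EBUSY;
}

int main(int argc, char **argv)
{
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;
        if (resync_checked(fd))
                fprintf(stderr, "%s changed during resync, retry needed\n", argv[1]);
        close(fd);
        return 0;
}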
| Comment by Patrick Farrell [ 21/Jan/22 ] |
|
I believe mirror resync is going to depend on the FLR state over on the MDS, that state machine with the cycle from readonly to stale, etc. So it doesn't need to verify anything - it controls the state until the resync is complete, and once it's complete, well, things can change after that. That's just what I believe, though. |
| Comment by Patrick Farrell [ 21/Jan/22 ] |
|
OK, I (finally) found it. We're fine for reads, though I still need to double check writes. The comment in ll_readpage needs to be corrected. I noticed that when a mirror became stale, the data stayed in cache. But the stale data was being reliably destroyed at the start of the next I/O, which confused me. I finally figured it out. When the layout lock has been lost and we next open the file (i.e. before we go to read it), we end up calling into ll_iget - it looks like we clear some part of the object metadata, I'm not quite sure what - and that calls into the layout conf code. That eventually calls into lov_layout_change, which calls cl_object_prune and destroys all of the OSC objects associated with the file, which destroys the cached pages. All before the 'read' code starts. This should also work for a file which is held open, because I believe Lustre should force a dentry refresh, etc., which should have the same effect? I'll talk about writes in a second. |
| Comment by Patrick Farrell [ 21/Jan/22 ] |
|
As I reflect on pending writes, I realize I forgot the basic FLR lifecycle. If there is a pending write (i.e. dirty data), the mirrors can't be in sync. The I/O which created that dirty data brings the mirrors out of sync. And bringing them back into sync does flush all the data - that's a requirement when implementing lfs mirror resync. So writes are fine too. I think this is all fine, except that the comment is misleading (and probably a leftover from an earlier design). |
| Comment by Alex Zhuravlev [ 22/Jan/22 ] |
|
Please check the original description - it was about the early setting of PG_uptodate before the 2nd layout check. If the 2nd check is not required, then why is it here? |
| Comment by Andreas Dilger [ 22/Jan/22 ] |
So this depends on the client being properly notified about the layout lock and canceling its own cache. It would be more robust if the OST extent lock was revoked from objects in the stale mirror(s) at the time they are marked stale, so that the client couldn't continue to read from the old objects. This won't help in the case of a totally isolated client (i.e. not getting callbacks from either MDS or OSS), but it does help in the case of the client being evicted/incommunicado with the MDS, but still able to contact the OSS to read data. |
| Comment by Gerrit Updater [ 22/Jan/22 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46273 |
| Comment by Alex Zhuravlev [ 22/Jan/22 ] |
|
please, have a look at https://review.whamcloud.com/#/c/46273/
[ 48.723121] Lustre: Mounted lustre-client
AAAA written:
osd-ldiskfs.lustre-OST0000.nonrotational=0
osd-ldiskfs.lustre-OST0001.nonrotational=0
osd-ldiskfs.lustre-OST0001.nonrotational=1
fail_loc=0x16f
fail_val=4
create replicate in background
[ 49.049398] LustreError: 5192:0:(fail.c:138:__cfs_fail_timeout_set()) cfs_fail_timeout id 16f sleeping for 4ms
[ 49.150019] LustreError: 5192:0:(fail.c:149:__cfs_fail_timeout_set()) cfs_fail_timeout id 16f awake
[ 49.154344] lustre-OST0000: OST_READ from [0x100000000:0x3:0x0]
read right before layout merge, after data flush
[ 49.549402] lustre-OST0001: OST_READ from [0x100010000:0x2:0x0]
wait for replication
replication done
BBBB written
fast read from pagecache
DATA: AAAA
DATA: AAAA
read via new open(2)
[ 51.576003] lustre-OST0000: OST_READ from [0x100000000:0x3:0x0]
DATA: BBBB
[ 51.586037] lustre-OST0000: OST_READ from [0x100000000:0x3:0x0]
DATA: BBBB |
| Comment by Alex Zhuravlev [ 23/Jan/22 ] |
|
here is another test, with no injections with cfs_fail:
test_2000() {
local tf=$DIR/$tfile
local osts=$(comma_list $(osts_nodes))
mkdir -p $MOUNT2 && mount_client $MOUNT2
# to make replica on ost1 preferred for new writes
do_nodes $osts \
$LCTL set_param osd*.*OST*.nonrotational=0
do_nodes $osts \
$LCTL set_param osd*.*OST0001*.nonrotational=1
$LFS setstripe -c1 -i0 $tf || error "can't create $tf"
echo "AAAA" >$tf
$LFS mirror extend -N -o1 $tf || error "can't make replica"
log2 "replicated file created"
cancel_lru_locks mdc
cancel_lru_locks osc
log2 "open(O_RDONLY) and first read from OST"
$MULTIOP $tf oO_RDONLY:P4_z0P4_z0P4c &
PID=$!
sleep 1
log2 "first read complete"
echo "BBBB" >/tmp/$tfile
dd if=/tmp/$tfile of=$DIR2/$tfile conv=notrunc >&/dev/null ||
error "can't write BBBB"
log2 "BBBB written which made replica on ost1 stale"
log2 "fast read from pagecache in the original process"
kill -USR1 $PID
sleep 1
log2 "read via $DIR2 new open(2)"
$MULTIOP $DIR2/$tfile oO_RDONLY:P4c
log2 "fast read from pagecache after 5s in the original process"
sleep 5
kill -USR1 $PID
wait $PID
log2 "read via $DIR new open(2)"
$MULTIOP $tf oO_RDONLY:P4c
}
run_test 2000 "test2"
and the output:
== sanity-flr test 2000: test2 =========================== 20:26:32 (1642969592)
[ 47.814975] Lustre: DEBUG MARKER: == sanity-flr test 2000: test2 =========================== 20:26:32 (1642969592)
Starting client: tmp.y4sbWzerEx: -o user_xattr,flock tmp.y4sbWzerEx@tcp:/lustre /mnt/lustre2
[ 47.982943] Lustre: Mounted lustre-client
osd-ldiskfs.lustre-OST0000.nonrotational=0
osd-ldiskfs.lustre-OST0001.nonrotational=0
osd-ldiskfs.lustre-OST0001.nonrotational=1
[ 48.133925] Lustre: DEBUG MARKER: replicated file created
[ 48.175340] Lustre: DEBUG MARKER: open(O_RDONLY) and first read from OST
DATA: AAAA in 6318
[ 49.196269] Lustre: DEBUG MARKER: first read complete
[ 49.222184] Lustre: DEBUG MARKER: BBBB written which made replica on ost1 stale
[ 49.229754] Lustre: DEBUG MARKER: fast read from pagecache in the original process
DATA: AAAA in 6318
[ 50.245148] Lustre: DEBUG MARKER: read via /mnt/lustre2 new open(2)
DATA: BBBB in 6326
[ 50.261963] Lustre: DEBUG MARKER: fast read from pagecache after 5s in the original process
DATA: AAAA in 6318
[ 55.279692] Lustre: DEBUG MARKER: read via new open(2)
DATA: BBBB in 6330
Notice that the read via the 2nd client (/mnt/lustre2) gets BBBB, and still after that the first client reads AAAA again (I checked the logs - it was the fast read path). I used a trivial change to multiop: command P reads <num> bytes and prints them as a string. |
| Comment by Gerrit Updater [ 24/Jan/22 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46282 |
| Comment by Alex Zhuravlev [ 24/Jan/22 ] |
I think there are two options:
|
| Comment by Colin Faber [ 25/Jul/22 ] |
|
Hi bzzz, what are the next steps here? I thought we went with option 2, however with your patch above, shadow is saying -1. Can you please advise? |
| Comment by Alex Zhuravlev [ 26/Jul/22 ] |
it'd be great if Patrick could step in and review the patch. |
| Comment by Patrick Farrell [ 05/Jul/23 ] |
|
I'm not 100% sure we've closed all the races here, but I think the patch is good. Just noting we might still have something that could happen, not 100% sure. |
| Comment by Alex Zhuravlev [ 10/Jul/23 ] |
|
I'm 100% sure the patch closes at least one hole. |
| Comment by Gerrit Updater [ 14/Jul/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46282/ |
| Comment by Peter Jones [ 14/Jul/23 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 27/Nov/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53247 |