[LU-12328] FLR mirroring on 2.12.1-1 not usable if OST is down Created: 22/May/19  Updated: 25/Nov/19  Resolved: 18/Oct/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.1
Fix Version/s: Lustre 2.13.0, Lustre 2.12.4

Type: Bug Priority: Major
Reporter: Joe Frith Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL 7.6


Attachments: Zip Archive log.zip    
Issue Links:
Related
is related to LU-12525 sanity-flr test 200 and others aserti... Resolved
is related to LU-7236 connections on demand Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

See below for stripe details on the file "mirror10". If OST idx 1 is unmounted and made unavailable, read performance drops to 1/10th of the expected performance. The client has to time out on OST idx 1 before it tries to read from OST idx 7, and this happens for each 1MB block (the stripe size in use), resulting in very poor performance.

$ lfs getstripe mirror10
mirror10
 lcm_layout_gen: 5
 lcm_mirror_count: 2
 lcm_entry_count: 2
 lcme_id: 65537
 lcme_mirror_id: 1
 lcme_flags: init
 lcme_extent.e_start: 0
 lcme_extent.e_end: EOF
 lmm_stripe_count: 1
 lmm_stripe_size: 1048576
 lmm_pattern: raid0
 lmm_layout_gen: 0
 lmm_stripe_offset: 1
 lmm_pool: 01
 lmm_objects:
 - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x280a8:0x0] }

lcme_id: 131074
 lcme_mirror_id: 2
 lcme_flags: init
 lcme_extent.e_start: 0
 lcme_extent.e_end: EOF
 lmm_stripe_count: 1
 lmm_stripe_size: 1048576
 lmm_pattern: raid0
 lmm_layout_gen: 0
 lmm_stripe_offset: 7
 lmm_pool: 02
 lmm_objects:
 - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x28066:0x0] }


 Comments   
Comment by Zhenyu Xu [ 24/May/19 ]

When the client read path encounters a read error, it retries another mirror but does nothing to mark the mirror/replica unavailable. We could record the mirror's unavailability in memory so that, until the inode is dropped from the cache, reads avoid the unavailable mirror.
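As a rough illustration of that idea, here is a minimal standalone C sketch (hypothetical struct and function names, not Lustre code): keep per-file in-memory state recording which mirrors have failed reads, and skip them until the cached state is dropped with the inode.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory per-file state; in Lustre this would live in the
 * cached file layout, not in a standalone struct like this. */
struct file_mirror_state {
	int      mirror_count;
	uint32_t unavailable;   /* bitmask of mirrors that failed reads */
};

/* Mark a mirror unavailable after a read error on it. */
static void mark_mirror_unavailable(struct file_mirror_state *fms, int idx)
{
	fms->unavailable |= 1u << idx;
}

/* Pick the next mirror to read from, skipping ones marked unavailable. */
static int pick_read_mirror(const struct file_mirror_state *fms)
{
	for (int i = 0; i < fms->mirror_count; i++)
		if (!(fms->unavailable & (1u << i)))
			return i;
	return -1; /* all mirrors marked unavailable; caller must retry/wait */
}

int main(void)
{
	struct file_mirror_state fms = { .mirror_count = 2 };

	/* Simulate a read error on mirror 0 (the one on the downed OST). */
	mark_mirror_unavailable(&fms, 0);

	/* Subsequent 1MB reads go straight to mirror 1 instead of timing out
	 * on mirror 0 for every block. */
	printf("read from mirror %d\n", pick_read_mirror(&fms));
	return 0;
}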

Comment by Gerrit Updater [ 24/May/19 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/34952
Subject: LU-12328 flr: mark avoiding mirror in read
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6e9139178be3a37f60e39aaabb87871788e9844a

Comment by Patrick Farrell (Inactive) [ 24/May/19 ]

Jinshan pointed out that in the simple OST-offline case, he doesn't think the client should keep timing out on every request:
"hmm.. This definitely is not expected. As long as OST 1 is down, the request should be returned immediately from the OSC layer and the client should try to read the 2nd mirror, which is located on OST 7. For the following blocks, it should not even try OST 1 but go to OST 7 directly.
 
Would you please collect Lustre log and send it to me? You can collect logs on client side as follows:
0. create mirrored file
1. lctl set_param debug=-1 && lctl clear
2. lctl mark "======= start ========"
3. read the file
4. lctl dk > log.txt
 
and send me the log.txt file. If you can reproduce this problem consistently, please use a small file so that it would be easier to check the log.
 
Jinshan"

We should probably still have a concept of a mirror being unhealthy for when something is intermittent but not fully offline.

Comment by Andreas Dilger [ 24/May/19 ]

I guess the question is whether FLR expects the underlying OSC to fail the RPC quickly if the OST is offline, so that the upper layers do not need to track this themselves?

Comment by Andreas Dilger [ 24/May/19 ]

Tejas, can you please also attach your log files here.

Comment by Joe Frith [ 25/May/19 ]

I have attached the log file.  

Comment by Zhenyu Xu [ 25/May/19 ]

From the log I can see that for every 1MB read, the lov_io's lis_mirror_index is cleared, so the read starts from mirror 0 again.

In every ll_file_aio_read() iteration the lu_env is a new one, so lov_io::lis_mirror_index does not retain its old value, I think.

Comment by Andreas Dilger [ 25/May/19 ]

Alex's patch to move lu_env into a global hash indexed by process could help this case, so that the lu_env stays around over multiple syscalls. However, there would be lifetime issues for how long we need to keep the lu_env after a syscall completes, since we would not be notified when a process exits...

We could tie the state to the open file handle instead of the lu_env, which would be an improvement (file handles disappear when the process exits), but keeping it in the file layout (in memory) as suggested on the patch would be even better. This state affects all threads accessing that file, so there is no reason to keep it in a per-process structure like the lu_env or file descriptor and have to re-discover it for each thread (or syscall!) accessing that file.

My top preference is that we track the state on the OSC itself, since this is really global to all files stored on the OST. That is essentially what the "active" state is.
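A minimal standalone sketch of that preference (illustrative names only, not real Lustre symbols): the availability flag lives with the per-OST device state, so every file and every thread sees the same answer without re-discovering it.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-OSC (per-OST-device) state, shared by every file with
 * objects on that OST; this is essentially what the "active" state is. */
struct osc_device_state {
	bool active;   /* false while the import to the OST is down */
};

/* One entry per mirror: which OSC backs it. */
struct mirror {
	struct osc_device_state *osc;
};

/* Prefer the first mirror whose OSC is active; no per-file or per-process
 * state is needed, so every thread and file sees the same answer. */
static int choose_mirror(struct mirror *mirrors, int count)
{
	for (int i = 0; i < count; i++)
		if (mirrors[i].osc->active)
			return i;
	return -1;  /* all OSTs down: caller falls back to waiting */
}

int main(void)
{
	struct osc_device_state ost1 = { .active = false };  /* unmounted OST */
	struct osc_device_state ost7 = { .active = true };
	struct mirror mirrors[] = { { &ost1 }, { &ost7 } };

	printf("read mirror %d\n", choose_mirror(mirrors, 2));
	return 0;
}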

Comment by Jinshan Xiong [ 26/May/19 ]

I realized this piece of code has been removed from review:

                /* move replica index to the next one */
                spin_lock(&comp->lo_write_lock);
                if (index == comp->lo_preferred_replica) {
                        do {
                                index = (index + 1) % comp->lo_entry_count;
                                if (comp->lo_entries[index].lle_valid)
                                        break;
                        } while (index != comp->lo_preferred_replica);

                        /* reset preferred replica so that other threads can
                         * take advantage of our retries. */
                        comp->lo_preferred_replica = index;
                } else {
                        /* preferred index was moved by other thread */
                        index = comp->lo_preferred_replica;
                }
                spin_unlock(&comp->lo_write_lock);

                CDEBUG(D_VFSTRACE, DFID " move replica from %d to %d, "
                       "have retried: %d\n", PFID(lu_object_fid(lov2lu(obj))),
                       lio->lis_replica_index, index, io->ci_ndelay_tried);

                lio->lis_replica_index = index;

This piece of code just tracks the last successful mirror. Right now, since lo_preferred_mirror is also used by writes, we should add a new field called lo_last_success_mirror or similar to track the last successful mirror.

Tying the last successful mirror to the open file seems to have a problem: different open files would use different mirrors, so the same piece of data would be cached more than once.
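A minimal sketch of that suggestion, with hypothetical field names rather than the actual patch: writes keep using the preferred mirror, while reads consult and update a separate last-success field on the cached layout.

#include <stdio.h>

/* Hypothetical layout-level fields, sketching the suggestion above: keep the
 * write-time preference separate from the read-time memory of the last
 * mirror that worked. */
struct lov_comp_state {
	int  lo_preferred_mirror;     /* chosen by the MDS / used for writes */
	int  lo_last_success_mirror;  /* last mirror a read succeeded from */
	int  lo_mirror_count;
};

/* Start reads from the last mirror known to work. */
static int read_start_mirror(const struct lov_comp_state *lo)
{
	return lo->lo_last_success_mirror;
}

/* After a successful read, remember the mirror so other threads and later
 * syscalls on the same cached layout benefit from the retry already done. */
static void read_succeeded(struct lov_comp_state *lo, int mirror)
{
	lo->lo_last_success_mirror = mirror;
}

int main(void)
{
	struct lov_comp_state lo = { .lo_preferred_mirror = 0,
				     .lo_last_success_mirror = 0,
				     .lo_mirror_count = 2 };

	/* Mirror 0 failed, mirror 1 worked; remember it for the next read. */
	read_succeeded(&lo, 1);
	printf("next read starts from mirror %d\n", read_start_mirror(&lo));
	return 0;
}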

Meanwhile, we should revise the commit:

commit 5a6ceb664f07812c351786c1043da71ff5027f8c
Author: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Date:   Mon Sep 28 16:50:15 2015 +0300

    LU-7236 ptlrpc: idle connections can disconnect

to honor rq_no_delay in the RPC

Comment by Andreas Dilger [ 28/May/19 ]

I realized this piece of code has been removed from review:

Jinshan, I don't see any record of that code on master. Was it maybe only in a patch on the FLR branch, or only in your local checkout?

Alex, any thoughts on how to fix the issue Jinshan describes?

Comment by Joe Frith [ 30/May/19 ]

This is a major issue for us as we cannot continue with maintenance unless this issue is fixed. 

Will this be included in the next release 2.12.3? Any time-frame on this?

Comment by Alex Zhuravlev [ 30/May/19 ]

> to honor rq_no_delay in the RPC
I'm thinking about how to deal with that, but the important thing is that the semantics have changed - if the connection is idle, do you have to wait some time while it tries to reconnect?
For example, all the connections could be idle; do you expect an error in this case?

Comment by Andreas Dilger [ 30/May/19 ]

Alex, ideally there would be a two-stage approach for FLR. For reads it would try whichever OST is preferred. If that OSC is offline then it could be skipped initially, and the read would go to the other mirror copies if their OSCs are online. If none are online, then it should wait on the preferred OSC. For writes, the MDS selects which replica should be used, so the client will have to wait until the OSC is connected again.
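A standalone sketch of that two-stage selection (illustrative names, not actual client code):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical mirror descriptor; osc_online stands in for the import state
 * of the OSC backing this mirror. */
struct mirror_info {
	int  ost_idx;
	bool osc_online;
};

/* Stage 1: non-blocking pass, skip mirrors whose OSC is offline.
 * Stage 2: if nothing is online, fall back to the preferred mirror and let
 * the caller wait for its OSC to reconnect. */
static int select_read_mirror(struct mirror_info *m, int count, int preferred)
{
	if (m[preferred].osc_online)
		return preferred;

	for (int i = 0; i < count; i++)
		if (m[i].osc_online)
			return i;

	return preferred;  /* all offline: wait on the preferred OSC */
}

int main(void)
{
	struct mirror_info mirrors[] = {
		{ .ost_idx = 1, .osc_online = false },  /* OST 1 is down */
		{ .ost_idx = 7, .osc_online = true },
	};
	int idx = select_read_mirror(mirrors, 2, 0);

	printf("read from OST %d\n", mirrors[idx].ost_idx);
	return 0;
}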

Comment by Alex Zhuravlev [ 30/May/19 ]

Wouldn't that logic prevent balancing?

Comment by Jinshan Xiong [ 31/May/19 ]

Jinshan, I don't see any record of that code on master. Was it maybe only in a patch on the FLR branch, or only in your local checkout?

Yes, I found this piece of code in my local branch. That version of the patch was an earlier version of the implementation. Notice that the name was 'replica' at that time. I believe it is in one of the abandoned patches.

If the connection is in the IDLE state, it should probably return immediately when the RPC has `rq_no_delay` set, and I tend to think it should also kick off the reconnection asynchronously.

In the current implementation of FLR, the client iterates over all mirrors until it finds one available to read. If none is available, it waits 10ms and restarts the search. Hopefully it would then find a mirror that has become available to read, if the reconnect is kicked off earlier.
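A toy standalone model of that retry loop (hypothetical names, not the real read path):

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MIRROR_COUNT 2

/* In the real client this would consult the OSC/import state; here we simply
 * pretend only mirror 1 is reachable. */
static bool mirror_readable(int idx)
{
	return idx == 1;
}

/* Iterate the mirrors; if none is readable, back off 10ms and try again.
 * Kicking off reconnects asynchronously would make the retry pass more
 * likely to find an available mirror. */
static int find_readable_mirror(void)
{
	for (;;) {
		for (int i = 0; i < MIRROR_COUNT; i++)
			if (mirror_readable(i))
				return i;
		usleep(10000);  /* none available: wait 10ms and retry */
	}
}

int main(void)
{
	printf("reading from mirror %d\n", find_readable_mirror());
	return 0;
}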

Comment by Alex Zhuravlev [ 03/Jun/19 ]

In terms of latency it makes sense to use the first/any available target. In terms of throughput it makes sense to balance I/O among the targets. Thus I guess the code should be able to detect the point where I/O becomes "massive" for a specific object and only then use idling connections, but not sooner?
Would hitting max-RPC-in-flight be another way to detect when balancing makes sense?
Yet another thought is that it doesn't make sense to allocate/prepare an RPC against an idling connection unless we really want to use it - i.e. rq_no_delay isn't really the best interface for this kind of logic. And even when we want to enable that connection (due to balancing), we don't want to block with rq_no_delay, but rather proceed with the FULL one and initiate the idling one?

Comment by Jinshan Xiong [ 03/Jun/19 ]

Throughput from a single node has never been a goal for FLR, so the current logic is to find an available mirror and stick with that one.

And yes, I think we should not block `rq_no_delay`.

Let's put the IDLE connection aside for a bit first - if I remember correctly, the user is actually experiencing a problem where the connection is in DISCON; should we fix that first?

Comment by Andreas Dilger [ 03/Jun/19 ]

Alex, I definitely have some ideas on client-side read policy, in order to maximize global throughput vs. single-client throughput, from LDEV-436:

In particular, it would be good if reading the same data from a file will normally read from the same OST/replica so that this can be handled from the cache, rather than using a random replica and forcing disk reads on multiple OSTs. Not only does this avoid disk activity, but it also avoids the case of a single client doing reads from multiple OSTs and needing to get DLM locks from each one.

If the file is getting very large (e.g. multi-GB), it makes sense to spread the read workload across multiple OSTs in some deterministic manner (e.g. based on replica count and file offset) so that there are multiple OSTs active on the file, and at least several of the replicas active if many clients are reading at different offsets from the same file.

If there are large numbers of replicas for a single file, then the clients should spread the read workload across all of them (e.g. based on client NID), on the assumption that a user creates 10+ replicas of a file to increase the read bandwidth.
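A standalone sketch of what such a deterministic spread could look like (purely illustrative, not existing Lustre policy code): the mirror is derived from the file offset, so the same data is always read from the same replica and stays cache-friendly, and it is salted by the client NID so different clients fan out across many replicas.

#include <stdint.h>
#include <stdio.h>

#define CHUNK_BYTES (1ULL << 30)  /* spread in 1GiB chunks across replicas */

/* Deterministic mirror choice from offset and client identity. */
static unsigned pick_mirror(uint64_t offset, unsigned mirror_count,
			    uint64_t client_nid)
{
	uint64_t chunk = offset / CHUNK_BYTES;

	return (unsigned)((chunk + client_nid) % mirror_count);
}

int main(void)
{
	uint64_t nid = 0x1234;

	/* The same offset always maps to the same mirror for this client... */
	printf("offset 0     -> mirror %u\n", pick_mirror(0, 3, nid));
	/* ...while far-apart offsets land on different mirrors/OSTs. */
	printf("offset 2 GiB -> mirror %u\n",
	       pick_mirror(2ULL << 30, 3, nid));
	return 0;
}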

Comment by Alex Zhuravlev [ 04/Jun/19 ]

I'm not against changing the semantics of rq_no_delay, but it should be noted that other users (like lfs df) would then need to be changed, where the caller wants to try to connect at least once before giving up.
Given the extra cost of preparing an RPC, I think a better/cheaper interface to check connection status is needed.
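A minimal sketch of that idea (illustrative names, not real Lustre symbols): check the import state cheaply before allocating and packing an RPC at all, instead of building the request and relying on rq_no_delay to fail it.

#include <stdbool.h>
#include <stdio.h>

enum import_state { IMP_FULL, IMP_IDLE, IMP_DISCON };

struct import { enum import_state state; };

/* Cheap, allocation-free readiness check. */
static bool import_ready_for_read(const struct import *imp)
{
	return imp->state == IMP_FULL;
}

int main(void)
{
	struct import ost1 = { .state = IMP_DISCON };
	struct import ost7 = { .state = IMP_FULL };

	/* Skip OST 1 without ever preparing an RPC against it. */
	printf("ost1 ready: %d, ost7 ready: %d\n",
	       import_ready_for_read(&ost1), import_ready_for_read(&ost7));
	return 0;
}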

Comment by Gerrit Updater [ 08/Jun/19 ]

Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: https://review.whamcloud.com/35111
Subject: LU-12328 flr: preserve last read mirror
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7f9985832de7699f06fdef2916a280a3666ca7cf

Comment by Joe Frith [ 25/Jun/19 ]

Will this patch be included in Lustre 2.12.3? We are delaying maintenance because of this issue. 

Comment by Andreas Dilger [ 26/Jun/19 ]

As yet the patch has not landed on master, so that would need to happen before it can land to b2_12.

Comment by Gerrit Updater [ 07/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35111/
Subject: LU-12328 flr: preserve last read mirror
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 810f2a5fef577b4f0f6a58ab234cf29afd96c748

Comment by Gerrit Updater [ 09/Jul/19 ]

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35450
Subject: Revert "LU-12328 flr: preserve last read mirror"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4788166435a05e5fe39107ebbcb167e13a74bcc

Comment by Gerrit Updater [ 09/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35450/
Subject: Revert "LU-12328 flr: preserve last read mirror"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0a8750628d9a87f686b917c88e42093a52a78ae3

Comment by Joe Frith [ 30/Aug/19 ]

Did the patch get reverted after being included in master?

We are still hoping that this issue gets resolved so we can go ahead with the maintenance.

Comment by Andreas Dilger [ 31/Aug/19 ]

The patch was reverted because it was causing frequent crashes in testing (LU-12525).

The original patch https://review.whamcloud.com/34952 "LU-12328 flr: avoid reading unhealthy mirror" should fix the original problem, but it needs to be refreshed again.

Comment by Joe Frith [ 16/Sep/19 ]

Any chance this will get included in the 2.12.3? We are stuck and cannot migrate due to this issue. 

Comment by Andreas Dilger [ 08/Oct/19 ]

Tejas, I don't think this patch will make it into 2.12.3, because it hasn't yet landed to master.

However, I think the current version of the patch is in good enough shape for you to test. It would be useful if you could give the latest patch a try and let us know if this is working for you.

Comment by Joe Frith [ 11/Oct/19 ]

I did a quick test and the patch seems to work as expected; read performance is as expected with an unhealthy mirror. I did not, however, run comprehensive tests.

Comment by Gerrit Updater [ 18/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34952/
Subject: LU-12328 flr: avoid reading unhealthy mirror
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 39da3c06275e04e2a6e7f055cb27ee9dff1ea576

Comment by Peter Jones [ 18/Oct/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 22/Oct/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36550
Subject: LU-12328 flr: avoid reading unhealthy mirror
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 95b0b3d9aa1d0de120e788eef96d4a1f42ff9d6c

Comment by Gerrit Updater [ 21/Nov/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36550/
Subject: LU-12328 flr: avoid reading unhealthy mirror
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 02affb11d4162f23eadef7e0ed15982e11005a41
