[LU-12328] FLR mirroring on 2.12.1-1 not usable if OST is down Created: 22/May/19 Updated: 25/Nov/19 Resolved: 18/Oct/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.1 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Joe Frith | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | RHEL 7.6 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
See below for stripe details on the file "mirror10". If OST idx 1 is unmounted and made unavailable, read performance drops to roughly 1/10th of the expected performance. The client has to time out on OST idx 1 before it tries to read from OST idx 7, and this happens for every 1MB block (the stripe size in use), resulting in very poor performance.
$ lfs getstripe mirror10
mirror10
  lcm_layout_gen:    5
  lcm_mirror_count:  2
  lcm_entry_count:   2
    lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 1
      lmm_pool:          01
      lmm_objects:
      - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x280a8:0x0] }

    lcme_id:             131074
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 7
      lmm_pool:          02
      lmm_objects:
      - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x28066:0x0] }
|
| Comments |
| Comment by Zhenyu Xu [ 24/May/19 ] |
|
When the client read path encounters a read error, it retries another mirror but does nothing to mark the failed mirror/replica as unavailable. We could record the mirror's unavailability in memory so that, until the inode is dropped from cache, reads avoid the unavailable mirror.
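A minimal sketch of that idea, assuming a hypothetical per-file state structure (none of the names below are actual Lustre lov layout fields): remember which mirrors recently failed, skip them when picking a mirror to read from, and forget the history when the cached inode goes away.

#include <stdint.h>

/* Hypothetical in-memory, per-file mirror state; not the real lov_object. */
struct mirror_state {
        unsigned int ms_mirror_count;
        uint64_t     ms_unavail_bitmap;  /* bit N set => mirror N failed recently */
};

/* Record a read failure so later reads on this cached file skip the mirror. */
void mirror_mark_unavailable(struct mirror_state *ms, unsigned int idx)
{
        ms->ms_unavail_bitmap |= 1ULL << idx;
}

/* Pick the first mirror not marked unavailable, starting from 'preferred';
 * fall back to 'preferred' if every mirror has been marked, since retrying
 * is still better than failing the read outright. */
unsigned int mirror_pick(const struct mirror_state *ms, unsigned int preferred)
{
        for (unsigned int i = 0; i < ms->ms_mirror_count; i++) {
                unsigned int idx = (preferred + i) % ms->ms_mirror_count;

                if (!(ms->ms_unavail_bitmap & (1ULL << idx)))
                        return idx;
        }
        return preferred;
}

/* Called when the inode is dropped from cache: forget the failure history. */
void mirror_state_reset(struct mirror_state *ms)
{
        ms->ms_unavail_bitmap = 0;
}
|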
| Comment by Gerrit Updater [ 24/May/19 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/34952 |
| Comment by Patrick Farrell (Inactive) [ 24/May/19 ] |
|
Jinshan pointed out that in the simple OST-offline case, he doesn't think the client should keep timing out on every request. We should probably still have a concept of a mirror being unhealthy for cases where something is intermittent but not fully offline. |
| Comment by Andreas Dilger [ 24/May/19 ] |
|
I guess the question is whether FLR expects the underlying OSC to fail the RPC quickly if the OST is offline, so that the upper layers do not need to track this themselves? |
| Comment by Andreas Dilger [ 24/May/19 ] |
|
Tejas, can you please also attach your log files here. |
| Comment by Joe Frith [ 25/May/19 ] |
|
I have attached the log file. |
| Comment by Zhenyu Xu [ 25/May/19 ] |
|
From the log I can see that for every 1MB read, the lov_io's lis_mirror_index is cleared, so the read starts again from mirror 0. In every ll_file_aio_read() iteration the lu_env is a new one, so lov_io::lis_mirror_index does not keep its old value, I think.
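A toy userspace model of why that is so costly (all names here are illustrative, not Lustre code): if the retry index lives in a per-syscall context it is rebuilt for every read call and each 1MB read pays a fresh timeout on the dead mirror, whereas state kept with the file survives across calls and only the first read pays.

#include <stdio.h>

#define MIRROR_COUNT 2
#define DEAD_MIRROR  0                      /* models the unmounted OST */

struct per_call_ctx { int start_index; };   /* models lov_io in a fresh lu_env */
struct file_state   { int last_good;   };   /* models state cached with the file */

static int try_read(int mirror)
{
        return mirror != DEAD_MIRROR;       /* "times out" on the dead mirror */
}

/* Read one 1MB block, starting from 'start'; count every timeout we hit. */
static int read_1mb(int start, int *timeouts)
{
        for (int i = 0; i < MIRROR_COUNT; i++) {
                int m = (start + i) % MIRROR_COUNT;

                if (try_read(m))
                        return m;
                (*timeouts)++;
        }
        return -1;
}

int main(void)
{
        struct file_state fs = { .last_good = 0 };
        int timeouts_percall = 0, timeouts_cached = 0;

        for (int block = 0; block < 10; block++) {
                /* Per-syscall state: thrown away after each call, so every
                 * block starts over at mirror 0 and times out there first. */
                struct per_call_ctx ctx = { .start_index = 0 };
                read_1mb(ctx.start_index, &timeouts_percall);

                /* File-cached state: remembers the mirror that worked. */
                fs.last_good = read_1mb(fs.last_good, &timeouts_cached);
        }
        /* Prints: timeouts: per-call=10 cached=1 */
        printf("timeouts: per-call=%d cached=%d\n",
               timeouts_percall, timeouts_cached);
        return 0;
}
|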
| Comment by Andreas Dilger [ 25/May/19 ] |
|
Alex's patch to move lu_env into a global hash indexed by the process could help this case, so that the lu_env stays around over multiple syscalls. However, there would be lifetime issues around how long we need to keep the lu_env after a syscall completes, since we would not be notified when a process exits... We could tie the state to the open file handle instead of the lu_env, which would be an improvement (file handles disappear when the process exits), but keeping it in the in-memory file layout, as suggested on the patch, would be even better. This state affects all threads accessing that file, so there is no reason to keep it in a per-process structure like the lu_env or file descriptor and have to re-discover it for each thread (or syscall!) accessing that file. My top preference is to track the state on the OSC itself, since it is really global to all files stored on that OST. That is essentially what the "active" state is. |
| Comment by Jinshan Xiong [ 26/May/19 ] |
|
I realized this piece of code has been removed from review:

        /* move replica index to the next one */
        spin_lock(&comp->lo_write_lock);
        if (index == comp->lo_preferred_replica) {
                do {
                        index = (index + 1) % comp->lo_entry_count;
                        if (comp->lo_entries[index].lle_valid)
                                break;
                } while (index != comp->lo_preferred_replica);

                /* reset preferred replica so that other threads can
                 * take advantage of our retries. */
                comp->lo_preferred_replica = index;
        } else {
                /* preferred index was moved by other thread */
                index = comp->lo_preferred_replica;
        }
        spin_unlock(&comp->lo_write_lock);

        CDEBUG(D_VFSTRACE, DFID " move replica from %d to %d, "
               "have retried: %d\n", PFID(lu_object_fid(lov2lu(obj))),
               lio->lis_replica_index, index, io->ci_ndelay_tried);

        lio->lis_replica_index = index;
This piece of code just tracks the last successful mirror. Right now, since lo_preferred_mirror is also used by write, we should add a new field called lo_last_success_mirror (or similar) to track the last successful mirror. Tying the last successful mirror to the open file has a problem: different open files would use different mirrors, so the same piece of data would be cached more than once. Meanwhile, we should revise the commit:

commit 5a6ceb664f07812c351786c1043da71ff5027f8c
Author: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Date:   Mon Sep 28 16:50:15 2015 +0300

    LU-7236 ptlrpc: idle connections can disconnect

to honor rq_no_delay in the RPC. |
| Comment by Andreas Dilger [ 28/May/19 ] |
Jinshan, I don't see any record of that code on master. Was it maybe only in a patch on the FLR branch, or only in your local checkout? Alex, any thoughts on how to fix the issue Jinshan describes? |
| Comment by Joe Frith [ 30/May/19 ] |
|
This is a major issue for us as we cannot continue with maintenance unless this issue is fixed. Will this be included in the next release 2.12.3? Any time-frame on this? |
| Comment by Alex Zhuravlev [ 30/May/19 ] |
|
> to honor rq_no_delay in the RPC |
| Comment by Andreas Dilger [ 30/May/19 ] |
|
Alex, ideally there would be a two-stage approach for FLR. For reads it would try whichever OST is preferred. If that OSC is offline, it could be skipped initially and the read would go to the other mirror copies whose OSCs are online. If none are online, then it should wait on the preferred OSC. For writes, the MDS selects which replica should be used, so the client will have to wait until the OSC is connected again.
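A rough sketch of that two-stage read selection (the mirror_view structure and its online flag are placeholders, not existing Lustre interfaces): the first stage only considers mirrors whose OSC is currently connected, starting with the preferred one, and only if nothing is online does the caller fall back to blocking on the preferred mirror.

#include <stdbool.h>

/* Placeholder per-mirror view; not the real lov layout structures. */
struct mirror_view {
        int  mv_index;
        bool mv_osc_online;     /* would come from the OSC import state */
};

/*
 * Stage 1: pick an online mirror, preferred one first.
 * Stage 2 (in the caller): if this returns -1, no replica is reachable,
 * so wait on 'preferred' until its OSC reconnects.
 */
int flr_pick_read_mirror(const struct mirror_view *mirrors, int count,
                         int preferred)
{
        for (int i = 0; i < count; i++) {
                int idx = (preferred + i) % count;

                if (mirrors[idx].mv_osc_online)
                        return idx;
        }
        return -1;
}
|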
| Comment by Alex Zhuravlev [ 30/May/19 ] |
|
That logic would prevent balancing? |
| Comment by Jinshan Xiong [ 31/May/19 ] |
Yes, I found this piece of code in my local branch. That version of the patch was an earlier version of the implementation; notice that the name was 'replica' at that time. I believe it is in one of the abandoned patches. If the connection is in IDLE state, it should probably return immediately if the RPC has `rq_no_delay` set, and I tend to think it should also kick off the reconnection asynchronously. In the current implementation of FLR, the read iterates over all mirrors until it finds one available to read from. If none is available, it waits 10ms and restarts the scan. Hopefully it would find a mirror that has become available sooner if the reconnect is kicked off earlier.
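For reference, a simplified model of the retry loop described above (mirror_read_ok() and mirror_count() are made-up stand-ins, and the delay only mirrors the 10ms figure mentioned here): scan every mirror once, and if none can serve the read, sleep briefly and scan again, which is why kicking off the reconnect early shortens how long this loop spins.

#include <stdbool.h>
#include <unistd.h>

/* Illustrative stand-ins for "can this mirror serve the read right now?" */
bool mirror_read_ok(int idx);
int  mirror_count(void);

int flr_wait_for_mirror(void)
{
        for (;;) {
                for (int idx = 0; idx < mirror_count(); idx++)
                        if (mirror_read_ok(idx))
                                return idx;
                usleep(10000);  /* ~10ms, then restart the scan */
        }
}
|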
| Comment by Alex Zhuravlev [ 03/Jun/19 ] |
|
In terms of latency it makes sense to use the first/any available target. In terms of throughput it makes sense to balance I/O among the targets. Thus I guess the code should be able to detect the point where I/O becomes "massive" for a specific object and only then use the idling connections, but not sooner? |
| Comment by Jinshan Xiong [ 03/Jun/19 ] |
|
Throughput from a single node has never been a goal for FLR, so the current logic is to find an available mirror and stick with it. And yes, I think we should not block RPCs that have `rq_no_delay` set. Let's set the IDLE connection question aside for a moment - if I remember correctly, the user is actually hitting a problem where the connection is in DISCON; should we fix that first? |
| Comment by Andreas Dilger [ 03/Jun/19 ] |
|
Alex, I definitely have some ideas on client-side read policy, in order to maximize global throughput vs. single-client throughput, from LDEV-436:
|
| Comment by Alex Zhuravlev [ 04/Jun/19 ] |
|
I'm not against changing the semantics of rq_no_delay, but it should be noted that other users (like lfs df) would then need to be changed, since those callers want to try to connect at least once before giving up. |
| Comment by Gerrit Updater [ 08/Jun/19 ] |
|
Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: https://review.whamcloud.com/35111 |
| Comment by Joe Frith [ 25/Jun/19 ] |
|
Will this patch be included in Lustre 2.12.3? We are delaying maintenance because of this issue. |
| Comment by Andreas Dilger [ 26/Jun/19 ] |
|
As yet the patch has not landed on master, so that would need to happen before it can land to b2_12. |
| Comment by Gerrit Updater [ 07/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35111/ |
| Comment by Gerrit Updater [ 09/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35450 |
| Comment by Gerrit Updater [ 09/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35450/ |
| Comment by Joe Frith [ 30/Aug/19 ] |
|
Did the patch get reverted after being included in master?
We are still hoping that this issue gets resolved so we can go ahead with the maintenance. |
| Comment by Andreas Dilger [ 31/Aug/19 ] |
|
The patch was reverted because it was causing frequent crashes in testing. The original patch is https://review.whamcloud.com/34952. |
| Comment by Joe Frith [ 16/Sep/19 ] |
|
Any chance this will get included in the 2.12.3? We are stuck and cannot migrate due to this issue. |
| Comment by Andreas Dilger [ 08/Oct/19 ] |
|
Tejas, I don't think this patch will make it into 2.12.3, because it hasn't yet landed to master. However, I think the current version of the patch is in good enough shape for you to test. It would be useful if you could give the latest patch a try and let us know if this is working for you. |
| Comment by Joe Frith [ 11/Oct/19 ] |
|
I did a quick test and the patch seems to work as expected; read performance is as expected with an unhealthy mirror. I did not, however, run comprehensive tests. |
| Comment by Gerrit Updater [ 18/Oct/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34952/ |
| Comment by Peter Jones [ 18/Oct/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 22/Oct/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36550 |
| Comment by Gerrit Updater [ 21/Nov/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36550/ |