LU-12328

FLR mirroring on 2.12.1-1 not usable if OST is down

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.1
    • None
    • Environment: RHEL 7.6
    • Severity: 3

    Description

      See below for the stripe details of the file "mirror10". If OST index 1 is unmounted and made unavailable, read performance drops to roughly 1/10th of the expected performance. The client has to time out on OST index 1 before it tries to read from OST index 7, and this happens for every 1 MB block (the stripe size in use), resulting in very poor performance.

      $ lfs getstripe mirror10
      mirror10
       lcm_layout_gen: 5
       lcm_mirror_count: 2
       lcm_entry_count: 2
       lcme_id: 65537
       lcme_mirror_id: 1
       lcme_flags: init
       lcme_extent.e_start: 0
       lcme_extent.e_end: EOF
       lmm_stripe_count: 1
       lmm_stripe_size: 1048576
       lmm_pattern: raid0
       lmm_layout_gen: 0
       lmm_stripe_offset: 1
       lmm_pool: 01
       lmm_objects:
       - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x280a8:0x0] }
      
      lcme_id: 131074
       lcme_mirror_id: 2
       lcme_flags: init
       lcme_extent.e_start: 0
       lcme_extent.e_end: EOF
       lmm_stripe_count: 1
       lmm_stripe_size: 1048576
       lmm_pattern: raid0
       lmm_layout_gen: 0
       lmm_stripe_offset: 7
       lmm_pool: 02
       lmm_objects:
       - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x28066:0x0] }
      

      Attachments

        Issue Links

          Activity

            [LU-12328] FLR mirroring on 2.12.1-1 not usable if OST is down

            Throughput from a single node has never been a goal for FLR, so the current logic is to find an available mirror and stick with it.

            And yes, I think we should not block `rq_no_delay`.

            Let's set the IDLE connection aside for a moment - if I remember correctly, the user is actually hitting a problem where the connection is in DISCON; should we fix that first?

            Jinshan Jinshan Xiong added a comment -

            in terms of latency it makes sense to use the first/any available target. in terms of throughput it makes sense to balance I/O among the targets. thus I guess the code should be able to detect the point where I/O becomes "massive" for a specific object and only then use idling connections, but not sooner?
            would hitting max-RPCs-in-flight be another way to detect when balancing makes sense?
            yet another thought is that it doesn't make sense to allocate/prepare an RPC against an idling connection unless we really want to use it - i.e. rq_no_delay isn't really the best interface for this kind of logic. and even when we want to enable that connection (due to balancing), we don't want to block with rq_no_delay, but proceed with the FULL one and initiate the idling one?

            bzzz Alex Zhuravlev added a comment -
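            As a rough, purely illustrative sketch of the saturation test suggested above (only spread I/O to an additional, idling mirror once the connection currently in use is at its RPC-in-flight limit) - the struct and field names here are hypothetical, not the actual osc/ptlrpc code:

             #include <stdbool.h>

             /* Hypothetical per-connection counters; names are illustrative only. */
             struct conn_stats {
                     int rpcs_in_flight;
                     int max_rpcs_in_flight;
             };

             /*
              * One possible "is the I/O heavy enough to justify waking an idle
              * connection?" test: only spread load to additional mirrors once
              * the connection currently in use is saturated.
              */
             static bool should_use_idle_mirror(const struct conn_stats *cur)
             {
                     return cur->rpcs_in_flight >= cur->max_rpcs_in_flight;
             }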

            > Jinshan, I don't see any record of that code on master. Was it maybe only in a patch on the FLR branch, or only in your local checkout?

            Yes, I found this piece of code in my local branch. That patch was an earlier version of the implementation. Notice that the name was 'replica' at that time. I believe it is in one of the abandoned patches.

            If the connection is in the IDLE state, it should probably return immediately if the RPC has `rq_no_delay` set, and I tend to think it should also kick off the reconnection asynchronously.

            In the current implementation of FLR, it iterates over all mirrors until it finds one available to read. If none is available, it waits for 10 ms and starts trying again. Hopefully it will find a mirror that has become available to read if the reconnect is kicked off earlier.

            Jinshan Jinshan Xiong added a comment -
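            To make the retry behaviour described above concrete, here is a minimal user-space sketch: iterate over the mirrors, skip (and asynchronously reconnect) the unavailable ones, and if nothing is usable wait 10 ms and retry. All names (mirror_state, kick_reconnect, pick_read_mirror) are hypothetical illustrations, not the actual lov/osc code:

             #include <stdbool.h>
             #include <stdio.h>
             #include <unistd.h>

             #define NR_MIRRORS 2

             /* Hypothetical per-mirror state; the real client keeps this in the
              * layout and the OSC import, not in a flat array. */
             struct mirror_state {
                     int  ost_idx;
                     bool connected;         /* import in FULL state */
             };

             /* Placeholder for starting a non-blocking reconnect to the OST. */
             static void kick_reconnect(struct mirror_state *m)
             {
                     printf("kicking async reconnect to OST index %d\n", m->ost_idx);
             }

             /*
              * Pick a mirror to read from.  Unavailable mirrors are skipped but
              * get an asynchronous reconnect request, so a later retry may find
              * them usable again.  Returns the mirror index, or -1 if none is up.
              */
             static int pick_read_mirror(struct mirror_state *mirrors, int nr)
             {
                     for (int i = 0; i < nr; i++) {
                             if (mirrors[i].connected)
                                     return i;
                             kick_reconnect(&mirrors[i]);
                     }
                     return -1;
             }

             int main(void)
             {
                     struct mirror_state mirrors[NR_MIRRORS] = {
                             { .ost_idx = 1, .connected = false },   /* OST0001 down */
                             { .ost_idx = 7, .connected = true  },
                     };
                     int idx;

                     /* Behaviour described in the comment above: if no mirror is
                      * usable, wait 10 ms and try the whole list again. */
                     while ((idx = pick_read_mirror(mirrors, NR_MIRRORS)) < 0)
                             usleep(10 * 1000);

                     printf("reading from mirror %d (OST index %d)\n",
                            idx, mirrors[idx].ost_idx);
                     return 0;
             }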

            that logic would prevent balancing?

            bzzz Alex Zhuravlev added a comment -

            Alex, ideally there would be a two-stage approach for FLR. For reads it would try whichever OST is preferred. If that OSC is offline then it could be skipped initially, and the read would go to the other mirror copies whose OSCs are online. If none are online, then it should wait on the preferred OSC. For writes, the MDS selects which replica should be used, so the client will have to wait until that OSC is connected again.

            adilger Andreas Dilger added a comment -
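            A minimal sketch of the two-stage selection outlined above, under the assumption of a simple per-mirror online flag; the names are illustrative, not the actual implementation:

             #include <stdbool.h>

             struct mirror {
                     bool online;            /* OSC import usable right now */
             };

             /*
              * Two-stage read mirror selection (sketch):
              *  1) use the preferred mirror if its OSC is online, otherwise any
              *     other mirror whose OSC is online;
              *  2) if every OSC is offline, fall back to the preferred mirror and
              *     let the caller issue a blocking RPC that waits for reconnection.
              */
             static int choose_read_mirror(const struct mirror *m, int nr,
                                           int preferred, bool *will_block)
             {
                     *will_block = false;

                     if (m[preferred].online)
                             return preferred;

                     for (int i = 0; i < nr; i++)
                             if (m[i].online)
                                     return i;

                     /* No mirror is online: wait on the preferred one. */
                     *will_block = true;
                     return preferred;
             }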

            > to honor rq_no_delay in the RPC
            thinking about how to deal with that, but the important thing is that the semantics have changed - if the connection is idle, do you then have to wait some time to try to reconnect?
            for example, all the connections can be idle - do you expect an error in this case?

            bzzz Alex Zhuravlev added a comment -
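            To make the rq_no_delay question concrete, here is a hedged sketch of the behaviour being discussed: a no-delay request against an idle or disconnected connection fails fast and only triggers an asynchronous reconnect, while a normal request is allowed to wait. The state names and the -EWOULDBLOCK choice are assumptions for illustration, not the actual ptlrpc code:

             #include <errno.h>
             #include <stdbool.h>

             enum import_state { IMP_FULL, IMP_IDLE, IMP_DISCON };

             struct obd_import_sketch { enum import_state state; };
             struct request_sketch    { bool no_delay; };

             /* Placeholder: start a non-blocking reconnect of the import. */
             static void start_reconnect(struct obd_import_sketch *imp)
             {
                     (void)imp;
             }

             /*
              * Decide what to do with a request whose import is not FULL.
              * Returns 0 if the request may be sent, a negative errno otherwise.
              */
             static int check_import_for_send(struct obd_import_sketch *imp,
                                              struct request_sketch *req)
             {
                     if (imp->state == IMP_FULL)
                             return 0;

                     start_reconnect(imp);

                     if (req->no_delay) {
                             /* Don't block a no-delay request: fail immediately so
                              * the caller can try another mirror right away. */
                             return -EWOULDBLOCK;
                     }

                     /* Caller waits for the import to become FULL again. */
                     return -EINPROGRESS;
             }

            In the all-idle case asked about above, every no-delay attempt would then fail immediately while the reconnects proceed in the background, and the caller would fall back to the existing retry loop rather than returning an error to userspace.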
            raot Joe Frith added a comment -

            This is a major issue for us as we cannot continue with maintenance unless this issue is fixed. 

            Will this be included in the next release 2.12.3? Any time-frame on this?


            > I realized this piece of code has been removed from review:

            Jinshan, I don't see any record of that code on master. Was it maybe only in a patch on the FLR branch, or only in your local checkout?

            Alex, any thoughts on how to fix the issue Jinshan describes?

            adilger Andreas Dilger added a comment -
            Jinshan Jinshan Xiong added a comment - - edited

            I realized this piece of code has been removed from review:

             /* move replica index to the next one */
             spin_lock(&comp->lo_write_lock);
             if (index == comp->lo_preferred_replica) {
                     do {
                             index = (index + 1) % comp->lo_entry_count;
                             if (comp->lo_entries[index].lle_valid)
                                     break;
                     } while (index != comp->lo_preferred_replica);

                     /* reset preferred replica so that other threads can
                      * take advantage of our retries. */
                     comp->lo_preferred_replica = index;
             } else {
                     /* preferred index was moved by other thread */
                     index = comp->lo_preferred_replica;
             }
             spin_unlock(&comp->lo_write_lock);

             CDEBUG(D_VFSTRACE, DFID " move replica from %d to %d, "
                    "have retried: %d\n", PFID(lu_object_fid(lov2lu(obj))),
                    lio->lis_replica_index, index, io->ci_ndelay_tried);

             lio->lis_replica_index = index;

            This piece of code just tracks the last successful mirror. Right now, since lo_preferred_mirror is also used for writes, we should add a new field called lo_last_success_mirror or similar to track the last successful mirror.

            Tying the last successful mirror to the open file seems to have a problem: different open files would use different mirrors, so the same piece of data would be cached more than once.

            Meanwhile, we should revise the commit:

            commit 5a6ceb664f07812c351786c1043da71ff5027f8c
            Author: Alex Zhuravlev <alexey.zhuravlev@intel.com>
            Date:   Mon Sep 28 16:50:15 2015 +0300
            
                LU-7236 ptlrpc: idle connections can disconnect
            

            to honor rq_no_delay in the RPC

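            A minimal sketch of the "last successful mirror" idea, assuming a hypothetical lo_last_success_mirror field kept in the in-memory layout object so that every thread and every later syscall benefits from a previous failover; the names echo the removed snippet but are not actual Lustre code:

             #include <pthread.h>

             /* Hypothetical slice of the in-memory layout object. */
             struct lov_object_sketch {
                     pthread_mutex_t lo_lock;
                     int             lo_preferred_mirror;    /* used for writes */
                     int             lo_last_success_mirror; /* last mirror a read worked on */
             };

             /* Start the next read at the mirror that last worked, not at mirror 0. */
             static int read_start_mirror(struct lov_object_sketch *lo)
             {
                     pthread_mutex_lock(&lo->lo_lock);
                     int idx = lo->lo_last_success_mirror;
                     pthread_mutex_unlock(&lo->lo_lock);
                     return idx;
             }

             /* After a read completes (possibly after failing over), remember the
              * mirror that worked so other threads skip the dead one. */
             static void read_done_mirror(struct lov_object_sketch *lo, int idx)
             {
                     pthread_mutex_lock(&lo->lo_lock);
                     lo->lo_last_success_mirror = idx;
                     pthread_mutex_unlock(&lo->lo_lock);
             }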

            Alex's patch to move the lu_env into a global hash indexed by process could help this case, so that the lu_env stays around over multiple syscalls. However, there would be lifetime issues for how long we need to keep the lu_env after a syscall completes, since we would not be notified when a process exits...

            We could tie the state to the open file handle instead of the lu_env, which would be an improvement (file handles disappear when the process exits), but keeping it in the file layout (in memory), as suggested in the patch, would be even better. This state affects all threads accessing that file, so there is no reason to keep it in a per-process structure like the lu_env or the file descriptor and have to re-discover it for each thread (or syscall!) accessing that file.

            My top preference is that we track the state on the OSC itself, since this is really global to all files stored on the OST. That is essentially what the "active" state is.

            adilger Andreas Dilger added a comment -
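            As a rough illustration of keeping that state on the OSC itself: mirror selection simply consults a flag shared by every file striped on that OST, so nothing has to be cached per lu_env, per file descriptor, or per layout and re-discovered by each thread. The od_active name is a simplified assumption, not the real osc_device definition:

             #include <stdbool.h>

             /* One shared object per OSC; its flag covers every file on that OST. */
             struct osc_device_sketch {
                     bool od_active;         /* import connected / OST usable */
             };

             struct mirror_entry_sketch {
                     struct osc_device_sketch *me_osc;
             };

             /* Mirror selection keyed off the shared per-OSC state: no per-file,
              * per-fd or per-lu_env copy to re-discover on every syscall. */
             static int pick_mirror(struct mirror_entry_sketch *m, int nr, int preferred)
             {
                     if (m[preferred].me_osc->od_active)
                             return preferred;

                     for (int i = 0; i < nr; i++)
                             if (m[i].me_osc->od_active)
                                     return i;

                     return preferred;       /* all inactive: wait on the preferred one */
             }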
            bobijam Zhenyu Xu added a comment -

            From the log I can see that for every 1 MB read, the lov_io's lis_mirror_index is cleared, so the read starts trying from mirror 0 again.

            In every ll_file_aio_read() iteration the lu_env is a new one, so lov_io::lis_mirror_index does not retain its old value, I think.


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: raot Joe Frith
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: