LU-12328

FLR mirroring on 2.12.1-1 not usable if OST is down

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.1
    • Labels: None
    • Environment: RHEL 7.6
    • Severity: 3
    • 9223372036854775807

    Description

      See below for stripe details on the file "mirror10". If OST idx 1 is unmounted and made unavailable, performance drops to roughly 1/10th of the expected rate. The client has to time out on OST idx 1 before it tries to read from OST idx 7, and this happens for every 1MB block (the stripe size in use), resulting in very poor performance.


      $ lfs getstripe mirror10
      mirror10
       lcm_layout_gen: 5
       lcm_mirror_count: 2
       lcm_entry_count: 2
       lcme_id: 65537
       lcme_mirror_id: 1
       lcme_flags: init
       lcme_extent.e_start: 0
       lcme_extent.e_end: EOF
       lmm_stripe_count: 1
       lmm_stripe_size: 1048576
       lmm_pattern: raid0
       lmm_layout_gen: 0
       lmm_stripe_offset: 1
       lmm_pool: 01
       lmm_objects:
       - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x280a8:0x0] }
      
      lcme_id: 131074
       lcme_mirror_id: 2
       lcme_flags: init
       lcme_extent.e_start: 0
       lcme_extent.e_end: EOF
       lmm_stripe_count: 1
       lmm_stripe_size: 1048576
       lmm_pattern: raid0
       lmm_layout_gen: 0
       lmm_stripe_offset: 7
       lmm_pool: 02
       lmm_objects:
       - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x28066:0x0] }
      

          Activity

            [LU-12328] FLR mirroring on 2.12.1-1 not usable if OST is down

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: https://review.whamcloud.com/35111
            Subject: LU-12328 flr: preserve last read mirror
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7f9985832de7699f06fdef2916a280a3666ca7cf

            bzzz Alex Zhuravlev added a comment -

            I'm not against changing the semantics of rq_no_delay, but it should be noted that other users (like lfs df) would then need to be changed where the caller wants to try to connect at least once before giving up.
            Given the extra cost of preparing an RPC, I think a better/cheaper interface to check connection status is needed.
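
            A minimal userspace sketch of the kind of cheaper check being asked for here; the names (import_ready_nowait, try_connect_once) are hypothetical and do not exist in Lustre. The point is only that a caller such as lfs df could query connection state directly instead of building an RPC that rq_no_delay then fails:

            enum import_state { IMP_FULL, IMP_IDLE, IMP_DISCON };

            /* non-blocking query: can an RPC sent right now be serviced without waiting? */
            static int import_ready_nowait(enum import_state state)
            {
                    return state == IMP_FULL;
            }

            /* lfs df-style caller: still willing to attempt one connect before giving up */
            static int import_ready_or_try_once(enum import_state state,
                                                int (*try_connect_once)(void))
            {
                    return import_ready_nowait(state) || try_connect_once() == 0;
            }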

            adilger Andreas Dilger added a comment -

            Alex, I definitely have some ideas on client-side read policy, in order to maximize global throughput vs. single-client throughput, from LDEV-436:

            In particular, it would be good if reading the same data from a file normally reads from the same OST/replica so that it can be served from cache, rather than using a random replica and forcing disk reads on multiple OSTs. Not only does this avoid disk activity, it also avoids a single client doing reads from multiple OSTs and needing to get DLM locks from each one.

            If the file is very large (e.g. multi-GB), it makes sense to spread the read workload across multiple OSTs in some deterministic manner (e.g. based on replica count and file offset) so that there are multiple OSTs active on the file, and at least several of the replicas active if many clients are reading at different offsets in the same file.

            If there are large numbers of replicas for a single file, then the clients should spread the read workload across all of them (e.g. based on client NID), on the assumption that a user creates 10+ replicas of a file to increase the read bandwidth.
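
            As a rough illustration of the deterministic spreading described above, here is a tiny userspace sketch (hypothetical names and chunk size, not Lustre code) that maps a (client NID, file offset) pair to a replica index, so repeated reads of the same range land on the same OST while large files and many clients still spread across replicas:

            #include <stdint.h>

            #define SPREAD_BYTES (1ULL << 30)   /* hypothetical: switch replica every 1 GiB of offset */

            static unsigned int pick_read_replica(uint64_t client_nid, uint64_t offset,
                                                  unsigned int replica_count)
            {
                    uint64_t chunk = offset / SPREAD_BYTES;  /* which 1 GiB "chunk" of the file */

                    /* same client + same chunk always maps to the same replica */
                    return (unsigned int)((client_nid + chunk) % replica_count);
            }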

            Jinshan Jinshan Xiong added a comment -

            Throughput from a single node has never been a goal for FLR, so the current logic is to find an available mirror and stick with that one.

            And yes, I think we should not block an RPC that has `rq_no_delay` set.

            Let's set the IDLE connection question aside for a moment - if I remember correctly, the problem the user is actually hitting is that the connection is in DISCON; should we fix that first?
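
            A minimal sketch of that "find an available mirror and stick with it" idea, in the spirit of the "preserve last read mirror" patch (hypothetical helper names, not the actual patch): remember the last mirror a read succeeded on and try it first on the next read, falling back to the others in order.

            struct file_read_state {
                    int last_mirror;        /* -1 until the first successful read */
            };

            static int choose_mirror(struct file_read_state *st, int mirror_count,
                                     int (*mirror_ok)(int idx))
            {
                    int start = st->last_mirror >= 0 ? st->last_mirror : 0;

                    for (int i = 0; i < mirror_count; i++) {
                            int idx = (start + i) % mirror_count;

                            if (mirror_ok(idx)) {
                                    st->last_mirror = idx;  /* preserve it for the next read */
                                    return idx;
                            }
                    }
                    return -1;              /* no mirror currently available */
            }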

            bzzz Alex Zhuravlev added a comment -

            In terms of latency it makes sense to use the first/any available target; in terms of throughput it makes sense to balance I/O among the targets. So I guess the code should be able to detect the point where I/O becomes "massive" for a specific object and only then use idling connections, but not sooner?
            Would hitting max-RPC-in-flight be another way to detect when balancing makes sense?
            Yet another thought: it doesn't make sense to allocate/prepare an RPC against an idling connection unless we really want to use it - i.e. rq_no_delay isn't really the best interface for this kind of logic. And even when we do want to enable that connection (due to balancing), we don't want to block with rq_no_delay, but proceed with a FULL one and initiate the idling one?
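
            A sketch of that last idea, assuming the client can query per-target connection state and trigger a reconnect asynchronously (neither helper exists under these names in Lustre): issue the read on a target that is already FULL, and if balancing wants more targets, only wake the idle one in the background.

            enum conn_state { CONN_FULL, CONN_IDLE, CONN_DISCON };

            static int select_target(int count, enum conn_state (*state)(int idx),
                                     void (*reconnect_async)(int idx), int want_balance)
            {
                    int full = -1, idle = -1;

                    for (int idx = 0; idx < count; idx++) {
                            if (state(idx) == CONN_FULL && full < 0)
                                    full = idx;
                            else if (state(idx) == CONN_IDLE && idle < 0)
                                    idle = idx;
                    }

                    /* balancing: start reconnecting the idle target, but don't block on it */
                    if (want_balance && idle >= 0 && reconnect_async)
                            reconnect_async(idle);

                    return full;    /* -1 means nothing is FULL right now */
            }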

            Jinshan Jinshan Xiong added a comment -

            > Jinshan, I don't see any record of that code on master. Was it maybe only in a patch on the FLR branch, or only in your local checkout?

            Yes, I found this piece of code in my local branch. That version of the patch was an earlier version of the implementation. Notice that the name was 'replica' at that time. I believe it is in one of the abandoned patches.

            If the connection is in IDLE state, it should probably return immediately if the RPC has `rq_no_delay` set, and I tend to think it should also kick off the reconnection asynchronously.

            In the current implementation of FLR, the client iterates over all mirrors until it finds one available to read. If none is available, it waits 10ms and restarts the scan. Hopefully it would then find a mirror that has become available to read, if the reconnect was kicked off earlier.
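
            The retry behaviour described above, as a small userspace sketch (not the actual Lustre code path): scan the mirrors, and if none is readable, sleep 10ms and scan again, so a mirror whose reconnect was kicked off earlier can be picked up on a later pass.

            #include <unistd.h>

            static int wait_for_readable_mirror(int mirror_count, int (*mirror_ok)(int idx))
            {
                    for (;;) {
                            for (int idx = 0; idx < mirror_count; idx++)
                                    if (mirror_ok(idx))
                                            return idx;
                            usleep(10 * 1000);      /* none available: wait 10ms, then rescan */
                    }
            }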

            bzzz Alex Zhuravlev added a comment -

            That logic would prevent balancing?

            adilger Andreas Dilger added a comment -

            Alex, ideally there would be a two-stage approach for FLR. For reads it would try whichever OST is preferred. If that OSC is offline then it could be skipped initially, and the read would go to the other mirror copies if their OSCs are online. If none are online, then it should wait on the preferred OSC. For writes, the MDS selects which replica should be used, so the client will have to wait until that OSC is connected again.
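
            A sketch of the two-stage read selection suggested above, with hypothetical helpers for connection state and blocking reconnect: the first pass only accepts mirrors whose OSC is already online (preferred one first); only if every mirror is offline does the read block waiting for the preferred OSC.

            enum osc_state { OSC_ONLINE, OSC_OFFLINE };

            static int pick_mirror_two_stage(int preferred, int mirror_count,
                                             enum osc_state (*osc)(int idx),
                                             int (*wait_for_connect)(int idx))
            {
                    /* stage 1: anything already online, starting with the preferred mirror */
                    if (osc(preferred) == OSC_ONLINE)
                            return preferred;
                    for (int idx = 0; idx < mirror_count; idx++)
                            if (osc(idx) == OSC_ONLINE)
                                    return idx;

                    /* stage 2: everything offline - block on the preferred OSC */
                    return wait_for_connect(preferred) == 0 ? preferred : -1;
            }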

            bzzz Alex Zhuravlev added a comment -

            > to honor rq_no_delay in the RPC

            I'm thinking about how to deal with that, but the important thing is that the semantics have changed - if the connection is idle, do you then have to wait some time to try to reconnect?
            For example, all the connections can be idle; do you expect an error in this case?
            raot Joe Frith added a comment -

            This is a major issue for us as we cannot continue with maintenance unless this issue is fixed.

            Will this be included in the next release 2.12.3? Any time-frame on this?

            adilger Andreas Dilger added a comment -

            > I realized this piece of code has been removed from review:

            Jinshan, I don't see any record of that code on master. Was it maybe only in a patch on the FLR branch, or only in your local checkout?

            Alex, any thoughts on how to fix the issue Jinshan describes?

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: raot Joe Frith
              Votes: 0
              Watchers: 8
