LU-12328

FLR mirroring on 2.12.1-1 not usable if OST is down

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.1
    • None
    • Environment: RHEL 7.6
    • Severity: 3

    Description

      See below for the stripe details of the file "mirror10". If OST index 1 is unmounted and made unavailable, read performance drops to roughly 1/10th of the expected performance. The client has to time out on OST index 1 before it tries to read from OST index 7, and this happens for every 1 MB block (the stripe size in use), resulting in very poor performance.

      $ lfs getstripe mirror10
      mirror10
       lcm_layout_gen: 5
       lcm_mirror_count: 2
       lcm_entry_count: 2
       lcme_id: 65537
       lcme_mirror_id: 1
       lcme_flags: init
       lcme_extent.e_start: 0
       lcme_extent.e_end: EOF
       lmm_stripe_count: 1
       lmm_stripe_size: 1048576
       lmm_pattern: raid0
       lmm_layout_gen: 0
       lmm_stripe_offset: 1
       lmm_pool: 01
       lmm_objects:
       - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x280a8:0x0] }
      
      lcme_id: 131074
       lcme_mirror_id: 2
       lcme_flags: init
       lcme_extent.e_start: 0
       lcme_extent.e_end: EOF
       lmm_stripe_count: 1
       lmm_stripe_size: 1048576
       lmm_pattern: raid0
       lmm_layout_gen: 0
       lmm_stripe_offset: 7
       lmm_pool: 02
       lmm_objects:
       - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x28066:0x0] }
      

      Attachments

        Issue Links

          Activity

            [LU-12328] FLR mirroring on 2.12.1-1 not usable if OST is down

            Throughput from a single node has never been a goal for FLR, so the current logic is to find an available mirror and stick with it.

            And yes, I think we should not block `rq_no_delay`.

            Let's set the IDLE connection aside for a moment - if I remember correctly, the user is actually hitting a problem where the connection is in DISCON; should we fix that first?

            Jinshan Jinshan Xiong added a comment -

            in terms of latency it makes sense to use the first/any available target. in terms of throughput it makes sense to balance I/O among the targets. thus I guess the code should be able to detect the point where I/O becomes "massive" for a specific object and only then use idling connections, but not sooner?
            would hitting max-RPCs-in-flight be another way to detect when balancing makes sense?
            yet another thought is that it doesn't make sense to allocate/prepare an RPC against an idling connection unless we really want to use it - i.e. rq_no_delay isn't really the best interface for this kind of logic. and even when we want to enable that connection (due to balancing), we don't want to block with rq_no_delay, but proceed with the FULL one and initiate the idling one?

            bzzz Alex Zhuravlev added a comment -
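            As a rough, purely illustrative sketch of the saturation test suggested above (only spread I/O to an additional, idling mirror once the connection currently in use is at its RPC-in-flight limit) - the struct and field names here are hypothetical, not the actual osc/ptlrpc code:

             #include <stdbool.h>

             /* Hypothetical per-connection counters; names are illustrative only. */
             struct conn_stats {
                     int rpcs_in_flight;
                     int max_rpcs_in_flight;
             };

             /*
              * One possible "is the I/O heavy enough to justify waking an idle
              * connection?" test: only spread load to additional mirrors once
              * the connection currently in use is saturated.
              */
             static bool should_use_idle_mirror(const struct conn_stats *cur)
             {
                     return cur->rpcs_in_flight >= cur->max_rpcs_in_flight;
             }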

            > Jinshan, I don't see any record of that code on master. Was it maybe only in a patch on the FLR branch, or only in your local checkout?

            Yes, I found this piece of code in my local branch. That patch was an earlier version of the implementation. Notice that the name was 'replica' at that time. I believe it is in one of the abandoned patches.

            If the connection is in the IDLE state, it should probably return immediately if the RPC has `rq_no_delay` set, and I tend to think it should also kick off the reconnection asynchronously.

            In the current implementation of FLR, it iterates over all mirrors until it finds one available to read. If none is available, it waits for 10 ms and starts trying again. Hopefully it will find a mirror that has become available to read if the reconnect is kicked off earlier.

            Jinshan Jinshan Xiong added a comment -
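            To make the retry behaviour described above concrete, here is a minimal user-space sketch: iterate over the mirrors, skip (and asynchronously reconnect) the unavailable ones, and if nothing is usable wait 10 ms and retry. All names (mirror_state, kick_reconnect, pick_read_mirror) are hypothetical illustrations, not the actual lov/osc code:

             #include <stdbool.h>
             #include <stdio.h>
             #include <unistd.h>

             #define NR_MIRRORS 2

             /* Hypothetical per-mirror state; the real client keeps this in the
              * layout and the OSC import, not in a flat array. */
             struct mirror_state {
                     int  ost_idx;
                     bool connected;         /* import in FULL state */
             };

             /* Placeholder for starting a non-blocking reconnect to the OST. */
             static void kick_reconnect(struct mirror_state *m)
             {
                     printf("kicking async reconnect to OST index %d\n", m->ost_idx);
             }

             /*
              * Pick a mirror to read from.  Unavailable mirrors are skipped but
              * get an asynchronous reconnect request, so a later retry may find
              * them usable again.  Returns the mirror index, or -1 if none is up.
              */
             static int pick_read_mirror(struct mirror_state *mirrors, int nr)
             {
                     for (int i = 0; i < nr; i++) {
                             if (mirrors[i].connected)
                                     return i;
                             kick_reconnect(&mirrors[i]);
                     }
                     return -1;
             }

             int main(void)
             {
                     struct mirror_state mirrors[NR_MIRRORS] = {
                             { .ost_idx = 1, .connected = false },   /* OST0001 down */
                             { .ost_idx = 7, .connected = true  },
                     };
                     int idx;

                     /* Behaviour described in the comment above: if no mirror is
                      * usable, wait 10 ms and try the whole list again. */
                     while ((idx = pick_read_mirror(mirrors, NR_MIRRORS)) < 0)
                             usleep(10 * 1000);

                     printf("reading from mirror %d (OST index %d)\n",
                            idx, mirrors[idx].ost_idx);
                     return 0;
             }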

            that logic would prevent balancing?

            bzzz Alex Zhuravlev added a comment -

            Alex, ideally there would be a two-stage approach for FLR. For reads it would try whichever OST is preferred. If that OSC is offline then it could be skipped initially, and the read would go to the other mirror copies whose OSCs are online. If none are online, then it should wait on the preferred OSC. For writes, the MDS selects which replica should be used, so the client will have to wait until that OSC is connected again.

            adilger Andreas Dilger added a comment -
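            A minimal sketch of the two-stage selection outlined above, under the assumption of a simple per-mirror online flag; the names are illustrative, not the actual implementation:

             #include <stdbool.h>

             struct mirror {
                     bool online;            /* OSC import usable right now */
             };

             /*
              * Two-stage read mirror selection (sketch):
              *  1) use the preferred mirror if its OSC is online, otherwise any
              *     other mirror whose OSC is online;
              *  2) if every OSC is offline, fall back to the preferred mirror and
              *     let the caller issue a blocking RPC that waits for reconnection.
              */
             static int choose_read_mirror(const struct mirror *m, int nr,
                                           int preferred, bool *will_block)
             {
                     *will_block = false;

                     if (m[preferred].online)
                             return preferred;

                     for (int i = 0; i < nr; i++)
                             if (m[i].online)
                                     return i;

                     /* No mirror is online: wait on the preferred one. */
                     *will_block = true;
                     return preferred;
             }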

            > to honor rq_no_delay in the RPC
            thinking about how to deal with that, but the important thing is that the semantics have changed - if the connection is idle, do you then have to wait some time to try to reconnect?
            for example, all the connections can be idle - do you expect an error in this case?

            bzzz Alex Zhuravlev added a comment -
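            To make the rq_no_delay question concrete, here is a hedged sketch of the behaviour being discussed: a no-delay request against an idle or disconnected connection fails fast and only triggers an asynchronous reconnect, while a normal request is allowed to wait. The state names and the -EWOULDBLOCK choice are assumptions for illustration, not the actual ptlrpc code:

             #include <errno.h>
             #include <stdbool.h>

             enum import_state { IMP_FULL, IMP_IDLE, IMP_DISCON };

             struct obd_import_sketch { enum import_state state; };
             struct request_sketch    { bool no_delay; };

             /* Placeholder: start a non-blocking reconnect of the import. */
             static void start_reconnect(struct obd_import_sketch *imp)
             {
                     (void)imp;
             }

             /*
              * Decide what to do with a request whose import is not FULL.
              * Returns 0 if the request may be sent, a negative errno otherwise.
              */
             static int check_import_for_send(struct obd_import_sketch *imp,
                                              struct request_sketch *req)
             {
                     if (imp->state == IMP_FULL)
                             return 0;

                     start_reconnect(imp);

                     if (req->no_delay) {
                             /* Don't block a no-delay request: fail immediately so
                              * the caller can try another mirror right away. */
                             return -EWOULDBLOCK;
                     }

                     /* Caller waits for the import to become FULL again. */
                     return -EINPROGRESS;
             }

            In the all-idle case asked about above, every no-delay attempt would then fail immediately while the reconnects proceed in the background, and the caller would fall back to the existing retry loop rather than returning an error to userspace.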
            raot Joe Frith added a comment -

            This is a major issue for us as we cannot continue with maintenance unless this issue is fixed. 

            Will this be included in the next release 2.12.3? Any time-frame on this?


            > I realized this piece of code has been removed from review:

            Jinshan, I don't see any record of that code on master. Was it maybe only in a patch on the FLR branch, or only in your local checkout?

            Alex, any thoughts on how to fix the issue Jinshan describes?

            adilger Andreas Dilger added a comment -
            Jinshan Jinshan Xiong added a comment - - edited

            I realized this piece of code has been removed from review:

             /* move replica index to the next one */
             spin_lock(&comp->lo_write_lock);
             if (index == comp->lo_preferred_replica) {
                     do {
                             index = (index + 1) % comp->lo_entry_count;
                             if (comp->lo_entries[index].lle_valid)
                                     break;
                     } while (index != comp->lo_preferred_replica);

                     /* reset preferred replica so that other threads can
                      * take advantage of our retries. */
                     comp->lo_preferred_replica = index;
             } else {
                     /* preferred index was moved by other thread */
                     index = comp->lo_preferred_replica;
             }
             spin_unlock(&comp->lo_write_lock);

             CDEBUG(D_VFSTRACE, DFID " move replica from %d to %d, "
                    "have retried: %d\n", PFID(lu_object_fid(lov2lu(obj))),
                    lio->lis_replica_index, index, io->ci_ndelay_tried);

             lio->lis_replica_index = index;

            This piece of code just tracks the last successful mirror. Right now, since lo_preferred_mirror is also used for writes, we should add a new field called lo_last_success_mirror or similar to track the last successful mirror.

            Tying the last successful mirror to the open file seems to have a problem: different open files would use different mirrors, so the same piece of data would be cached more than once.

            Meanwhile, we should revise the commit:

            commit 5a6ceb664f07812c351786c1043da71ff5027f8c
            Author: Alex Zhuravlev <alexey.zhuravlev@intel.com>
            Date:   Mon Sep 28 16:50:15 2015 +0300
            
                LU-7236 ptlrpc: idle connections can disconnect
            

            to honor rq_no_delay in the RPC

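            A minimal sketch of the "last successful mirror" idea, assuming a hypothetical lo_last_success_mirror field kept in the in-memory layout object so that every thread and every later syscall benefits from a previous failover; the names echo the removed snippet but are not actual Lustre code:

             #include <pthread.h>

             /* Hypothetical slice of the in-memory layout object. */
             struct lov_object_sketch {
                     pthread_mutex_t lo_lock;
                     int             lo_preferred_mirror;    /* used for writes */
                     int             lo_last_success_mirror; /* last mirror a read worked on */
             };

             /* Start the next read at the mirror that last worked, not at mirror 0. */
             static int read_start_mirror(struct lov_object_sketch *lo)
             {
                     pthread_mutex_lock(&lo->lo_lock);
                     int idx = lo->lo_last_success_mirror;
                     pthread_mutex_unlock(&lo->lo_lock);
                     return idx;
             }

             /* After a read completes (possibly after failing over), remember the
              * mirror that worked so other threads skip the dead one. */
             static void read_done_mirror(struct lov_object_sketch *lo, int idx)
             {
                     pthread_mutex_lock(&lo->lo_lock);
                     lo->lo_last_success_mirror = idx;
                     pthread_mutex_unlock(&lo->lo_lock);
             }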

            Alex's patch to move the lu_env into a global hash indexed by process could help this case, so that the lu_env stays around over multiple syscalls. However, there would be lifetime issues for how long we need to keep the lu_env after a syscall completes, since we would not be notified when a process exits...

            We could tie the state to the open file handle instead of the lu_env, which would be an improvement (file handles disappear when the process exits), but keeping it in the file layout (in memory), as suggested in the patch, would be even better. This state affects all threads accessing that file, so there is no reason to keep it in a per-process structure like the lu_env or the file descriptor and have to re-discover it for each thread (or syscall!) accessing that file.

            My top preference is that we track the state on the OSC itself, since this is really global to all files stored on the OST. That is essentially what the "active" state is.

            adilger Andreas Dilger added a comment -
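            As a rough illustration of keeping that state on the OSC itself: mirror selection simply consults a flag shared by every file striped on that OST, so nothing has to be cached per lu_env, per file descriptor, or per layout and re-discovered by each thread. The od_active name is a simplified assumption, not the real osc_device definition:

             #include <stdbool.h>

             /* One shared object per OSC; its flag covers every file on that OST. */
             struct osc_device_sketch {
                     bool od_active;         /* import connected / OST usable */
             };

             struct mirror_entry_sketch {
                     struct osc_device_sketch *me_osc;
             };

             /* Mirror selection keyed off the shared per-OSC state: no per-file,
              * per-fd or per-lu_env copy to re-discover on every syscall. */
             static int pick_mirror(struct mirror_entry_sketch *m, int nr, int preferred)
             {
                     if (m[preferred].me_osc->od_active)
                             return preferred;

                     for (int i = 0; i < nr; i++)
                             if (m[i].me_osc->od_active)
                                     return i;

                     return preferred;       /* all inactive: wait on the preferred one */
             }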
            bobijam Zhenyu Xu added a comment -

            From the log I can see that for every 1 MB read, the lov_io's lis_mirror_index is cleared, so the read starts trying from mirror 0 again.

            In every ll_file_aio_read() iteration the lu_env is a new one, so lov_io::lis_mirror_index does not retain its old value, I think.


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: raot Joe Frith
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: