LU-12328: FLR mirroring on 2.12.1-1 not usable if OST is down

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.1
    • Environment: RHEL 7.6
    • Severity: 3

    Description

      See below for the stripe details of the file "mirror10". If OST idx 1 is unmounted and made unavailable, read performance drops to roughly 1/10th of what is expected: the client has to time out on OST idx 1 before it tries to read from OST idx 7, and this happens for every 1 MB block (the stripe size in use), resulting in very poor performance.

      $ lfs getstripe mirror10
      mirror10
       lcm_layout_gen: 5
       lcm_mirror_count: 2
       lcm_entry_count: 2
       lcme_id: 65537
       lcme_mirror_id: 1
       lcme_flags: init
       lcme_extent.e_start: 0
       lcme_extent.e_end: EOF
       lmm_stripe_count: 1
       lmm_stripe_size: 1048576
       lmm_pattern: raid0
       lmm_layout_gen: 0
       lmm_stripe_offset: 1
       lmm_pool: 01
       lmm_objects:
       - 0: { l_ost_idx: 1, l_fid: [0x100010000:0x280a8:0x0] }
      
      lcme_id: 131074
       lcme_mirror_id: 2
       lcme_flags: init
       lcme_extent.e_start: 0
       lcme_extent.e_end: EOF
       lmm_stripe_count: 1
       lmm_stripe_size: 1048576
       lmm_pattern: raid0
       lmm_layout_gen: 0
       lmm_stripe_offset: 7
       lmm_pool: 02
       lmm_objects:
       - 0: { l_ost_idx: 7, l_fid: [0x100070000:0x28066:0x0] }
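      For reference, a two-mirror layout like the one above can be created with "lfs mirror create"; the pool names and stripe size below match the getstripe output, while the mount point and file name are illustrative:

      # mirror 1 on pool "01", mirror 2 on pool "02", both with a 1 MiB stripe size
      $ lfs mirror create -N -p 01 -S 1M -N -p 02 -S 1M /mnt/lustre/mirror10
      # confirm the resulting layout
      $ lfs getstripe /mnt/lustre/mirror10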
      

          Activity

            [LU-12328] FLR mirroring on 2.12.1-1 not usable if OST is down

            adilger Andreas Dilger added a comment -

            The patch was reverted because it was causing frequent crashes in testing (LU-12525).

            The original patch https://review.whamcloud.com/34952 "LU-12328 flr: avoid reading unhealthy mirror" should fix the original problem, but it needs to be refreshed again.
            raot Joe Frith added a comment -

            Did the patch get reverted after being included in master?

            We are still hoping that this issue gets resolved so we can go ahead with the maintenance.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35450/
            Subject: Revert "LU-12328 flr: preserve last read mirror"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0a8750628d9a87f686b917c88e42093a52a78ae3

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35450
            Subject: Revert "LU-12328 flr: preserve last read mirror"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e4788166435a05e5fe39107ebbcb167e13a74bcc

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35111/
            Subject: LU-12328 flr: preserve last read mirror
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 810f2a5fef577b4f0f6a58ab234cf29afd96c748

            adilger Andreas Dilger added a comment -

            As yet the patch has not landed on master, so that would need to happen before it can land to b2_12.
            raot Joe Frith added a comment -

            Will this patch be included in Lustre 2.12.3? We are delaying maintenance because of this issue. 


            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: https://review.whamcloud.com/35111
            Subject: LU-12328 flr: preserve last read mirror
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7f9985832de7699f06fdef2916a280a3666ca7cf

            bzzz Alex Zhuravlev added a comment -

            I'm not against changing the semantics of rq_no_delay, but it should be noted that other users (like lfs df) would then need to be changed, since those callers want to try to connect at least once before giving up.
            Given the extra cost of preparing an RPC, I think a better/cheaper interface for checking connection status is needed.
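            For reference, the connection state being discussed is already visible from userspace through the OSC import files (this is the existing lctl/procfs interface, not the cheaper in-kernel check being proposed here):

            # per-OST connection state as seen by this client (FULL, DISCONN, IDLE, ...)
            $ lctl get_param osc.*.state
            # more detail on each import, including connect flags and failure history
            $ lctl get_param osc.*.import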

            adilger Andreas Dilger added a comment -

            Alex, I definitely have some ideas on client-side read policy, in order to maximize global throughput vs. single-client throughput, from LDEV-436:

            In particular, it would be good if reading the same data from a file normally reads from the same OST/replica so that this can be handled from the cache, rather than using a random replica and forcing disk reads on multiple OSTs. Not only does this avoid disk activity, but it also avoids the case of a single client doing reads from multiple OSTs and needing to get DLM locks from each one.

            If the file is getting very large (e.g. multi-GB), it makes sense to spread the read workload across multiple OSTs in some deterministic manner (e.g. based on replica count and file offset) so that there are multiple OSTs active on the file, and at least several of the replicas active if many clients are reading at different offsets from the same file.

            If there are large numbers of replicas for a single file, then the clients should spread the read workload across all of them (e.g. based on client NID), on the assumption that a user creates 10+ replicas of a file to increase the read bandwidth.

            Jinshan Jinshan Xiong added a comment -

            Throughput from a single node has never been a goal for FLR, so the current logic is to find an available mirror and stick with that one.

            And yes, I think we should not block `rq_no_delay`.

            Let's set the IDLE connection aside for a bit first - if I remember correctly, the user is actually experiencing a problem where the connection is in DISCON; should we fix that first?
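            As a possible manual workaround while the read policy is being sorted out (not the fix discussed in this ticket), reads can be steered away from the unavailable OST by setting the FLR "prefer" flag on the healthy mirror's component; the component ID below is taken from the getstripe output in the description:

            # prefer mirror 2 (component 131074, on OST idx 7) while OST idx 1 is down
            $ lfs setstripe --comp-set -I 131074 --comp-flags=prefer mirror10
            # clear the flag again once OST idx 1 is back in service
            $ lfs setstripe --comp-set -I 131074 --comp-flags=^prefer mirror10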

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: raot Joe Frith
              Votes: 0
              Watchers: 8
