Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6851

LU-6060 patch breaks multi-hop routing by default.

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.8.0
    • Lustre 2.5.4
    • 3
    • 9223372036854775807

    Description

      Change 531ef4d on b2_5_fe, the patch from LU-6060, would appear to have introduced a bug that caused a major system outage during our upgrade to 2.5.4 today. The patch was based on 749dc54 from master.

      The code makes this, to me, puzzling assumption:

      -               rtr->lr_downis = down + (ptl_status == LNET_NI_STATUS_DOWN);
      +               /* if @down is zero and this route is single-hop, it means
      +                * we can't find NI for target network */
      +               if (down == 0 && rtr->lr_hops == 1)
      +                       down = 1;
      +
      

      First of all, lr_hops == 1 is the default, and lnet has never required hops to be set even on multi-hop systems. You are explicitly introducing a constraint that never existed in the past, and you have failed to communicate that new constraint to your customers.

      Adding this new constraint, while having avoid_asym_router_failure enabled by default, and hops defaulting to 1, is a bug plain and simple.

      Hops and priority were invented for fine grain routing. They are not, and never have been, required to be set in multi-hop lnet routing situations. The Lustre manual even shows them as optional.

      And speaking of the manual, the documented grammar for setting hops and priority are almost certainly wrong.

      routes=dest_lnet [hop] [priority] router_NID@src_lnet; \
      					   dest_lnet [hop] [priority] router_NID@src_lnet
      

      Since both hop and priority are numbers, I kind of doubt that there is any way to specify either one without the other. I imagine the grammar is really either this:

      routes=dest_lnet [hop [priority]] router_NID@src_lnet; \
      					   dest_lnet [hop [priority]] router_NID@src_lnet
      

      or this:

      routes=dest_lnet [hop priority] router_NID@src_lnet; \
      					   dest_lnet [hop priority] router_NID@src_lnet
      

      Finally, what happened with the review process? I see no reviews listed on the 531ef4d patch, and no reference to the patch on master from which this was backported. And there were additional changes in the backported patch not found in the original, so a review really should have been required.

      Attachments

        Issue Links

          Activity

            [LU-6851] LU-6060 patch breaks multi-hop routing by default.
            jgmitter Joseph Gmitter (Inactive) added a comment - http://review.whamcloud.com/#/c/15719/ has landed for 2.8.0

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15719/
            Subject: LU-6851 lnet: Ignore hops if not explicitly set
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cdc97ad749b541ea2e2f1c0b5b9bc35c37762877

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15719/ Subject: LU-6851 lnet: Ignore hops if not explicitly set Project: fs/lustre-release Branch: master Current Patch Set: Commit: cdc97ad749b541ea2e2f1c0b5b9bc35c37762877

            Thanks, Olaf, that is a very nice summary.

            I'm not sure about the accuracy of this sentence: "The code from LU-6060 works for ORNL for their primary (hopcount 1) routers, but not for their secondaries (configured hopcount > 1)." The LU-6060 change was primarily to improve detection of a fault that ORNL was experiencing. But that probably isn't too important for the discussion.

            Option 3 was an interesting idea, but you are correct that there is a guaranteed breakage. I also worry a bit about locking in the idea that a number in the valid hop count range doesn't actually represent the real lnet hop count.

            Options 1 and 2 have the same primary solution, and just differ on the secondary implementation details. I have been suggesting that primary solution all along, so I am certainly still in favor of it. I, of course recommended the secondary details in Option 2. But I'm not wed to them.

            I'll throw my support behind Option 1 as well.

            morrone Christopher Morrone (Inactive) added a comment - Thanks, Olaf, that is a very nice summary. I'm not sure about the accuracy of this sentence: "The code from LU-6060 works for ORNL for their primary (hopcount 1) routers, but not for their secondaries (configured hopcount > 1)." The LU-6060 change was primarily to improve detection of a fault that ORNL was experiencing. But that probably isn't too important for the discussion. Option 3 was an interesting idea, but you are correct that there is a guaranteed breakage. I also worry a bit about locking in the idea that a number in the valid hop count range doesn't actually represent the real lnet hop count. Options 1 and 2 have the same primary solution, and just differ on the secondary implementation details. I have been suggesting that primary solution all along, so I am certainly still in favor of it. I, of course recommended the secondary details in Option 2. But I'm not wed to them. I'll throw my support behind Option 1 as well.

            My attempt to summarize, based on the thread above and the 15719 review comments.

            • ORNL has an LNet setup with single-hop routes and fine-grained routing. The configuration specifies different hopcounts because route priority had not yet been introduced when this was set up. The code from LU-6060 works for ORNL for their primary (hopcount 1) routers, but not for their secondaries (configured hopcount > 1).
            • LLNL has an LNet setup with multi-hop routes. They do not specify the hopcount in their routes. The code for LU-6060 breaks this setup, because no traffic flows across multi-hop routes.

            In 15719 there are three proposals on the table for fixing this:

            1. Have a special "not set" value sent to the kernel when the hopcount wasn't specified.
              • lnet_compare_routes() treats unset as hopcount 1.
              • LNetDist() treats unset as hopcount 1.
              • lnet_parse_rc_info() treats unset to mean a route may be multi-hop.
            2. Have a special "not set" value sent to the kernel when the hopcount wasn't specified.
              • lnet_compare_routes() does not compare hopcounts if one of them was unset.
              • LNetDist() has to return something, unclear what, which cannot be an error.
              • lnet_parse_rc_info() treats unset to mean a route may be multi-hop.
            3. Change the default for unset hopcount to 255 (largest hopcount), so effectively:
              • lnet_compare_routes() treats unset as hopcount 255.
              • LNetDist() treats unset as hopcount 255.
              • lnet_parse_rc_info() treats unset as hopcount 255.

            Option (3) is the simplest to implement. It is also guaranteed to break multi-hop configurations where only the non-default hopcounts have been specified.

            Option (2) requires a decision to be made on what LNetDist() should do when it sees an undefined hopcount.

            Option (1) minimizes the change to behavior. I count that as an advantage because an unanticipated change in behavior is what landed us in this mess in the first place. For that reason I strongly recommend going with this option.

            olaf Olaf Weber (Inactive) added a comment - My attempt to summarize, based on the thread above and the 15719 review comments. ORNL has an LNet setup with single-hop routes and fine-grained routing. The configuration specifies different hopcounts because route priority had not yet been introduced when this was set up. The code from LU-6060 works for ORNL for their primary (hopcount 1) routers, but not for their secondaries (configured hopcount > 1). LLNL has an LNet setup with multi-hop routes. They do not specify the hopcount in their routes. The code for LU-6060 breaks this setup, because no traffic flows across multi-hop routes. In 15719 there are three proposals on the table for fixing this: Have a special "not set" value sent to the kernel when the hopcount wasn't specified. lnet_compare_routes() treats unset as hopcount 1. LNetDist() treats unset as hopcount 1. lnet_parse_rc_info() treats unset to mean a route may be multi-hop. Have a special "not set" value sent to the kernel when the hopcount wasn't specified. lnet_compare_routes() does not compare hopcounts if one of them was unset. LNetDist() has to return something, unclear what, which cannot be an error. lnet_parse_rc_info() treats unset to mean a route may be multi-hop. Change the default for unset hopcount to 255 (largest hopcount), so effectively: lnet_compare_routes() treats unset as hopcount 255. LNetDist() treats unset as hopcount 255. lnet_parse_rc_info() treats unset as hopcount 255. Option (3) is the simplest to implement. It is also guaranteed to break multi-hop configurations where only the non-default hopcounts have been specified. Option (2) requires a decision to be made on what LNetDist() should do when it sees an undefined hopcount. Option (1) minimizes the change to behavior. I count that as an advantage because an unanticipated change in behavior is what landed us in this mess in the first place. For that reason I strongly recommend going with this option.

            Because you and Cray wanted to build improvements to ARF on top of the hop count, and I'm all about compromise.

            But seriously, that is an interesting thought...I'm not entirely sure what other purpose it serves now that we have priority. Is there a real use case for having hops on top of priority for routing purposes? Can the ARF be made to work under the additional required failure modes without using hops? Somebody would have to give that more thought.

            15719 would still be reasonable to land in the mean time, even if we decided to take hops out at some point.

            morrone Christopher Morrone (Inactive) added a comment - Because you and Cray wanted to build improvements to ARF on top of the hop count, and I'm all about compromise. But seriously, that is an interesting thought...I'm not entirely sure what other purpose it serves now that we have priority. Is there a real use case for having hops on top of priority for routing purposes? Can the ARF be made to work under the additional required failure modes without using hops? Somebody would have to give that more thought. 15719 would still be reasonable to land in the mean time, even if we decided to take hops out at some point.

            Since the hops configuration requirements far exceeds the effort level an system administration can handle and according to your experience with multi hop the best configuration is to ignore the hops setting then why don't we just rip out the hops code then.

            simmonsja James A Simmons added a comment - Since the hops configuration requirements far exceeds the effort level an system administration can handle and according to your experience with multi hop the best configuration is to ignore the hops setting then why don't we just rip out the hops code then.

            We don't have scripts to generate lnet configurations per say, just configuration files.

            I don't think you are quite reading it right. I am suggesting that LU-6060 is the wrong approach for lustre, and that is why it shouldn't be in the code. Lustre is a filesystem designed for massive scalability. We need to keep in mind that scalability is not only about scaling data/metadata performance, is it also about scalability of configuration. Lustre is already more difficult to configure and administrate then it needs to be, and LU-6060 made configuration in large installations even harder.

            LU-6060 introduced a new requirement to LNet that all hops counts in a routed configuration have to be correct or LNet flat out does not work. I am saying that is a poor design for scalable configuration, and that is the reason it didn't belong in Lustre/LNet. Fortunately, the benefits of LU-6060 could be entirely optional! We can allow LNet to continue to be configured the same way it has been since inception LNet routers, while also providing additional checks and benefits to the users that are willing to go the extra mile during configuration.

            The changes that you made to use priority didn't effect LLNL in any way that I can see, and doesn't have all that much to do with LU-6060 either. If you made changes to use priority to express networking priority you were simply updating your scripts that originated back in the old days when there was no priority option, and you therefore needed to use hops to express priority. That was a reasonable decision at the time. I also understand how old configurations can linger for quite some time. I didn't care one way or another if you changed you scripts. It has no effect on LLNL. I think it was reasonable to upgrade those scripts at some point to use the priority option to express your route priority, but that change didn't have to happen in the context of this ticket or LU-6060. That didn't help anyone outside of ORNL. Since it provided no external benefit, I don't think that can be included in any perceived compromise.

            I apologize if something I said caused misunderstanding that resulted in you doing work earlier than you had planned.

            Landing change 15719 is the compromise position. It allows the configuration lnet routing to continue to work as it has since its inception by default, while also at the same time providing the benefits of LU-6060 to those persons who choose to take on the additional effort of setting hops counts everywhere.

            Change 15719 won't harm you, and it won't require us to accept new configuration burdens. Change 15719 is what the LU-6060 patch should have contained before we allowed it to land.

            Change 15719 puts right what once went wrong.

            morrone Christopher Morrone (Inactive) added a comment - We don't have scripts to generate lnet configurations per say, just configuration files. I don't think you are quite reading it right. I am suggesting that LU-6060 is the wrong approach for lustre, and that is why it shouldn't be in the code. Lustre is a filesystem designed for massive scalability. We need to keep in mind that scalability is not only about scaling data/metadata performance, is it also about scalability of configuration. Lustre is already more difficult to configure and administrate then it needs to be, and LU-6060 made configuration in large installations even harder. LU-6060 introduced a new requirement to LNet that all hops counts in a routed configuration have to be correct or LNet flat out does not work. I am saying that is a poor design for scalable configuration, and that is the reason it didn't belong in Lustre/LNet. Fortunately, the benefits of LU-6060 could be entirely optional! We can allow LNet to continue to be configured the same way it has been since inception LNet routers, while also providing additional checks and benefits to the users that are willing to go the extra mile during configuration. The changes that you made to use priority didn't effect LLNL in any way that I can see, and doesn't have all that much to do with LU-6060 either. If you made changes to use priority to express networking priority you were simply updating your scripts that originated back in the old days when there was no priority option, and you therefore needed to use hops to express priority. That was a reasonable decision at the time. I also understand how old configurations can linger for quite some time. I didn't care one way or another if you changed you scripts. It has no effect on LLNL. I think it was reasonable to upgrade those scripts at some point to use the priority option to express your route priority, but that change didn't have to happen in the context of this ticket or LU-6060 . That didn't help anyone outside of ORNL. Since it provided no external benefit, I don't think that can be included in any perceived compromise. I apologize if something I said caused misunderstanding that resulted in you doing work earlier than you had planned. Landing change 15719 is the compromise position. It allows the configuration lnet routing to continue to work as it has since its inception by default , while also at the same time providing the benefits of LU-6060 to those persons who choose to take on the additional effort of setting hops counts everywhere. Change 15719 won't harm you, and it won't require us to accept new configuration burdens. Change 15719 is what the LU-6060 patch should have contained before we allowed it to land. Change 15719 puts right what once went wrong.

            Am I reading this right but basically you are stating that Andreas suggestion to fix up your scripts for the hop count is the wrong thing since it is not what you do. Earlier for the LU-6060 patch your argument was revert the bogus patch and fix your scripts to us. Now its my turn to say the same thing. Do the right thing by fixing your scripts and forget about 15719 since it is not a proper fix.

            We compromise on the 2.5 LU-6060 revert and we are spending valuable time and resource to move to priority FGR as a gesture of good will to your needs. It is not cool if you refused to the do the same for us for the 2.8 release.

            simmonsja James A Simmons added a comment - Am I reading this right but basically you are stating that Andreas suggestion to fix up your scripts for the hop count is the wrong thing since it is not what you do. Earlier for the LU-6060 patch your argument was revert the bogus patch and fix your scripts to us. Now its my turn to say the same thing. Do the right thing by fixing your scripts and forget about 15719 since it is not a proper fix. We compromise on the 2.5 LU-6060 revert and we are spending valuable time and resource to move to priority FGR as a gesture of good will to your needs. It is not cool if you refused to the do the same for us for the 2.8 release.

            I would like to see the 15719 change finished and landed for 2.8.0.

            Just documenting the change is not sufficient for LLNL for 2.8. We do not think that added requirement is a particularly wise configuration design choice. If you make it optional with 15719, then it won't bother us.

            For the Intel's 2.5 FE branch, yes please revert the LU-6060 patch.

            morrone Christopher Morrone (Inactive) added a comment - I would like to see the 15719 change finished and landed for 2.8.0. Just documenting the change is not sufficient for LLNL for 2.8. We do not think that added requirement is a particularly wise configuration design choice. If you make it optional with 15719, then it won't bother us. For the Intel's 2.5 FE branch, yes please revert the LU-6060 patch.

            I would like to see the 15719 change finished and landed for 2.8.0.

            Just documenting the change is not sufficient for LLNL for 2.8. We do not think that added requirement is a particularly wise configuration design choice. If you make it optional with 15719, then it won't bother us.

            For the Intel's 2.5 FE branch, yes please revert the LU-6060 patch.

            morrone Christopher Morrone (Inactive) added a comment - I would like to see the 15719 change finished and landed for 2.8.0. Just documenting the change is not sufficient for LLNL for 2.8. We do not think that added requirement is a particularly wise configuration design choice. If you make it optional with 15719, then it won't bother us. For the Intel's 2.5 FE branch, yes please revert the LU-6060 patch.

            People

              ashehata Amir Shehata (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: