Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6851

LU-6060 patch breaks multi-hop routing by default.

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.8.0
    • Lustre 2.5.4
    • 3
    • 9223372036854775807

    Description

      Change 531ef4d on b2_5_fe, the patch from LU-6060, would appear to have introduced a bug that caused a major system outage during our upgrade to 2.5.4 today. The patch was based on 749dc54 from master.

      The code makes this, to me, puzzling assumption:

      -               rtr->lr_downis = down + (ptl_status == LNET_NI_STATUS_DOWN);
      +               /* if @down is zero and this route is single-hop, it means
      +                * we can't find NI for target network */
      +               if (down == 0 && rtr->lr_hops == 1)
      +                       down = 1;
      +
      

      First of all, lr_hops == 1 is the default, and lnet has never required hops to be set even on multi-hop systems. You are explicitly introducing a constraint that never existed in the past, and you have failed to communicate that new constraint to your customers.

      Adding this new constraint, while having avoid_asym_router_failure enabled by default, and hops defaulting to 1, is a bug plain and simple.

      Hops and priority were invented for fine grain routing. They are not, and never have been, required to be set in multi-hop lnet routing situations. The Lustre manual even shows them as optional.

      And speaking of the manual, the documented grammar for setting hops and priority are almost certainly wrong.

      routes=dest_lnet [hop] [priority] router_NID@src_lnet; \
      					   dest_lnet [hop] [priority] router_NID@src_lnet
      

      Since both hop and priority are numbers, I kind of doubt that there is any way to specify either one without the other. I imagine the grammar is really either this:

      routes=dest_lnet [hop [priority]] router_NID@src_lnet; \
      					   dest_lnet [hop [priority]] router_NID@src_lnet
      

      or this:

      routes=dest_lnet [hop priority] router_NID@src_lnet; \
      					   dest_lnet [hop priority] router_NID@src_lnet
      

      Finally, what happened with the review process? I see no reviews listed on the 531ef4d patch, and no reference to the patch on master from which this was backported. And there were additional changes in the backported patch not found in the original, so a review really should have been required.

      Attachments

        Issue Links

          Activity

            People

              ashehata Amir Shehata (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: