[LU-5758] enabling avoid_asym_router_failure prevents the bring up of ORNL production systems

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.3, Lustre 2.5.3
    • Any 2.4/2.5 clients running against 2.4.3 or 2.5.3 servers.
    • 3
    • 16157

    Description

      With the center-wide deployment of Lustre 2.5 at ORNL, we encountered problems bringing up the production system because avoid_asym_router_failure was enabled by default. The LNET fabric would fail to come up when it was enabled. Once it was turned off by default, everything returned to normal. This would be a useful feature to have for a system of this scale. The problem can be easily reproduced at smaller scale in both non-FGR and FGR setups.

          Activity

            pjones Peter Jones added a comment -

            Great - thanks James

            simmonsja James A Simmons added a comment - - edited

            No problems. We have been running ARF for months now. You can close this ticket.

            yujian Jian Yu added a comment -

            Hi James,
            Does the ARF issue still exist? If not, can we close this ticket as resolved?


            liang Liang Zhen (Inactive) added a comment -

            It's OK to mix them; in that case the client/server just can't detect a failed remote NI on the routers.

            yujian Jian Yu added a comment -

            Hi Liang,
            With the two LNet patches you mentioned above, ORNL tested ARF on a small scale cluster and found it worked. They are going to test it on a large scale cluster.

            Now, they have a question about ARF:
            "Does this need to be completely on or completely off, or is it possible to have some clusters have it enabled and not others?"

            Hi James,
            Could you please explain more in case I did not describe the question clearly? Thank you.

            liang Liang Zhen (Inactive) added a comment - - edited

            Hi James, because there has been a lot of discussion already, I think we should summarise where we are now. Please check and correct me if I'm wrong or have missed something:

            • there are two LNet patches in your environment
              • http://review.whamcloud.com/#/c/12435 (LU-5485 lnet: peer aliveness status and NI status)
              • http://review.whamcloud.com/#/c/13162 (LU-6060 lnet: set downis to 1 if there's no NI for remote net)
            • both patches are applied to all nodes (client, server and router)
            • if there is any other LNet patch, could you post links to the patches, or a tarball of lnet?
            • ARF can't detect an unplugged IB interface on a router with the above changes
              • in the comment posted at 31/Dec/14 8:59 AM, we found that the router status on the client is wrong when the IB interface facing the server is unplugged, but the router status on the server is correct after the IB interface facing the client is unplugged.
            • This problem could be either because the router can't set the NI status to "DOWN" for an unknown reason, or because the client/server records the wrong NI status for the router by mistake. To find out which is the real reason, we first need to check the NI status on the router after unplugging the NI. Because it takes a few minutes for LNet to detect and mark an NI as down, we need to sample /proc/sys/lnet/nis once every 10 seconds for 5 minutes.
              • if the NI status stays UP forever, then we know the router has a defect and can't change the NI status.
              • if the NI status turns to DOWN, it's probably a bug on the non-router nodes; could you check /proc/sys/lnet/routes and /proc/sys/lnet/routers on the client and server?
            • we also need to know these values in your environment:
              • live_router_check_interval
              • live_router_check_interval
              • router_ping_timeout

            I also have another question:

            • Is the "unplugging IB interface" here any different from the experiment for LU-6060?

            Isaac, do you have any advice or insight on this issue?
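The sampling requested above (read /proc/sys/lnet/nis once every 10 seconds for 5 minutes) could be scripted roughly as follows. This is only a sketch and not something posted in the ticket: the function names are hypothetical, and the parser assumes the file starts with a header line and that each following line begins with the NID and then its status, which may not hold for every Lustre version.

```python
import time
from pathlib import Path

# Path named in the comment above; run this on the router node.
NIS_PATH = Path("/proc/sys/lnet/nis")


def parse_nis(text):
    """Return {nid: status} from the contents of the nis file.

    Assumption: the first line is a column header and every other line
    starts with the NID followed by its status (e.g. "up" or "down").
    If a NID appears on several lines, the last line wins.
    """
    statuses = {}
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 2:
            statuses[fields[0]] = fields[1].lower()
    return statuses


def sample_nis(read=None, interval=10, duration=300,
               sleep=time.sleep, out=print):
    """Sample the NI status every `interval` seconds for `duration` seconds.

    `read`, `sleep` and `out` are injectable so the loop can be exercised
    without a Lustre node; by default the real file is read.
    """
    if read is None:
        read = NIS_PATH.read_text
    count = duration // interval + 1  # include the sample at t = 0
    samples = []
    for i in range(count):
        statuses = parse_nis(read())
        out(f"{time.strftime('%H:%M:%S')} {statuses}")
        samples.append(statuses)
        if i < count - 1:
            sleep(interval)
    return samples


if __name__ == "__main__":
    sample_nis()
```

Run on the router right after unplugging the interface, the printed lines show whether the NI ever leaves the "up" state within the five minutes, which is the distinction the two sub-cases above depend on.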

            simmonsja James A Simmons added a comment -

            Liang, I'm setting up the test. I should have results soon. I think you want something else besides the two live_router_check_interval entries above.

            liang Liang Zhen (Inactive) added a comment - - edited

            Yujian, yes we need more logs to find out the reason.

            James, in the next round of tests, could you please sample lnet/nis on the router every 10 seconds and record it for a total of 300 seconds?

            Also, it would be helpful to know these values on the router: live_router_check_interval, live_router_check_interval, router_ping_timeout. And if possible, could you post your lnet source here so I can check all the patches?

            Thanks in advance.
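
One way to collect the requested values is to read them from sysfs. The sketch below assumes the lnet module exposes its parameters under /sys/module/lnet/parameters, which depends on how the module was built and loaded; the helper name and the directory constant are hypothetical, and parameters that cannot be read are reported as unavailable rather than guessed.

```python
from pathlib import Path

# Assumed location of the lnet module parameters; adjust if your
# system exposes them elsewhere.
PARAM_DIR = Path("/sys/module/lnet/parameters")

# Parameter names taken from this ticket.
PARAM_NAMES = (
    "avoid_asym_router_failure",
    "live_router_check_interval",
    "router_ping_timeout",
)


def read_params(names=PARAM_NAMES, param_dir=PARAM_DIR):
    """Return {name: value-as-string, or None if the file is unreadable}."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(param_dir) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_params().items():
        shown = value if value is not None else "unavailable"
        print(f"{name} = {shown}")
```

Running it on each router, client and server and posting the output alongside the nis samples would cover the values asked for here.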

            yujian Jian Yu added a comment -

            I see, thank you James for the clarification.

            Hi Liang,
            Would you like James to gather more logs to help investigate the ARF issue?


            simmonsja James A Simmons added a comment -

            The tests were done with 12435. It helped in that we can now mount Lustre with ARF, but now ARF itself doesn't work.

            People

              liang Liang Zhen (Inactive)
              simmonsja James A Simmons