Enabling avoid_asym_router_failure prevents the bring-up of ORNL production systems

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.3, Lustre 2.5.3
    • Any 2.4/2.5 clients running against 2.4.3 or 2.5.3 servers.
    • 3
    • 16157

    Description

      With the deployment of Lustre 2.5 center-wide at ORNL, we encountered problems bringing up the production system because avoid_asym_router_failure was enabled by default. The LNET fabric would fail to come up when it was enabled. Once it was turned off by default, everything returned to normal. This would be a useful feature to have for a system of this scale. The problem can be easily reproduced at smaller scale in both non-FGR and FGR setups.

          Activity

            [LU-5758] Enabling avoid_asym_router_failure prevents the bring-up of ORNL production systems
            pjones Peter Jones added a comment -

            Great - thanks James

            simmonsja James A Simmons added a comment - - edited

            No problems. We have been running ARF for months now. You can close this ticket.

            yujian Jian Yu added a comment -

            Hi James,
            Does the ARF issue still exist? If not, can we close this ticket as resolved?


            liang Liang Zhen (Inactive) added a comment -

            It's OK to mix them; in that case the client/server just can't detect a failed remote NI on the routers.

            yujian Jian Yu added a comment -

            Hi Liang,
            With the two LNet patches you mentioned above, ORNL tested ARF on a small-scale cluster and found that it worked. They are going to test it on a large-scale cluster.

            Now, they have a question about ARF:
            "Does this need to be completely on or completely off, or is it possible to have some clusters have it enabled and not others?"

            Hi James,
            Could you please explain more in case I did not describe the question clearly? Thank you.

            liang Liang Zhen (Inactive) added a comment - - edited

            Hi James, because there has been a lot of discussion already, I think we should summarise where we are now. Please check and correct me if I'm wrong or have missed something:

            • There are two LNet patches in your environment:
              • http://review.whamcloud.com/#/c/12435 (LU-5485 lnet: peer aliveness status and NI status)
              • http://review.whamcloud.com/#/c/13162 (LU-6060 lnet: set downis to 1 if there's no NI for remote net)
            • Both patches are applied to all nodes (client, server and router).
            • If there is any other LNet patch, could you post a link to the patches, or a tarball of lnet?
            • ARF can't detect an unplugged IB interface on a router with the above changes.
              • In the comment posted at 31/Dec/14 8:59 AM, we found that the router status on the client is wrong while the IB interface facing the server is unplugged, but the router status on the server is correct after unplugging the IB interface facing the client.
            • This problem could be either because the router can't set the NI status to "DOWN" for an unknown reason, or because the client/server records the wrong NI status for the router by mistake. To find out which is the real reason, we first need to check the NI status on the router after unplugging the NI. Because it takes a few minutes for LNet to detect and mark an NI as down, we need to sample /proc/sys/lnet/nis for 5 minutes, once every 10 seconds.
              • If the NI status stays UP forever, then we know the router has a defect and can't change the NI status.
              • If the NI status turns to DOWN, it's probably a bug on the non-router node; could you check /proc/sys/lnet/routes and /proc/sys/lnet/routers on the client and server?
            • We also need to know these values in your environment:
              • live_router_check_interval
              • dead_router_check_interval
              • router_ping_timeout

            I also have another question:

            • Is the "unplugging IB interface" here any different from the experiment for LU-6060?

            Isaac, do you have any advice or insight on this issue?
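
The sampling requested in the comment above (reading /proc/sys/lnet/nis on the router once every 10 seconds for 5 minutes) could be scripted along the following lines. This is only a sketch and not a tool from the ticket: the helper names parse_nis and sample_nis are hypothetical, and the parser assumes that each line of the nis file is whitespace-separated, with the NID in the first column and the status in the second, after a header row beginning with "nid".

```python
import time
from pathlib import Path

# Path named in the comment above; only present on nodes with LNet loaded.
NIS_PATH = Path("/proc/sys/lnet/nis")


def parse_nis(text):
    """Return {nid: status} from the text of the nis file.

    Assumption: columns are whitespace-separated, NID first and status
    second, with an optional header row whose first field is "nid".
    """
    statuses = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].lower() == "nid":
            continue  # skip blank, malformed, and header lines
        statuses[fields[0]] = fields[1].lower()
    return statuses


def sample_nis(duration_s=300, interval_s=10, path=NIS_PATH, sleep=time.sleep):
    """Sample NI status every interval_s seconds for duration_s seconds.

    Returns a list of (elapsed_seconds, {nid: status}) tuples. The sleep
    function is injectable so the loop can be exercised without waiting.
    """
    samples = []
    for elapsed in range(0, duration_s + 1, interval_s):
        if elapsed:
            sleep(interval_s)
        samples.append((elapsed, parse_nis(Path(path).read_text())))
    return samples
```

Run on the router after unplugging the interface, sample_nis() returns one snapshot per interval; comparing the status recorded for the unplugged NI across snapshots shows whether it ever leaves "up", which is the distinction the two sub-cases above depend on.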

            People

              liang Liang Zhen (Inactive)
              simmonsja James A Simmons