  1. Lustre
  2. LU-5758

enabling avoid_asym_router_failure prevents the bringup of ORNL production systems

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.4.3, Lustre 2.5.3
    • Environment: Any 2.4/2.5 clients running against 2.4.3 or 2.5.3 servers.
    • Severity: 3
    • 16157

    Description

      With the deployment of Lustre 2.5 center-wide at ORNL, we encountered problems bringing up the production system because avoid_asym_router_failure
      is enabled by default. The LNet fabric would fail to come up when it was enabled; once it was turned off by default, everything returned to normal. This would be a useful feature to have for a system of this scale. The problem can easily be reproduced at smaller scale in both non-FGR and FGR setups.
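
      For reference, avoid_asym_router_failure is an LNet module parameter, so toggling it per node is normally done through the lnet module options. A minimal sketch follows; the file path and choice of values are illustrative, not ORNL's actual configuration:

        # /etc/modprobe.d/lustre.conf (hypothetical path)
        # Disable asymmetric-router-failure detection while the fabric is brought up
        options lnet avoid_asym_router_failure=0
        # The current setting can be checked at runtime with:
        #   cat /sys/module/lnet/parameters/avoid_asym_router_failure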

      Attachments

        Issue Links

          Activity

            [LU-5758] enabling avoid_asym_router_failure prevents the bringup of ORNL production systems
            yujian Jian Yu added a comment -

            Hi James,
            Does the ARF issue still exist? If not, can we close this ticket as resolved?


            liang Liang Zhen (Inactive) added a comment -

            It's OK to mix them; in that case the client/server just can't find out about a failed remote NI on the routers.
            yujian Jian Yu added a comment -

            Hi Liang,
            With the two LNet patches you mentioned above, ORNL tested ARF on a small-scale cluster and found it worked. They are going to test it on a large-scale cluster.

            Now, they have a question about ARF:
            "Does this need to be completely on or completely off, or is it possible to have some clusters have it enabled and not others?"

            Hi James,
            Could you please explain more in case I did not describe the question clearly? Thank you.

            liang Liang Zhen (Inactive) added a comment - - edited

            Hi James, because there are many discussions already, I think we should summarise where we are now. Please check and correct me if I'm wrong or missed something:

            • there are two LNet patches in your environment:
              • http://review.whamcloud.com/#/c/12435 (LU-5485 lnet: peer aliveness status and NI status)
              • http://review.whamcloud.com/#/c/13162 (LU-6060 lnet: set downis to 1 if there's no NI for remote net)
            • both patches are applied to all nodes (client, server and router)
            • if there is any other LNet patch, could you post links to the patches, or a tarball of lnet?
            • ARF can't detect an unplugged IB interface on a router with the above changes
              • in the comment posted at 31/Dec/14 8:59 AM, we found that the router status on the client is wrong after unplugging the IB interface facing the server, but the router status on the server is correct after unplugging the IB interface facing the client.
            • This problem could be either because the router can't set the NI status to "DOWN" for some unknown reason, or because the client/server records the wrong NI status for the router by mistake. To find out which is the real reason, we first need to check the NI status on the router after unplugging the NI. Because it takes a few minutes for LNet to detect and mark an NI as down, we need to sample /proc/sys/lnet/nis for 5 minutes, once every 10 seconds (a small sampling sketch follows this list).
              • if the NI status stays UP forever, then we know the router has a defect and can't change the NI status.
              • if the NI status turns to DOWN, it's probably a bug on a non-router node; could you check /proc/sys/lnet/routes and /proc/sys/lnet/routers on the client and server?
            • we also need to know these values in your environment:
              • live_router_check_interval
              • dead_router_check_interval
              • router_ping_timeout
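
            A minimal sampling sketch along these lines, assuming root access and the standard proc paths named above (the output directory is only an example):

              #!/bin/bash
              # Sample LNet state every 10 seconds for 5 minutes (30 samples) so we can
              # see whether the router ever marks the unplugged NI as "down".
              outdir=/tmp/lu5758-lnet-samples        # hypothetical location
              mkdir -p "$outdir"
              for i in $(seq 1 30); do
                  ts=$(date +%Y%m%d-%H%M%S)
                  cat /proc/sys/lnet/nis > "$outdir/nis.$ts"
                  # on clients and servers, also record the router tables
                  [ -r /proc/sys/lnet/routes ]  && cat /proc/sys/lnet/routes  > "$outdir/routes.$ts"
                  [ -r /proc/sys/lnet/routers ] && cat /proc/sys/lnet/routers > "$outdir/routers.$ts"
                  sleep 10
              done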

            I also have another question:

            • The "unplugging IB interface" at here has any difference with the experiment for LU-6060?

            Isaac,do you have any advice or insight on this issue?


            simmonsja James A Simmons added a comment -

            Liang, I'm setting up the test. I should have results soon. I think you want something else besides the two live_router_check_intervals above.
            liang Liang Zhen (Inactive) added a comment - - edited

            Yujian, yes, we need more logs to find out the reason.

            James, in the next round of tests, could you please sample lnet/nis on the router every 10 seconds, recording it for a total of 300 seconds?

            Also, it would help to let us know these values on the router: live_router_check_interval, dead_router_check_interval, and router_ping_timeout. And if possible, could you post your lnet source here so I can check all the patches?

            Thanks in advance.
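
            If it helps, those values can usually be read straight off the router from sysfs. A sketch, assuming the standard LNet module parameter names (dead_router_check_interval is listed on the assumption that it is the second interval meant above):

              # dump the router-checker tunables on the router node
              for p in live_router_check_interval dead_router_check_interval \
                       router_ping_timeout avoid_asym_router_failure; do
                  printf '%s = %s\n' "$p" "$(cat /sys/module/lnet/parameters/$p)"
              done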

            yujian Jian Yu added a comment -

            I see, thank you James for the clarification.

            Hi Liang,
            Would you like James to gather more logs to help investigate the ARF issue?


            simmonsja James A Simmons added a comment -

            The tests were done with 12435. It helped in that we can now mount Lustre with ARF, but ARF itself still doesn't work.
            yujian Jian Yu added a comment -

            "If this is true, I have some changes for LU-5485, which will avoid using a potentially dead router and help this case, but I need more time to make it ready for production."

            Here is the patch created by Liang on the Lustre b2_5 branch: http://review.whamcloud.com/12435


            simmonsja James A Simmons added a comment -

            If I remember right, for this set of tests we waited for 5 minutes, and then we waited another 5 minutes after that, so it remained in the same state. For the next round of tests I will make sure the logs are taken after the timer expires.
            liang Liang Zhen (Inactive) added a comment - - edited

            Hi James, because it normally takes about 2 minutes to mark an NI as "DOWN" (depending on router_ping_timeout and dead/live_router_check_interval), may I ask how long you waited after unplugging the interface, and what the values of those module parameters are?

            I checked your logs:

            Facing the client, the last alive news is 42 seconds ago, which does not seem long enough:

            nid                   status  alive  refs  peer  rtr  max  tx   min
            10.38.144.12@o2ib4    up      42     1     63    0    640  640  640
            

            Facing the server, the last alive news is only 16 seconds ago:

            nid                   status  alive  refs  peer  rtr  max  tx   min
            10.36.145.12@o2ib     up      16     3     63    0    640  640  634
            

            Because in the previous logs we saw inconsistent results from the server and the client (the server saw the correct NI status on the router, but the client didn't), I suspect this test should wait a little longer.
            If this is true, I have some changes for LU-5485, which will avoid using a potentially dead router and help this case, but I need more time to make it ready for production.
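
            For context, if the routers were running with the stock LNet defaults (an assumption: live_router_check_interval = 60 seconds and router_ping_timeout = 50 seconds), the detection window would be on the order of 60 + 50 ≈ 110 seconds, consistent with the "about 2 minutes" above; the 42-second and 16-second "alive" ages in these samples are well inside that window, so waiting longer before collecting the logs seems warranted.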


            People

              Assignee: liang Liang Zhen (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: