Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • Lustre 2.12.2
    • None
    • 2

    Description

      We have been setting the ko2iblnd timeout to 150 (the default is 50) on our cluster. From reading the code, this parameter is no longer used; lnet_lnd_timeout is used instead.

      For example, in kiblnd_queue_tx_locked():

          timeout_ns = lnet_get_lnd_timeout() * NSEC_PER_SEC;
          tx->tx_queued = 1;
          tx->tx_deadline = ktime_add_ns(ktime_get(), timeout_ns);
      

      and lnet_get_lnd_timeout() returns the new default of 5. Does this mean we went from 150 to 5?
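To make the question concrete, here is a minimal userspace sketch of the deadline arithmetic quoted above. It is not the o2iblnd code: `tx_deadline_ns` and `MODEL_NSEC_PER_SEC` are hypothetical names introduced for illustration, and the caller passes the current time in place of the kernel's `ktime_get()`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's NSEC_PER_SEC constant. */
#define MODEL_NSEC_PER_SEC 1000000000LL

/*
 * Hypothetical helper mirroring the quoted snippet: convert the LND
 * timeout (seconds) to nanoseconds and add it to the current time.
 */
static int64_t tx_deadline_ns(int64_t now_ns, unsigned int lnd_timeout_sec)
{
	int64_t timeout_ns = (int64_t)lnd_timeout_sec * MODEL_NSEC_PER_SEC;

	assert(now_ns >= 0);
	return now_ns + timeout_ns;
}
```

Under this model, a transmit queued at the same instant would expire 145 seconds earlier with an LND timeout of 5 than with 150, which is the gap being asked about.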

      The documentation says that lnet_lnd_timeout is derived from lnet_transaction_timeout and retry_count. But that does not appear to be what gets set for tx->tx_deadline.

      Am I reading the code correctly?

          Activity

            [LU-13020] ko2iblnd tuning

            We can close this

            mhanafi Mahmoud Hanafi added a comment

            I will try to reproduce the issue with retry_count=2 on our test filesystem and gather logs.

            mhanafi Mahmoud Hanafi added a comment

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36944
            Subject: LU-13020 o2iblnd: timeout is now obsolete
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d6cbd02ff5d17b0cf53683e1c8db8364850ac4cc

            gerrit Gerrit Updater added a comment

            Would it be possible to get some net logging around the timeouts that occur with:

            lnet_transaction_timeout=200
            lnet_retry_count=2 

            It would be very useful for me to understand the cause of these timeouts, so I can fine tune the feature if required.

            The idea with this config is that the LND timeout will be set to 100. We'll attempt 2 retries within that 200s window.

            I'd like to see whether we do attempt the retry, for what reason, and what the implications of the retries are.

            ashehata Amir Shehata (Inactive) added a comment

            I ended up setting

                transaction_timeout: 200
                at_min=275

            Things look OK for now. I'll know more in a few days.
            I tried setting retry_count=2 with transaction_timeout=200, and it causes a lot of timeouts. That is worse than setting transaction_timeout=100 and retry_count=0.

            mhanafi Mahmoud Hanafi added a comment

            Did increasing the timeout resolve the client evictions issue (ref: LU-11644)?

            ashehata Amir Shehata (Inactive) added a comment

            The ko2iblnd timeout should be removed and the docs should be updated. This would eliminate some of the confusion.

            mhanafi Mahmoud Hanafi added a comment

            So

                #define LNET_LND_DEFAULT_TIMEOUT 5

            is not used? Because when the lnet module loads, it sets it to 50?

            Our cluster is running 2.12.2 everywhere and we still get client evictions. Enabling 'debug=+net' makes the issue go away (ref: LU-11644).

            The other issue: on our remote HDR cluster we are getting RDMA timeouts to the Lustre routers. Running lnet_selftest within the compute nodes, we also get timeouts. This may be a fabric issue, but something is not quite right.
            I'll run some tests changing lnet_transaction_timeout to see if I can see some difference.

            mhanafi Mahmoud Hanafi added a comment - edited

            If you look at retry_count_set() and transaction_to_set(), you'll see that they set lnet_lnd_timeout based on the following calculation:

            retry_count_set()
                if (value == 0)
                        lnet_lnd_timeout = lnet_transaction_timeout;
                else
                        lnet_lnd_timeout = lnet_transaction_timeout / value;

            transaction_to_set()
                *transaction_to = value;
                if (lnet_retry_count == 0)
                        lnet_lnd_timeout = value;
                else
                        lnet_lnd_timeout = value / lnet_retry_count;

            I checked master and if you specify

             options lnet lnet_transaction_timeout=150

            transaction_to_set() gets called:

             (api-ni.c:442:transaction_to_set()) lnet_lnd_timeout = 50
            
            # it's 50 because retry_count=3
            # however if you do this:
            options lnet lnet_retry_count=0
            options lnet lnet_transaction_timeout=150
            
            # you should see:
            (api-ni.c:442:transaction_to_set()) lnet_lnd_timeout = 50
            (api-ni.c:489:retry_count_set()) lnet_lnd_timeout = 150
            
            # note the above is debug code I added.
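The sequence in the debug output above can be replayed with a small userspace model. This is only a sketch of the two calculations quoted in this comment, not the code in api-ni.c: the `model_*` functions and `replay_modprobe_options()` are hypothetical names, the starting retry count of 3 is taken from the note above, and it is assumed that the retry-count setter also stores the new count.

```c
#include <assert.h>

/* Userspace model only; variable names mirror the parameters in this thread. */
static unsigned int lnet_retry_count;
static unsigned int lnet_transaction_timeout;
static unsigned int lnet_lnd_timeout;

/* Models the calculation quoted from transaction_to_set(). */
static void model_transaction_to_set(unsigned int value)
{
	lnet_transaction_timeout = value;
	if (lnet_retry_count == 0)
		lnet_lnd_timeout = value;
	else
		lnet_lnd_timeout = value / lnet_retry_count;
}

/* Models the calculation quoted from retry_count_set(); storing the
 * count is an assumption, since that line is not quoted. */
static void model_retry_count_set(unsigned int value)
{
	lnet_retry_count = value;
	if (value == 0)
		lnet_lnd_timeout = lnet_transaction_timeout;
	else
		lnet_lnd_timeout = lnet_transaction_timeout / value;
}

/* Replays the order seen in the debug output: the transaction timeout
 * is applied first (retry count still 3), then the retry count. */
static unsigned int replay_modprobe_options(void)
{
	lnet_retry_count = 3;
	model_transaction_to_set(150);
	assert(lnet_lnd_timeout == 50);
	model_retry_count_set(0);
	return lnet_lnd_timeout;
}
```

In this model the intermediate value of 50 appears only because the transaction timeout is applied while the retry count is still 3; once the retry count becomes 0, the LND timeout equals the full transaction timeout of 150.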

            Is this not working for you?

            ashehata Amir Shehata (Inactive) added a comment

            People

              ashehata Amir Shehata (Inactive)
              mhanafi Mahmoud Hanafi