[LU-13020] ko2iblnd tuning Created: 26/Nov/19  Updated: 06/Jan/21  Resolved: 06/Jan/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.2
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Amir Shehata (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13145 LNet Health: increase transaction tim... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

We have been setting ko2iblnd timeout = 150 (default of 50) for our cluster. From reading the code, this is no longer used; lnet_lnd_timeout is used instead.

For example, in kiblnd_queue_tx_locked():

    timeout_ns = lnet_get_lnd_timeout() * NSEC_PER_SEC;
    tx->tx_queued = 1;
    tx->tx_deadline = ktime_add_ns(ktime_get(), timeout_ns);
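A minimal userspace sketch of the same deadline arithmetic, to make the dependency explicit: the deadline is simply "now" plus whatever the LND timeout getter returns, in nanoseconds. The names sketch_tx, sketch_queue_tx and sketch_tx_expired are hypothetical, plain int64_t nanosecond counters stand in for ktime_t, and the expiry check is an assumption about how such a deadline would be consumed, not the ko2iblnd implementation.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for a queued transmit descriptor. */
struct sketch_tx {
    int     tx_queued;       /* non-zero once the tx is on a queue */
    int64_t tx_deadline_ns;  /* absolute deadline, in nanoseconds */
};

/* Mirror of the quoted snippet: deadline = now + timeout (seconds -> ns). */
static void sketch_queue_tx(struct sketch_tx *tx, int64_t now_ns,
                            int lnd_timeout_sec)
{
    assert(lnd_timeout_sec >= 0);
    tx->tx_queued = 1;
    tx->tx_deadline_ns = now_ns + (int64_t)lnd_timeout_sec * NSEC_PER_SEC;
}

/* Assumed consumer: a queued tx counts as expired once now reaches the deadline. */
static int sketch_tx_expired(const struct sketch_tx *tx, int64_t now_ns)
{
    return tx->tx_queued && now_ns >= tx->tx_deadline_ns;
}
```

With a timeout of 5 and a start time of 0, the sketch sets the deadline to 5,000,000,000 ns, so a value of 150 supplied through a different tunable would have no effect on it.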

and lnet_get_lnd_timeout() returns the new default of 5. Does this mean we went from 150 to 5?

The documentation says that lnet_lnd_timeout is derived from lnet_transaction_timeout and retry_count, but that does not appear to be what is applied to tx->tx_deadline.

Am I reading the code correctly?



 Comments   
Comment by Amir Shehata (Inactive) [ 26/Nov/19 ]

If you look at retry_count_set() and transaction_to_set(), you'll see that they set lnet_lnd_timeout based on the following calculation:

retry_count_set()
    if (value == 0)
        lnet_lnd_timeout = lnet_transaction_timeout;
    else
        lnet_lnd_timeout = lnet_transaction_timeout / value;

transaction_to_set()
    *transaction_to = value;
    if (lnet_retry_count == 0)
        lnet_lnd_timeout = value;
    else
        lnet_lnd_timeout = value / lnet_retry_count;

I checked master and if you specify

 options lnet lnet_transaction_timeout=150

transaction_to_set() gets called:

 (api-ni.c:442:transaction_to_set()) lnet_lnd_timeout = 50

# it's 50 because retry_count=3
# however if you do this:
options lnet lnet_retry_count=0
options lnet lnet_transaction_timeout=150

# you should see:
(api-ni.c:442:transaction_to_set()) lnet_lnd_timeout = 50
(api-ni.c:489:retry_count_set()) lnet_lnd_timeout = 150

# note the above is debug code I added.
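The interaction of the two setters can be reproduced with a small standalone sketch. It copies the arithmetic quoted above into userspace; the struct sketch_lnet and the sketch_ function names are hypothetical, and storing the new retry count inside sketch_retry_count_set() is an assumption, since the quoted excerpt only shows the lnet_lnd_timeout assignment.

```c
#include <assert.h>

/* Hypothetical container for the three values discussed in this ticket. */
struct sketch_lnet {
    int transaction_timeout;  /* lnet_transaction_timeout, seconds */
    int retry_count;          /* lnet_retry_count */
    int lnd_timeout;          /* derived lnet_lnd_timeout, seconds */
};

/* Same arithmetic as the quoted transaction_to_set() excerpt. */
static void sketch_transaction_to_set(struct sketch_lnet *s, int value)
{
    assert(value > 0);
    s->transaction_timeout = value;
    if (s->retry_count == 0)
        s->lnd_timeout = value;
    else
        s->lnd_timeout = value / s->retry_count;
}

/* Same arithmetic as the quoted retry_count_set() excerpt.
 * Assumption: the setter also records the new retry count. */
static void sketch_retry_count_set(struct sketch_lnet *s, int value)
{
    assert(value >= 0);
    s->retry_count = value;
    if (value == 0)
        s->lnd_timeout = s->transaction_timeout;
    else
        s->lnd_timeout = s->transaction_timeout / value;
}
```

Starting from a retry count of 3, calling sketch_transaction_to_set() with 150 yields an LND timeout of 50; a following sketch_retry_count_set() with 0 raises it to 150. That matches the two debug lines above, where the transaction timeout is applied before the retry count.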

Is this not working for you?

Comment by Mahmoud Hanafi [ 26/Nov/19 ]

So

    #define LNET_LND_DEFAULT_TIMEOUT 5

is not used, because the lnet module sets it to 50 when it loads?

Our cluster is running 2.12.2 everywhere and we still get client evictions. Enabling 'debug=+net' makes the issue go away (ref: LU-11644).

The other issue: on our remote HDR cluster we are getting RDMA timeouts to the Lustre routers. Running lnet_selftest within the compute nodes, we also get timeouts. This may be a fabric issue, but something is not quite right.
I'll run some tests changing lnet_transaction_timeout to see if it makes a difference.

Comment by Mahmoud Hanafi [ 27/Nov/19 ]

The ko2iblnd timeout parameter should be removed and the docs should be updated. This would eliminate some of the confusion.

Comment by Amir Shehata (Inactive) [ 27/Nov/19 ]

Did increasing the timeout resolve the client evictions issue (ref: LU-11644)?

Comment by Mahmoud Hanafi [ 02/Dec/19 ]

I ended up setting

    transaction_timeout: 200
    at_min=275

Things look OK for now. I'll know more in a few days.
I tried setting retry_count=2 with transaction_timeout=200, and it caused a lot of timeouts, worse than with transaction_timeout=100 and retry_count=0.

Comment by Amir Shehata (Inactive) [ 02/Dec/19 ]

Would it be possible to get some net logging around the timeouts that occur with:

lnet_transaction_timeout=200
lnet_retry_count=2 

It would be very useful for me to understand the cause of these timeouts, so I can fine tune the feature if required.

The idea with this config is that the LND timeout will be set to 100 (200 / 2). We'll attempt 2 retries within that 200s window.

I'd like to see whether we do attempt the retry, for what reason, and what the implications of the retries are.

Comment by Gerrit Updater [ 05/Dec/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36944
Subject: LU-13020 o2iblnd: timeout is now obsolete
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d6cbd02ff5d17b0cf53683e1c8db8364850ac4cc

Comment by Mahmoud Hanafi [ 10/Dec/19 ]

I will try to reproduce the issue with retry_count=2 on our test filesystem and gather logs.

Comment by Mahmoud Hanafi [ 06/Jan/21 ]

We can close this.
