[LU-13020] ko2iblnd tuning Created: 26/Nov/19 Updated: 06/Jan/21 Resolved: 06/Jan/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Severity: | 2 |
| Description |
|
We have been setting the ko2iblnd timeout to 150 (the default is 50) for our cluster. From reading the code, this parameter is no longer used; lnet_lnd_timeout is used instead. For example, in kiblnd_queue_tx_locked():
	timeout_ns = lnet_get_lnd_timeout() * NSEC_PER_SEC;
	tx->tx_queued = 1;
	tx->tx_deadline = ktime_add_ns(ktime_get(), timeout_ns);
The documentation says that lnet_lnd_timeout is derived from lnet_transaction_timeout and retry_count, but that derived value does not appear to be what gets set for tx->tx_deadline. Am I reading the code correctly? |
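The deadline arithmetic in the snippet above can be illustrated in userspace. This is a minimal sketch, not the ko2iblnd implementation: `tx_deadline_ns()` and `tx_expired()` are hypothetical helper names, the caller is assumed to supply a monotonic clock reading in nanoseconds (standing in for `ktime_get()`), and the timeout is assumed to be in whole seconds, as the multiplication by `NSEC_PER_SEC` in the quoted code suggests.

```c
#include <stdint.h>

/* Same meaning as the kernel constant of this name. */
#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: absolute deadline in nanoseconds, computed as
 * "now" plus the LND timeout converted from seconds to nanoseconds.
 * The cast happens before the multiply so the product cannot
 * overflow a 32-bit intermediate.
 */
uint64_t tx_deadline_ns(uint64_t now_ns, unsigned int lnd_timeout_s)
{
	return now_ns + (uint64_t)lnd_timeout_s * NSEC_PER_SEC;
}

/*
 * Hypothetical helper: a transmit is treated as timed out once the
 * clock reaches its deadline (an assumption about the comparison).
 */
int tx_expired(uint64_t now_ns, uint64_t deadline_ns)
{
	return now_ns >= deadline_ns;
}
```

With an LND timeout of 50 the deadline lies 50 * 10^9 ns after the current clock reading, so a ko2iblnd timeout of 150 would have no effect on it if only lnet_lnd_timeout feeds this calculation.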
| Comments |
| Comment by Amir Shehata (Inactive) [ 26/Nov/19 ] |
|
If you look at retry_count_set() and transaction_to_set(), you'll see that they set lnet_lnd_timeout based on the following calculations:

retry_count_set()
	if (value == 0)
		lnet_lnd_timeout = lnet_transaction_timeout;
	else
		lnet_lnd_timeout = lnet_transaction_timeout / value;

transaction_to_set()
	*transaction_to = value;
	if (lnet_retry_count == 0)
		lnet_lnd_timeout = value;
	else
		lnet_lnd_timeout = value / lnet_retry_count;

I checked master, and if you specify
	options lnet lnet_transaction_timeout=150
transaction_to_set() gets called:
	(api-ni.c:442:transaction_to_set()) lnet_lnd_timeout = 50
	# it's 50 because retry_count=3

	# however if you do this:
	options lnet lnet_retry_count=0
	options lnet lnet_transaction_timeout=150
	# you should see:
	(api-ni.c:442:transaction_to_set()) lnet_lnd_timeout = 50
	(api-ni.c:489:retry_count_set()) lnet_lnd_timeout = 150
	# note the above is debug code I added.

Is this not working for you? |
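Both setters quoted above apply the same rule, which can be written as one small function for checking values by hand. This is only a sketch of the arithmetic: `derive_lnd_timeout()` is a hypothetical name, not a function in LNet, and it ignores the module-parameter plumbing around the real setters.

```c
/*
 * Hypothetical helper mirroring the rule in retry_count_set() and
 * transaction_to_set(): with retries disabled the LND timeout equals
 * the transaction timeout, otherwise the transaction timeout is
 * split evenly across the retries using integer division.
 */
unsigned int derive_lnd_timeout(unsigned int transaction_timeout,
				unsigned int retry_count)
{
	if (retry_count == 0)
		return transaction_timeout;
	return transaction_timeout / retry_count;
}
```

Calling it with (150, 3) gives 50 and with (150, 0) gives 150, matching the debug output above. Because the division is integral, a combination such as (200, 3) yields 66 rather than 66.67, so the retries together cover slightly less than the full transaction window.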
| Comment by Mahmoud Hanafi [ 26/Nov/19 ] |
|
So #define LNET_LND_DEFAULT_TIMEOUT 5 is not used? Because when the lnet module loads it sets it to 50? Our cluster is running 2.12.2 everywhere and we still get client evictions. Enabling 'debug=+net' makes the issue go away (ref: LU-11644). The other issue: on our remote HDR cluster we are getting RDMA timeouts to the Lustre routers. Running lnet_selftest within the compute nodes, we also get timeouts. This may be a fabric issue, but something is not quite right... |
| Comment by Mahmoud Hanafi [ 27/Nov/19 ] |
|
The ko2iblnd timeout parameter should be removed and the docs should be updated. This would eliminate some of the confusion. |
| Comment by Amir Shehata (Inactive) [ 27/Nov/19 ] |
|
Did increasing the timeout resolve the client evictions issue (ref: LU-11644)? |
| Comment by Mahmoud Hanafi [ 02/Dec/19 ] |
|
I ended up setting
transaction_timeout: 200
at_min=275
Things look ok for now. I'll know more in a few days. |
| Comment by Amir Shehata (Inactive) [ 02/Dec/19 ] |
|
Would it be possible to get some net logging around the timeouts that occur with:
	lnet_transaction_timeout=200
	lnet_retry_count=2
It would be very useful for me to understand the cause of these timeouts, so I can fine-tune the feature if required. The idea with this config is that the LND timeout will be set to 100, so we'll attempt 2 retries within that 200s window. I'd like to see whether we do attempt the retry, for what reason, and what the implications of the retries are. |
| Comment by Gerrit Updater [ 05/Dec/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36944 |
| Comment by Mahmoud Hanafi [ 10/Dec/19 ] |
|
I will try to reproduce the issue with retry_count=2 on our test filesystem and gather logs. |
| Comment by Mahmoud Hanafi [ 06/Jan/21 ] |
|
We can close this |