[LU-12041] Fail to set global value with lnetctl import

Created: 04/Mar/19
Updated: 16/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Chris Horn
Assignee: Amir Shehata (Inactive)
Resolution: Unresolved
Votes: 0
Labels: lnet, medium

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

The global tunable "retry_count" depends on the global tunable "transaction_timeout": retry_count must be less than or equal to transaction_timeout. I noticed that when using lnetctl import to configure LNet, retry_count sometimes fails to be set because the new value is checked against the transaction_timeout in effect at that moment.

Here's the lnet.conf:

sles15build01:~ # cat /tmp/lnet.conf
net:
    - net type: tcp
      local NI(s):
        - interfaces:
              0: eth0
        - interfaces:
              0: eth1
route:
    - net: o2ib
      gateway: 192.168.2.24@tcp
global:
    health_sensitivity: 70
    transaction_timeout: 70
    retry_count: 70
    recovery_interval: 70
    router_sensitivity: 70

Here are the module parameter values before import:

sles15build01:/sys/module/lnet/parameters # lnetctl lnet unconfigure; lustre_rmmod; modprobe lnet; lnetctl lnet configure
sles15build01:/sys/module/lnet/parameters # cd $PWD; for i in lnet_health_sensitivity lnet_recovery_interval lnet_retry_count lnet_transaction_timeout router_sensitivity_percentage; do echo "$i: $(cat $i)"; done
lnet_health_sensitivity: 1
lnet_recovery_interval: 1
lnet_retry_count: 3
lnet_transaction_timeout: 10
router_sensitivity_percentage: 100

And here are the values after import. Note that lnet_retry_count is unchanged:

sles15build01:/sys/module/lnet/parameters # lnetctl import /tmp/lnet.conf
sles15build01:/sys/module/lnet/parameters # cd $PWD; for i in lnet_health_sensitivity lnet_recovery_interval lnet_retry_count lnet_transaction_timeout router_sensitivity_percentage; do echo "$i: $(cat $i)"; done
lnet_health_sensitivity: 70
lnet_recovery_interval: 70
lnet_retry_count: 3
lnet_transaction_timeout: 70
router_sensitivity_percentage: 70

And the following is logged to dmesg:

[257406.875289] LNetError: 11708:0:(api-ni.c:513:retry_count_set()) Invalid value for lnet_retry_count (70). Has to be smaller than lnet_transaction_timeout (10)

Note that while the error message says "Has to be smaller", the code actually allows values less than or equal.

static int
retry_count_set(const char *val, cfs_kernel_param_arg_t *kp)
{
...
    if (value > lnet_transaction_timeout) {
        mutex_unlock(&the_lnet.ln_api_mutex);
        CERROR("Invalid value for lnet_retry_count (%lu). "
               "Has to be smaller than lnet_transaction_timeout (%u)\n",
               value, lnet_transaction_timeout);
        return -EINVAL;
    }
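
The boundary behaviour can be illustrated with a small user-space model of the comparison quoted above. This is a Python sketch, not the kernel code: the function name retry_count_allowed is hypothetical, and only the test "value > lnet_transaction_timeout" is taken from the excerpt (anything in the elided part of retry_count_set() is not modelled).

```python
def retry_count_allowed(value: int, transaction_timeout: int) -> bool:
    """Model the quoted check: reject only when value > transaction_timeout."""
    return not (value > transaction_timeout)


# Values from this report: the import tried retry_count=70 while
# lnet_transaction_timeout was still at 10, so the write was rejected.
print(retry_count_allowed(70, 10))  # False
# Equal values pass, even though the error text says "smaller than".
print(retry_count_allowed(70, 70))  # True
print(retry_count_allowed(3, 10))   # True
```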


Comments
Comment by Chris Horn [ 04/Mar/19 ]

As an aside, I think it would be better if these variables had consistent naming between the yaml and the actual module parameters.

YAML Name             Mod Param Name
health_sensitivity    lnet_health_sensitivity
transaction_timeout   lnet_transaction_timeout
retry_count           lnet_retry_count
recovery_interval     lnet_recovery_interval
router_sensitivity    router_sensitivity_percentage
numa_range            lnet_numa_range
max_intf              lnet_interfaces_max
discovery             lnet_peer_discovery_disabled (has an inverse relationship!?)
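
For illustration, the mapping above can be written as a lookup table and used to spot values that did not take effect after an import. This is a Python sketch under stated assumptions: YAML_TO_MODPARAM and mismatches are hypothetical names, "discovery" is left out because of its inverse relationship, and the sample values are the ones shown in the description.

```python
# YAML name -> module parameter name, copied from the table above.
YAML_TO_MODPARAM = {
    "health_sensitivity": "lnet_health_sensitivity",
    "transaction_timeout": "lnet_transaction_timeout",
    "retry_count": "lnet_retry_count",
    "recovery_interval": "lnet_recovery_interval",
    "router_sensitivity": "router_sensitivity_percentage",
    "numa_range": "lnet_numa_range",
    "max_intf": "lnet_interfaces_max",
}


def mismatches(global_cfg: dict, modparams: dict) -> dict:
    """Return {yaml_name: (wanted, actual)} for values that were not applied."""
    out = {}
    for name, wanted in global_cfg.items():
        actual = modparams.get(YAML_TO_MODPARAM[name])
        if actual != wanted:
            out[name] = (wanted, actual)
    return out


# "global:" section of /tmp/lnet.conf from the description.
wanted = {
    "health_sensitivity": 70,
    "transaction_timeout": 70,
    "retry_count": 70,
    "recovery_interval": 70,
    "router_sensitivity": 70,
}
# Module parameter values observed after "lnetctl import".
after_import = {
    "lnet_health_sensitivity": 70,
    "lnet_recovery_interval": 70,
    "lnet_retry_count": 3,
    "lnet_transaction_timeout": 70,
    "router_sensitivity_percentage": 70,
}
print(mismatches(wanted, after_import))  # {'retry_count': (70, 3)}
```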
Comment by Amir Shehata (Inactive) [ 05/Mar/19 ]

I see the problem. We should take the dependency between retry_count and transaction_timeout into account, so that a configuration done through YAML sets transaction_timeout first and then retry_count.
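
The ordering idea can be sketched as a small dependency-aware sort over the keys of the "global:" section. This is a hypothetical Python illustration, not the lnetctl implementation: DEPENDS_ON and apply_order are made-up names, and the only dependency edge that comes from this ticket is retry_count on transaction_timeout. The sample input deliberately lists retry_count before transaction_timeout.

```python
# Hypothetical dependency table: each key is applied after the keys it lists.
# Only the retry_count -> transaction_timeout edge comes from this ticket.
DEPENDS_ON = {"retry_count": ["transaction_timeout"]}


def apply_order(global_cfg: dict) -> list:
    """Return the keys of global_cfg ordered so that dependencies come first."""
    ordered = []

    def visit(key):
        # Skip keys already placed or not present in the config.
        if key in ordered or key not in global_cfg:
            return
        for dep in DEPENDS_ON.get(key, []):
            visit(dep)
        ordered.append(key)

    for key in global_cfg:
        visit(key)
    return ordered


cfg = {
    "health_sensitivity": 70,
    "retry_count": 70,
    "transaction_timeout": 70,
    "recovery_interval": 70,
    "router_sensitivity": 70,
}
print(apply_order(cfg))
# ['health_sensitivity', 'transaction_timeout', 'retry_count',
#  'recovery_interval', 'router_sensitivity']
```

The sketch has no cycle detection, which is acceptable for a single one-way dependency. It also covers only the case from this report, where both values are raised; lowering transaction_timeout below the current retry_count is not discussed in the ticket and is not handled here.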

Regarding the YAML vs. module parameter names, my only concern is that the module parameter names are longer and might be "too much" to type. If the consensus is that this is not a problem, then I have no objection to changing the names in YAML.

However, note that LNet Health is out in 2.12. I don't think it's widely used yet, so I'm not sure whether renaming would cause a backwards-compatibility problem.

 
