Description
The global tunable "retry_count" depends on the global tunable "transaction_timeout": retry_count must be less than or equal to transaction_timeout. I noticed that when lnetctl import is used to configure LNet, retry_count sometimes fails to be set because the requested value violates this constraint.
Here's the lnet.conf:
sles15build01:~ # cat /tmp/lnet.conf
net:
    - net type: tcp
      local NI(s):
        - interfaces:
              0: eth0
        - interfaces:
              0: eth1
route:
    - net: o2ib
      gateway: 192.168.2.24@tcp
global:
    health_sensitivity: 70
    transaction_timeout: 70
    retry_count: 70
    recovery_interval: 70
    router_sensitivity: 70
Here are the module parameter values before import:
sles15build01:/sys/module/lnet/parameters # lnetctl lnet unconfigure; lustre_rmmod; modprobe lnet; lnetctl lnet configure
sles15build01:/sys/module/lnet/parameters # cd $PWD; for i in lnet_health_sensitivity lnet_recovery_interval lnet_retry_count lnet_transaction_timeout router_sensitivity_percentage; do echo "$i: $(cat $i)"; done
lnet_health_sensitivity: 1
lnet_recovery_interval: 1
lnet_retry_count: 3
lnet_transaction_timeout: 10
router_sensitivity_percentage: 100
And here are the values after import. Note that lnet_retry_count is unchanged:
sles15build01:/sys/module/lnet/parameters # lnetctl import /tmp/lnet.conf
sles15build01:/sys/module/lnet/parameters # cd $PWD; for i in lnet_health_sensitivity lnet_recovery_interval lnet_retry_count lnet_transaction_timeout router_sensitivity_percentage; do echo "$i: $(cat $i)"; done
lnet_health_sensitivity: 70
lnet_recovery_interval: 70
lnet_retry_count: 3
lnet_transaction_timeout: 70
router_sensitivity_percentage: 70
sles15build01:/sys/module/lnet/parameters #
And the following is logged to dmesg:
[257406.875289] LNetError: 11708:0:(api-ni.c:513:retry_count_set()) Invalid value for lnet_retry_count (70). Has to be smaller than lnet_transaction_timeout (10)
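The message reports lnet_transaction_timeout as 10, its value before the import, which suggests that retry_count was applied before the new transaction_timeout took effect. Note also that although the error message says "Has to be smaller", the code actually accepts any value less than or equal to lnet_transaction_timeout.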
static int
retry_count_set(const char *val, cfs_kernel_param_arg_t *kp)
{
...
	if (value > lnet_transaction_timeout) {
		mutex_unlock(&the_lnet.ln_api_mutex);
		CERROR("Invalid value for lnet_retry_count (%lu). "
		       "Has to be smaller than lnet_transaction_timeout (%u)\n",
		       value, lnet_transaction_timeout);
		return -EINVAL;
	}
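To illustrate the ordering dependency, here is a minimal user-space sketch. It is not the LNet or lnetctl code. The variables, the struct globals type, and the functions set_retry_count(), set_transaction_timeout(), and apply_globals() are all hypothetical stand-ins, and the timeout setter performs no validation, which is a simplifying assumption. The sketch only shows that applying transaction_timeout before retry_count lets the check quoted above pass.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for the module parameters, using the
 * pre-import values shown in the report. */
static unsigned int transaction_timeout = 10;
static unsigned int retry_count = 3;

/* Mirrors the quoted check: reject values greater than the
 * current transaction_timeout; equal values are accepted. */
static int set_retry_count(unsigned int value)
{
	if (value > transaction_timeout) {
		fprintf(stderr,
			"Invalid retry_count (%u): must be less than or "
			"equal to transaction_timeout (%u)\n",
			value, transaction_timeout);
		return -EINVAL;
	}
	retry_count = value;
	return 0;
}

/* Assumption: no validation here; the real setter is not shown
 * in the report and may enforce its own constraints. */
static int set_transaction_timeout(unsigned int value)
{
	transaction_timeout = value;
	return 0;
}

/* Hypothetical container for the two dependent tunables. */
struct globals {
	unsigned int transaction_timeout;
	unsigned int retry_count;
};

/* Apply the timeout first so that retry_count is validated
 * against the new limit rather than the old one. */
static int apply_globals(const struct globals *g)
{
	int rc = set_transaction_timeout(g->transaction_timeout);

	if (rc)
		return rc;
	return set_retry_count(g->retry_count);
}
```

With the defaults above, calling set_retry_count(70) on its own is rejected with -EINVAL, which matches the behaviour in the report, whereas apply_globals() with both values set to 70 succeeds because the timeout is raised first.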