Lustre / LU-12041

Fail to set global value with lnetctl import

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • Lustre 2.12.0, Lustre 2.13.0

    Description

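      The global tunable "retry_count" depends on the global tunable "transaction_timeout": retry_count must be less than or equal to transaction_timeout. When using lnetctl import to configure LNet, retry_count can fail to be set because it is checked against the transaction_timeout value in effect at that moment (the old value, per the dmesg output below) rather than the value given in the file being imported.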

      Here's the lnet.conf:

      sles15build01:~ # cat /tmp/lnet.conf
      net:
          - net type: tcp
            local NI(s):
              - interfaces:
                    0: eth0
              - interfaces:
                    0: eth1
      route:
          - net: o2ib
            gateway: 192.168.2.24@tcp
      global:
          health_sensitivity: 70
          transaction_timeout: 70
          retry_count: 70
          recovery_interval: 70
          router_sensitivity: 70
      

      Here are the module parameter values before import:

      sles15build01:/sys/module/lnet/parameters # lnetctl lnet unconfigure; lustre_rmmod; modprobe lnet; lnetctl lnet configure
      sles15build01:/sys/module/lnet/parameters # cd $PWD; for i in lnet_health_sensitivity lnet_recovery_interval lnet_retry_count lnet_transaction_timeout router_sensitivity_percentage; do echo "$i: $(cat $i)"; done
      lnet_health_sensitivity: 1
      lnet_recovery_interval: 1
      lnet_retry_count: 3
      lnet_transaction_timeout: 10
      router_sensitivity_percentage: 100
      

      And here are the values after import. Note that lnet_retry_count is unchanged:

      sles15build01:/sys/module/lnet/parameters # lnetctl import /tmp/lnet.conf
      sles15build01:/sys/module/lnet/parameters # cd $PWD; for i in lnet_health_sensitivity lnet_recovery_interval lnet_retry_count lnet_transaction_timeout router_sensitivity_percentage; do echo "$i: $(cat $i)"; done
      lnet_health_sensitivity: 70
      lnet_recovery_interval: 70
      lnet_retry_count: 3
      lnet_transaction_timeout: 70
      router_sensitivity_percentage: 70
      sles15build01:/sys/module/lnet/parameters #
      

      And the following is logged to dmesg:

      [257406.875289] LNetError: 11708:0:(api-ni.c:513:retry_count_set()) Invalid value for lnet_retry_count (70). Has to be smaller than lnet_transaction_timeout (10)
      

      Note that while the error message says "Has to be smaller", the code actually allows values less than or equal.

      static int
      retry_count_set(const char *val, cfs_kernel_param_arg_t *kp)
      {
      ...
          if (value > lnet_transaction_timeout) {
              mutex_unlock(&the_lnet.ln_api_mutex);
              CERROR("Invalid value for lnet_retry_count (%lu). "
                     "Has to be smaller than lnet_transaction_timeout (%u)\n",
                     value, lnet_transaction_timeout);
              return -EINVAL;
          }
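
The comparison in the excerpt above can be exercised in isolation. What follows is a minimal userspace sketch, not the kernel code: `check_retry_count` and `retry_count_error` are hypothetical names introduced for illustration, and the "less than or equal to" wording is only a suggested correction, not the text that api-ni.c logs.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical userspace model of the bound enforced in retry_count_set():
 * a value is rejected only when it is strictly greater than the
 * transaction timeout, so equality is accepted. */
static int check_retry_count(unsigned long value,
			     unsigned int transaction_timeout)
{
	if (value > transaction_timeout)
		return -EINVAL;
	return 0;
}

/* Formats an error message whose wording matches the comparison above.
 * The wording is a suggestion, not the message LNet currently prints. */
static int retry_count_error(char *buf, size_t len, unsigned long value,
			     unsigned int transaction_timeout)
{
	return snprintf(buf, len,
			"Invalid value for lnet_retry_count (%lu). "
			"Has to be less than or equal to "
			"lnet_transaction_timeout (%u)",
			value, transaction_timeout);
}
```

With the values from the report, `check_retry_count(70, 10)` is rejected while `check_retry_count(70, 70)` is accepted, which is why the import would succeed if transaction_timeout were already 70 when retry_count is applied.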
      


        Activity


          ashehata Amir Shehata (Inactive) added a comment

          I see the problem there. We should take the dependency between retry_count and transaction_timeout into consideration, so that when you configure through YAML, transaction_timeout is configured first and retry_count second.

          Regarding the YAML vs Mod Param name, my only concern is that the module param names are longer and might be "too much" to type. If the consensus is that this is not a problem, then I have no objection to changing the names in YAML.

          However, note that LNet Health is out in 2.12. I don't think it's widely used yet, so I'm not sure whether backwards compatibility is going to be a problem.
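
The ordering proposed in this comment can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the lnetctl or LNet implementation: the struct and function names are hypothetical, the "naive" order (retry_count before transaction_timeout) is inferred from the dmesg output in the description, and the model deliberately ignores any validation the real transaction_timeout setter may perform.

```c
#include <errno.h>

/* Hypothetical model of the two dependent tunables. */
struct lnet_globals {
	unsigned int transaction_timeout;
	unsigned int retry_count;
};

/* Values requested by an imported configuration (hypothetical type). */
struct lnet_global_request {
	unsigned int transaction_timeout;
	unsigned int retry_count;
};

/* Mirrors the check quoted in the description: reject values strictly
 * greater than the current transaction timeout. */
static int set_retry_count(struct lnet_globals *g, unsigned int value)
{
	if (value > g->transaction_timeout)
		return -EINVAL;
	g->retry_count = value;
	return 0;
}

/* Assumption: no validation is modelled for the timeout itself. */
static int set_transaction_timeout(struct lnet_globals *g, unsigned int value)
{
	g->transaction_timeout = value;
	return 0;
}

/* Order inferred from the report: retry_count is checked against the
 * old timeout, fails, and only the timeout ends up updated. */
static int apply_naive(struct lnet_globals *g,
		       const struct lnet_global_request *req)
{
	int rc = set_retry_count(g, req->retry_count);

	set_transaction_timeout(g, req->transaction_timeout);
	return rc;
}

/* Proposed order: raise the timeout first, then set the retry count. */
static int apply_ordered(struct lnet_globals *g,
			 const struct lnet_global_request *req)
{
	int rc = set_transaction_timeout(g, req->transaction_timeout);

	if (rc)
		return rc;
	return set_retry_count(g, req->retry_count);
}
```

Starting from the defaults in the report (transaction_timeout 10, retry_count 3) and requesting 70 for both, `apply_naive` leaves retry_count at 3, matching the observed behaviour, while `apply_ordered` sets both to 70.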
          hornc Chris Horn added a comment - - edited

          As an aside, I think it would be better if these variables had consistent names in the YAML and in the actual module parameters.

          YAML Name             Mod Param Name
          health_sensitivity    lnet_health_sensitivity
          transaction_timeout   lnet_transaction_timeout
          retry_count           lnet_retry_count
          recovery_interval     lnet_recovery_interval
          router_sensitivity    router_sensitivity_percentage
          numa_range            lnet_numa_range
          max_intf              lnet_interfaces_max
          discovery             lnet_peer_discovery_disabled (has an inverse relationship!?)
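
The table above can be expressed as a lookup structure, which makes the one inverted entry explicit. This is an illustrative sketch only: `struct lnet_param_name`, `lookup_yaml_name`, and `yaml_to_mod_param_value` are hypothetical names, the name pairs are copied from the table, and treating "discovery" as a simple boolean inverse of lnet_peer_discovery_disabled is an assumption based on the note in the last row.

```c
#include <stddef.h>
#include <string.h>

/* One row of the YAML-name to module-parameter-name table. */
struct lnet_param_name {
	const char *yaml_name;
	const char *mod_param_name;
	int inverted;	/* non-zero when the YAML value is the logical inverse */
};

static const struct lnet_param_name lnet_param_names[] = {
	{ "health_sensitivity",  "lnet_health_sensitivity",       0 },
	{ "transaction_timeout", "lnet_transaction_timeout",      0 },
	{ "retry_count",         "lnet_retry_count",              0 },
	{ "recovery_interval",   "lnet_recovery_interval",        0 },
	{ "router_sensitivity",  "router_sensitivity_percentage", 0 },
	{ "numa_range",          "lnet_numa_range",               0 },
	{ "max_intf",            "lnet_interfaces_max",           0 },
	{ "discovery",           "lnet_peer_discovery_disabled",  1 },
};

/* Returns the table row for a YAML name, or NULL if it is unknown. */
static const struct lnet_param_name *lookup_yaml_name(const char *yaml_name)
{
	size_t i;
	size_t n = sizeof(lnet_param_names) / sizeof(lnet_param_names[0]);

	for (i = 0; i < n; i++) {
		if (strcmp(lnet_param_names[i].yaml_name, yaml_name) == 0)
			return &lnet_param_names[i];
	}
	return NULL;
}

/* Translates a YAML value to the module parameter value; inverted
 * entries are treated as booleans and flipped (assumption). */
static long yaml_to_mod_param_value(const struct lnet_param_name *p, long v)
{
	return p->inverted ? !v : v;
}
```

In this model, "discovery: 1" in YAML corresponds to lnet_peer_discovery_disabled being 0, which is the inconsistency the last table row points out.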
          

          People

            core-lustre-triage Core Lustre Triage
            hornc Chris Horn