Lustre / LU-13571

Refine which network errors result in LNet Health activity

Details

    • Improvement
    • Resolution: Fixed
    • Major
    • Lustre 2.14.0

    Description

      There is a category of errors, such as failure to resolve an address or route, that shouldn't result in the health of the remote or local NI being decremented or recovered. Errors in this category indicate that the remote address does not exist or is unreachable.

      Rather than ignore these errors, we decided that with the enhancement in https://jira.whamcloud.com/browse/LU-13569 the LND should instead return LNET_MSG_STATUS_NETWORK_TIMEOUT to LNet so that the health of both the local NI and the remote NI is decremented. This way, if the problem really is with the remote NI, that is reflected in the remote NI's health value and can be accounted for on future sends. With LU-13569 we don't run the risk of forever recovering a remote NI that will never be returned to service.
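
      As a rough illustration of the mapping described above, here is a minimal user-space sketch. Every name in it (the sketch_* types, the SKETCH_* values, and the helper function) is hypothetical and exists only for this example; none of it is the real LNet or o2iblnd code. It assumes that any other connection failure is reported as a local error, which is an assumption made purely for contrast.

```c
/* Hypothetical stand-ins for this sketch only; not real LNet/LND types. */
enum sketch_conn_error {
	SKETCH_CONN_OK,
	SKETCH_CONN_ADDR_UNRESOLVED,	/* peer address could not be resolved */
	SKETCH_CONN_ROUTE_UNRESOLVED,	/* route to the peer could not be resolved */
	SKETCH_CONN_OTHER_FAILURE,	/* any other failure (assumed local here) */
};

enum sketch_health_status {
	SKETCH_HS_OK,
	SKETCH_HS_LOCAL_ERROR,
	SKETCH_HS_NETWORK_TIMEOUT,
};

/*
 * Map a connection-setup failure to the status the LND would hand to LNet.
 * Unresolvable address/route errors are reported as a network timeout
 * instead of being ignored.
 */
static enum sketch_health_status
sketch_conn_error_to_status(enum sketch_conn_error err)
{
	switch (err) {
	case SKETCH_CONN_OK:
		return SKETCH_HS_OK;
	case SKETCH_CONN_ADDR_UNRESOLVED:
	case SKETCH_CONN_ROUTE_UNRESOLVED:
		return SKETCH_HS_NETWORK_TIMEOUT;
	default:
		/* Assumption for this sketch: everything else is local. */
		return SKETCH_HS_LOCAL_ERROR;
	}
}
```

      The point of the sketch is only that the two "unreachable peer" cases share one status, so the health code can treat them the same way as any other network timeout.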

      Related to this, we decided that the LOCAL_TIMEOUT returned in the kiblnd_check_conns() path should also be NETWORK_TIMEOUT:

      kiblnd_check_conns()
      ...
                      /* Check tx_deadline */
                      list_for_each_entry_safe(tx, tx_tmp, &peer_ni->ibp_tx_queue, tx_list) {
                              if (ktime_compare(ktime_get(), tx->tx_deadline) >= 0) {
                                      CWARN("Timed out tx for %s: %lld seconds\n",
                                            libcfs_nid2str(peer_ni->ibp_nid),
                                            ktime_ms_delta(ktime_get(),
                                                           tx->tx_deadline) / MSEC_PER_SEC);
                                      list_move(&tx->tx_list, &timedout_txs);
                              }
                      }
      ...
              if (!list_empty(&timedout_txs))
                      kiblnd_txlist_done(&timedout_txs, -ETIMEDOUT,
                                         LNET_MSG_STATUS_LOCAL_TIMEOUT);
      
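      The intended behaviour of the deadline check after this change can be modelled in user space as follows. This is a simplified sketch, not the o2iblnd code: it uses a plain array instead of the kernel list, millisecond integers instead of ktime_t, and hypothetical sketch_* / SKETCH_TX_* names throughout.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status values for this sketch only. */
enum sketch_tx_status {
	SKETCH_TX_PENDING,
	SKETCH_TX_LOCAL_TIMEOUT,	/* what the excerpt above reports today */
	SKETCH_TX_NETWORK_TIMEOUT,	/* what the proposed change would report */
};

struct sketch_tx {
	int64_t deadline_ms;		/* absolute deadline */
	int rc;				/* 0 while pending, negative errno once failed */
	enum sketch_tx_status status;
};

/*
 * Fail every pending tx whose deadline has passed, reporting it as a
 * network timeout. Returns the number of txs that were timed out.
 */
static size_t
sketch_expire_txs(struct sketch_tx *txs, size_t count, int64_t now_ms)
{
	size_t expired = 0;

	for (size_t i = 0; i < count; i++) {
		if (txs[i].rc != 0)
			continue;	/* already completed */
		if (now_ms >= txs[i].deadline_ms) {
			txs[i].rc = -ETIMEDOUT;
			txs[i].status = SKETCH_TX_NETWORK_TIMEOUT;
			expired++;
		}
	}
	return expired;
}
```

      As in the excerpt, a tx counts as timed out when the current time is at or past its deadline; the only difference modelled here is the status attached to the failure.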

      So for this ticket I plan to push three patches:
      1. Modify lnet_health_check() so that NETWORK_TIMEOUT dings both local and remote NI health (this was the original design intent).
      2. Modify kiblnd_check_conns() so that it returns NETWORK_TIMEOUT rather than LOCAL_TIMEOUT.
      3. Modify the status for unresolvable address or route to return NETWORK_TIMEOUT.

      Patch 3 probably needs to be based on top of the patches for LU-13569.
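
      The health accounting proposed in patch 1 can be sketched as a small user-space model. This is not lnet_health_check() itself: the sketch_* types, the SKETCH_HC_* status names, and the two numeric constants are all hypothetical and chosen only for illustration. The handling of the local-only and remote-only timeout cases is likewise an assumption added for contrast.

```c
/* Hypothetical values for this sketch only. */
#define SKETCH_HEALTH_MAX	1000
#define SKETCH_HEALTH_STEP	100

enum sketch_hc_status {
	SKETCH_HC_OK,
	SKETCH_HC_LOCAL_TIMEOUT,
	SKETCH_HC_REMOTE_TIMEOUT,
	SKETCH_HC_NETWORK_TIMEOUT,
};

struct sketch_ni {
	int health;	/* 0 .. SKETCH_HEALTH_MAX */
};

/* Decrement an NI's health by one step, never going below zero. */
static void
sketch_dec_health(struct sketch_ni *ni)
{
	ni->health -= SKETCH_HEALTH_STEP;
	if (ni->health < 0)
		ni->health = 0;
}

/*
 * Decide whose health is decremented for a given message status.
 * A network timeout decrements both the local and the remote NI.
 */
static void
sketch_health_check(struct sketch_ni *local, struct sketch_ni *remote,
		    enum sketch_hc_status status)
{
	switch (status) {
	case SKETCH_HC_LOCAL_TIMEOUT:
		sketch_dec_health(local);
		break;
	case SKETCH_HC_REMOTE_TIMEOUT:
		sketch_dec_health(remote);
		break;
	case SKETCH_HC_NETWORK_TIMEOUT:
		sketch_dec_health(local);
		sketch_dec_health(remote);
		break;
	case SKETCH_HC_OK:
	default:
		break;	/* no health change */
	}
}
```

      With both NIs starting at SKETCH_HEALTH_MAX, a single network timeout leaves each at 900 in this model, whereas a local timeout would only lower the local NI.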

          People

            hornc Chris Horn