
Refine which network errors result in LNet Health activity

Details

    • Improvement
    • Resolution: Fixed
    • Major
    • Lustre 2.14.0

    Description

      There is a category of errors, such as failure to resolve an address or route, that shouldn't result in the health of the remote or local NI being decremented or recovered. Errors in this category indicate that the remote address does not exist or is unreachable.

      Rather than ignore these errors, we decided that with the enhancement in https://jira.whamcloud.com/browse/LU-13569 the LND should instead return LNET_MSG_STATUS_NETWORK_TIMEOUT to LNet so that the health of both the local NI and the remote NI is decremented. This way, if the problem really is with the remote NI, that is reflected in the remote NI's health value and can be accounted for on future sends. With LU-13569 we don't run the risk of forever recovering a remote NI that will never be returned to service.
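
      The intended effect of each timeout status on NI health can be illustrated with a small userspace model. This is only a sketch: the `sketch_` types, the status names, the starting health of 1000, and the decrement of 100 are all assumptions made for illustration and are not the LNet definitions or the body of lnet_health_check().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the health statuses discussed in this
 * ticket. These are NOT the LNet enum values. */
enum sketch_status {
	SKETCH_STATUS_OK,
	SKETCH_STATUS_LOCAL_TIMEOUT,
	SKETCH_STATUS_REMOTE_TIMEOUT,
	SKETCH_STATUS_NETWORK_TIMEOUT,
};

/* Toy network interface: only a health value. */
struct sketch_ni {
	int health;
};

/* Assumed decrement per failure; the real value is tunable. */
#define SKETCH_SENSITIVITY 100

/* Lower an NI's health, clamping at zero. */
static void sketch_dec_health(struct sketch_ni *ni)
{
	assert(ni != NULL);
	ni->health -= SKETCH_SENSITIVITY;
	if (ni->health < 0)
		ni->health = 0;
}

/* Decide which side(s) to penalise for a given status.
 * NETWORK_TIMEOUT penalises both, since the fault could lie
 * with either the local or the remote NI. */
static void sketch_health_check(enum sketch_status status,
				struct sketch_ni *local,
				struct sketch_ni *remote)
{
	switch (status) {
	case SKETCH_STATUS_LOCAL_TIMEOUT:
		sketch_dec_health(local);
		break;
	case SKETCH_STATUS_REMOTE_TIMEOUT:
		sketch_dec_health(remote);
		break;
	case SKETCH_STATUS_NETWORK_TIMEOUT:
		sketch_dec_health(local);
		sketch_dec_health(remote);
		break;
	case SKETCH_STATUS_OK:
	default:
		break;
	}
}
```

      In this model, a single NETWORK_TIMEOUT lowers both health values by the same amount, while LOCAL_TIMEOUT and REMOTE_TIMEOUT each touch only one side.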

      Related to this, we decided that the LOCAL_TIMEOUT returned in the kiblnd_check_conns() path should also be NETWORK_TIMEOUT:

      kiblnd_check_conns()
      ...
                      /* Check tx_deadline */
                      list_for_each_entry_safe(tx, tx_tmp, &peer_ni->ibp_tx_queue, tx_list) {
                              if (ktime_compare(ktime_get(), tx->tx_deadline) >= 0) {
                                      CWARN("Timed out tx for %s: %lld seconds\n",
                                            libcfs_nid2str(peer_ni->ibp_nid),
                                            ktime_ms_delta(ktime_get(),
                                                           tx->tx_deadline) / MSEC_PER_SEC);
                                      list_move(&tx->tx_list, &timedout_txs);
                              }
                      }
      ...
              if (!list_empty(&timedout_txs))
                      kiblnd_txlist_done(&timedout_txs, -ETIMEDOUT,
                                         LNET_MSG_STATUS_LOCAL_TIMEOUT);
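
      The deadline sweep in the excerpt above can be modelled in userspace as follows. This is a hedged sketch, not the o2iblnd code: `struct sketch_tx`, `sketch_expire_txs()`, and the millisecond timestamps are hypothetical, and an array stands in for the kernel list. The health status is passed in as a parameter so the proposed switch from LOCAL_TIMEOUT to NETWORK_TIMEOUT is a one-argument change at the call site.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical health statuses; not the LNet enum. */
enum sketch_tx_hstatus {
	SKETCH_TX_OK,
	SKETCH_TX_LOCAL_TIMEOUT,
	SKETCH_TX_NETWORK_TIMEOUT,
};

/* Toy transmit descriptor. */
struct sketch_tx {
	int64_t deadline_ms;		/* absolute deadline */
	int done;			/* non-zero once completed */
	int rc;				/* completion code */
	enum sketch_tx_hstatus hstatus;	/* status reported upward */
};

/* Complete every pending tx whose deadline has passed, tagging it
 * with -ETIMEDOUT and the caller-chosen health status.
 * Returns the number of txs that were expired. */
static size_t sketch_expire_txs(struct sketch_tx *txs, size_t n,
				int64_t now_ms,
				enum sketch_tx_hstatus hstatus)
{
	size_t expired = 0;

	assert(txs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (txs[i].done)
			continue;
		/* Same ">= deadline" test as the excerpt. */
		if (now_ms >= txs[i].deadline_ms) {
			txs[i].done = 1;
			txs[i].rc = -ETIMEDOUT;
			txs[i].hstatus = hstatus;
			expired++;
		}
	}
	return expired;
}
```

      Calling `sketch_expire_txs(txs, n, now, SKETCH_TX_NETWORK_TIMEOUT)` corresponds to the behavior this ticket proposes for queued txs that never got a connection.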
      

      So for this ticket I plan to push three patches:
      1. Modify lnet_health_check() so that NETWORK_TIMEOUT decrements both local and remote NI health (this was the original design intent).
      2. Modify kiblnd_check_conns() so that it returns NETWORK_TIMEOUT rather than LOCAL_TIMEOUT.
      3. Modify the status for unresolvable address or route to return NETWORK_TIMEOUT.

      Patch 3 probably needs to be based on top of the patches for LU-13569.
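
      The third patch amounts to a mapping from connection-failure error codes to a health status. A minimal sketch of that mapping follows, with loud caveats: the function name, the enum, the fallback status, and in particular the choice of which errno values count as "unresolvable address or route" are assumptions for illustration. The ticket text does not list them.

```c
#include <errno.h>

/* Hypothetical statuses; not the LNet enum. */
enum sketch_conn_hstatus {
	SKETCH_CONN_LOCAL_DROPPED,
	SKETCH_CONN_NETWORK_TIMEOUT,
};

/* Map a negative errno from a failed connection attempt to the
 * health status reported upward. Errors suggesting the peer is
 * unreachable are treated as a network timeout so that both the
 * local and remote NI are penalised. The errno set below is an
 * assumption, not taken from the ticket. */
static enum sketch_conn_hstatus sketch_conn_failure_status(int error)
{
	switch (error) {
	case -EHOSTUNREACH:
	case -ENETUNREACH:
	case -ETIMEDOUT:
		return SKETCH_CONN_NETWORK_TIMEOUT;
	default:
		return SKETCH_CONN_LOCAL_DROPPED;
	}
}
```

      Keeping the mapping in one small function makes it easy to audit which failures feed into remote NI health and which stay local.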


        Activity

          [LU-13571] Refine which network errors result in LNet Health activity
          pjones Peter Jones added a comment -

          Landed for 2.14


          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39900/
          Subject: LU-13571 lnd: Use NETWORK_TIMEOUT for some conn failures
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 12333c1fecc00ed67597f189715a68cbfea7b287


          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39899/
          Subject: LU-13571 lnd: Use NETWORK_TIMEOUT for txs on ibp_tx_queue
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 7af63191370fd2337d0bc9045d211b918c61fdd1


          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39898/
          Subject: LU-13571 lnet: Correct handling of NETWORK_TIMEOUT status
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: ffd4523f2d50ef952112f44ffd524af991b4baed


          Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39965
          Subject: LU-13571 tests: Debug sanity-lnet test 210
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 36ebf465fb51cdc4fe461a091290fb5dd836c688


          Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39900
          Subject: LU-13571 lnd: Use NETWORK_TIMEOUT for some conn failures
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: ff01a4476feb652757fcccda036b52581bad15dd


          Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39899
          Subject: LU-13571 lnd: Use NETWORK_TIMEOUT for txs on ibp_tx_queue
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 86905008c043bdbdaadde393d68af1906c93ae22


          Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39901
          Subject: LU-13571 tests: Test health and resends for network timeout
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 7e6c5ec549010f3e7afa72cfdf690395b5d76e32


          Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39898
          Subject: LU-13571 lnet: Correct handling of NETWORK_TIMEOUT status
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: d4a6d8ea328c7b112f0f99027f8acac0c0cf78d5


          People

            hornc Chris Horn