[LU-13571] Refine which network errors result in LNet Health activity Created: 15/May/20  Updated: 23/Feb/21  Resolved: 03/Dec/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Major
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Epic Link: unlabelled-LU-13422

 Description   

There is a category of errors, such as being unable to resolve an address or route, which shouldn't result in the health of the remote or local NI being decremented or recovered. Errors in this category indicate that the remote address does not exist or is unreachable.

Rather than ignore these errors, we decided that, with the enhancement in https://jira.whamcloud.com/browse/LU-13569, the LND should instead return LNET_MSG_STATUS_NETWORK_TIMEOUT to LNet so that the health of both the local NI and the remote NI is decremented. This way, if the problem really is with the remote NI, that is reflected in the remote NI's health value and can be accounted for on future sends. With LU-13569 we don't run the risk of forever recovering a remote NI that will never be returned to service.

Related to this, we decided that the LOCAL_TIMEOUT returned in the kiblnd_check_conns() path should also be NETWORK_TIMEOUT:

kiblnd_check_conns()
...
                /* Check tx_deadline */
                list_for_each_entry_safe(tx, tx_tmp, &peer_ni->ibp_tx_queue, tx_list) {
                        if (ktime_compare(ktime_get(), tx->tx_deadline) >= 0) {
                                CWARN("Timed out tx for %s: %lld seconds\n",
                                      libcfs_nid2str(peer_ni->ibp_nid),
                                      ktime_ms_delta(ktime_get(),
                                                     tx->tx_deadline) / MSEC_PER_SEC);
                                list_move(&tx->tx_list, &timedout_txs);
                        }
                }
...
        if (!list_empty(&timedout_txs))
                kiblnd_txlist_done(&timedout_txs, -ETIMEDOUT,
                                   LNET_MSG_STATUS_LOCAL_TIMEOUT);
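
A minimal standalone sketch of the proposed change to the deadline check above, not the actual o2iblnd code. It assumes a plain array in place of the ibp_tx_queue list and integer milliseconds in place of ktime_t. Every name prefixed with sketch_ or SKETCH_ is hypothetical and exists only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the completion status of a tx. */
enum sketch_tx_status {
	SKETCH_TX_PENDING,
	SKETCH_TX_NETWORK_TIMEOUT,
};

/* Hypothetical, simplified tx: only a deadline and a status. */
struct sketch_tx {
	int64_t deadline_ms;
	enum sketch_tx_status status;
};

/*
 * Mark every pending tx whose deadline has passed as NETWORK_TIMEOUT
 * (rather than LOCAL_TIMEOUT) and return how many were marked.
 */
static size_t sketch_check_deadlines(struct sketch_tx *txs, size_t n,
				     int64_t now_ms)
{
	size_t timed_out = 0;

	for (size_t i = 0; i < n; i++) {
		if (txs[i].status == SKETCH_TX_PENDING &&
		    now_ms >= txs[i].deadline_ms) {
			txs[i].status = SKETCH_TX_NETWORK_TIMEOUT;
			timed_out++;
		}
	}
	return timed_out;
}
```

As in the excerpt, a tx counts as timed out once the current time is at or past its deadline. The only intended difference is the status attached to it.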

So for this ticket I plan to push three patches:
1. Modify lnet_health_check() so that NETWORK_TIMEOUT dings both local and remote NI health (this was the original design intent).
2. Modify kiblnd_check_conns() so that it returns NETWORK_TIMEOUT rather than LOCAL_TIMEOUT.
3. Modify the status for unresolvable address or route to return NETWORK_TIMEOUT.

Patch 3 probably needs to be based on top of the patches for LU-13569.
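
The intent of patch 1 can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the real lnet_health_check(): the status enum, the NI struct, the health constants, and the rule that LOCAL_TIMEOUT affects only the local NI are all simplifications made for illustration. All sketch_ and SKETCH_ names are hypothetical.

```c
/* Hypothetical stand-ins for the statuses discussed in this ticket;
 * the real LNet enum has more values. */
enum sketch_status {
	SKETCH_OK,
	SKETCH_LOCAL_TIMEOUT,
	SKETCH_NETWORK_TIMEOUT,
};

/* Hypothetical NI with nothing but a health value. */
struct sketch_ni {
	int health;
};

/* Assumed values, chosen only to make the example concrete. */
#define SKETCH_HEALTH_MAX 1000
#define SKETCH_HEALTH_DEC 100

/* Decrement health by a fixed step, never going below zero. */
static void sketch_ding(struct sketch_ni *ni)
{
	ni->health -= SKETCH_HEALTH_DEC;
	if (ni->health < 0)
		ni->health = 0;
}

/*
 * Model of the intended behavior: NETWORK_TIMEOUT dings both the
 * local and the remote NI, LOCAL_TIMEOUT (by assumption) only the
 * local NI, and a successful send leaves both untouched.
 */
static void sketch_health_check(enum sketch_status status,
				struct sketch_ni *local,
				struct sketch_ni *remote)
{
	switch (status) {
	case SKETCH_NETWORK_TIMEOUT:
		sketch_ding(local);
		sketch_ding(remote);
		break;
	case SKETCH_LOCAL_TIMEOUT:
		sketch_ding(local);
		break;
	case SKETCH_OK:
	default:
		break;
	}
}
```

Under this model, switching a failure path from LOCAL_TIMEOUT to NETWORK_TIMEOUT, as patches 2 and 3 propose, is what lets the remote NI's health reflect the failure.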



 Comments   
Comment by Gerrit Updater [ 14/Sep/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39898
Subject: LU-13571 lnet: Correct handling of NETWORK_TIMEOUT status
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d4a6d8ea328c7b112f0f99027f8acac0c0cf78d5

Comment by Gerrit Updater [ 14/Sep/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39901
Subject: LU-13571 tests: Test health and resends for network timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7e6c5ec549010f3e7afa72cfdf690395b5d76e32

Comment by Gerrit Updater [ 14/Sep/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39899
Subject: LU-13571 lnd: Use NETWORK_TIMEOUT for txs on ibp_tx_queue
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 86905008c043bdbdaadde393d68af1906c93ae22

Comment by Gerrit Updater [ 14/Sep/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39900
Subject: LU-13571 lnd: Use NETWORK_TIMEOUT for some conn failures
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ff01a4476feb652757fcccda036b52581bad15dd

Comment by Gerrit Updater [ 17/Sep/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39965
Subject: LU-13571 tests: Debug sanity-lnet test 210
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 36ebf465fb51cdc4fe461a091290fb5dd836c688

Comment by Gerrit Updater [ 26/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39898/
Subject: LU-13571 lnet: Correct handling of NETWORK_TIMEOUT status
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ffd4523f2d50ef952112f44ffd524af991b4baed

Comment by Gerrit Updater [ 03/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39899/
Subject: LU-13571 lnd: Use NETWORK_TIMEOUT for txs on ibp_tx_queue
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7af63191370fd2337d0bc9045d211b918c61fdd1

Comment by Gerrit Updater [ 03/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39900/
Subject: LU-13571 lnd: Use NETWORK_TIMEOUT for some conn failures
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 12333c1fecc00ed67597f189715a68cbfea7b287

Comment by Peter Jones [ 03/Dec/20 ]

Landed for 2.14
