[LU-16990] Use NETWORK_TIMEOUT message status for TN_EVENT_TAG_RX_CANCEL Created: 27/Jul/23  Updated: 22/Aug/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Chris Horn
Assignee: Chris Horn
Resolution: Unresolved
Votes: 0
Labels: None

Severity: 3

Description

After the MDS hit Inbound wait, I noticed that it was failing transactions with the REMOTE_TIMEOUT health status:

[Wed Mar  8 11:26:29 2023] LNet: 143125:0:(lib-msg.c:814:lnet_health_check()) Message from 17@kfi to 0@kfi exceeded message deadline by 2 seconds rc=-110 hstatus=REMOTE_TIMEOUT

This status comes from the following code:

static int kfilnd_tn_state_wait_timeout_tag_comp(struct kfilnd_transaction *tn,
                                                 enum tn_events event,
                                                 int status, bool *tn_released)
{
        KFILND_TN_DEBUG(tn, "%s event status %d", tn_event_to_str(event),
                        status);

        switch (event) {
        case TN_EVENT_TAG_RX_CANCEL:
                kfilnd_tn_status_update(tn, -ETIMEDOUT,
                                        LNET_MSG_STATUS_REMOTE_TIMEOUT);
                kfilnd_peer_tn_failed(tn->tn_kp, -ETIMEDOUT);
                break;

With the current behavior, the server never realizes that its local NIC is unhealthy; every local NI still reports full health:

[root@s-lmo-gaz38a ~]# lnetctl net show -v 2 | grep -e nid -e health
        - nid: 0@lo
          health stats:
              health value: 1000
        - nid: 172.18.2.3@tcp
          health stats:
              health value: 1000
        - nid: 17@kfi
          health stats:
              health value: 1000
[root@s-lmo-gaz38a ~]#
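
To illustrate why the choice of status matters, here is a minimal, self-contained sketch of the health accounting. It is a toy model, not the LNet source: the names hstatus, struct health, decrement, apply_hstatus and SENSITIVITY are hypothetical, and the decrement of 100 per failure is an assumed value. The mapping itself (REMOTE_TIMEOUT penalizes only the peer NI, NETWORK_TIMEOUT penalizes both the local NI and the peer NI) reflects my understanding of lnet_health_check() and should be checked against the tree you are running.

```c
/* Toy model of LNet health accounting; NOT the Lustre source. */

enum hstatus {                  /* names loosely mirror LNET_MSG_STATUS_* */
        HSTATUS_OK,
        HSTATUS_REMOTE_TIMEOUT,
        HSTATUS_NETWORK_TIMEOUT,
};

#define HEALTH_MAX  1000        /* matches "health value: 1000" above */
#define SENSITIVITY  100        /* assumed penalty per failed message */

struct health {
        int local_ni;           /* health of the local NI, e.g. 17@kfi */
        int peer_ni;            /* health of the remote peer NI */
};

/* Lower a health value by one penalty step, clamping at zero. */
static int decrement(int value)
{
        value -= SENSITIVITY;
        return value < 0 ? 0 : value;
}

/*
 * Assumed mapping: REMOTE_TIMEOUT blames only the peer, while
 * NETWORK_TIMEOUT blames both ends because the failing side is unknown.
 */
static void apply_hstatus(struct health *h, enum hstatus st)
{
        switch (st) {
        case HSTATUS_REMOTE_TIMEOUT:
                h->peer_ni = decrement(h->peer_ni);
                break;
        case HSTATUS_NETWORK_TIMEOUT:
                h->peer_ni = decrement(h->peer_ni);
                h->local_ni = decrement(h->local_ni);
                break;
        case HSTATUS_OK:
                break;
        }
}
```

Under these assumptions, starting from { HEALTH_MAX, HEALTH_MAX } and applying HSTATUS_REMOTE_TIMEOUT leaves local_ni at 1000, which matches the lnetctl output above. Applying HSTATUS_NETWORK_TIMEOUT instead lowers local_ni as well, so a failing local NIC would become visible in the health stats.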


Comments
Comment by Gerrit Updater [ 27/Jul/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51782
Subject: LU-16990 kfilnd: Use NETWORK_TIMEOUT for TAG_RX_CANCEL
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6f41c336b7fae58a0dd84d113466501bd413f5dc

Comment by Gerrit Updater [ 22/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51782/
Subject: LU-16990 kfilnd: Use NETWORK_TIMEOUT for TAG_RX_CANCEL
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b16b9d15a53302cae12e7b93816d3eaceee39276
