Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Description
I noticed that after the MDS hit Inbound wait, it was failing transactions with REMOTE_TIMEOUT health status:
[Wed Mar 8 11:26:29 2023] LNet: 143125:0:(lib-msg.c:814:lnet_health_check()) Message from 17@kfi to 0@kfi exceeded message deadline by 2 seconds rc=-110 hstatus=REMOTE_TIMEOUT
That status comes from this code:
static int kfilnd_tn_state_wait_timeout_tag_comp(struct kfilnd_transaction *tn,
						 enum tn_events event,
						 int status,
						 bool *tn_released)
{
	KFILND_TN_DEBUG(tn, "%s event status %d", tn_event_to_str(event), status);

	switch (event) {
	case TN_EVENT_TAG_RX_CANCEL:
		kfilnd_tn_status_update(tn, -ETIMEDOUT,
					LNET_MSG_STATUS_REMOTE_TIMEOUT);
		kfilnd_peer_tn_failed(tn->tn_kp, -ETIMEDOUT);
		break;
Current behavior means that the server never realizes its local NIC is not healthy:
[root@s-lmo-gaz38a ~]# lnetctl net show -v 2 | grep -e nid -e health
        - nid: 0@lo
          health stats:
              health value: 1000
        - nid: 172.18.2.3@tcp
          health stats:
              health value: 1000
        - nid: 17@kfi
          health stats:
              health value: 1000
[root@s-lmo-gaz38a ~]#
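For anyone who wants to script the check above, here is a minimal C sketch that scans text shaped like that grep output and counts the NIDs whose health value is below full. The function name count_degraded_nids and the FULL_HEALTH constant are hypothetical and not part of Lustre; treating 1000 as "fully healthy" is an assumption based on the values shown in the output above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 1000 means "fully healthy", as in the output above. */
#define FULL_HEALTH 1000

/*
 * Hypothetical helper, not part of Lustre or lnetctl.
 * Scans text containing "nid: <NID>" and "health value: <N>" lines and
 * returns how many NIDs report a health value below FULL_HEALTH.
 * Prints one line per degraded NID. The input buffer is not modified.
 */
static int count_degraded_nids(const char *text)
{
	char nid[64] = "(unknown)";
	int degraded = 0;
	const char *line = text;

	while (*line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);
		char buf[256];
		const char *p;

		/* Copy the current line into a bounded, NUL-terminated buffer. */
		if (len >= sizeof(buf))
			len = sizeof(buf) - 1;
		memcpy(buf, line, len);
		buf[len] = '\0';

		if ((p = strstr(buf, "nid: ")) != NULL) {
			/* Remember the most recent NID for the health line that follows. */
			sscanf(p + 5, "%63s", nid);
		} else if ((p = strstr(buf, "health value: ")) != NULL) {
			long value = strtol(p + 14, NULL, 10);

			if (value < FULL_HEALTH) {
				printf("degraded: %s (health %ld)\n", nid, value);
				degraded++;
			}
		}

		/* Advance to the next line, or to the terminator if this was the last. */
		line = eol ? eol + 1 : line + strlen(line);
	}
	return degraded;
}
```

Fed the output shown above, this returns 0 because every NID, including 17@kfi, still reports 1000. That is the symptom described here: the timeouts are attributed to the remote side, so the local health value never drops.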